Strategies for reuse and sharing among data scientists in software teams

Authors:

Steven M. Drucker

ICSE-SEIP '22: Proceedings of the 44th International Conference on Software Engineering: Software Engineering in Practice

Pages 243 - 252

https://doi.org/10.1145/3510457.3513042

Published: 17 October 2022

Abstract

Effective sharing and reuse practices have long been hallmarks of proficient software engineering. Yet the exploratory nature of data science presents new challenges and opportunities for supporting the sharing and reuse of analysis code. To better understand current practices, we conducted interviews (N=17) and a survey (N=132) with data scientists at Microsoft and extracted five commonly used strategies for sharing and reusing past work: personal analysis reuse, personal utility libraries, team shared analysis code, team shared template notebooks, and team shared libraries. We also identify factors that encourage or discourage data scientists from sharing and reusing. Our participants described obstacles to reuse and sharing, including a lack of incentives to create shared code, difficulties in making data science code modular, and a lack of tool interoperability. We discuss how future tools might help meet these needs.

Published In

ICSE-SEIP '22: Proceedings of the 44th International Conference on Software Engineering: Software Engineering in Practice

May 2022

371 pages

ISBN:9781450392266

DOI:10.1145/3510457

Conference Chairs: Mark Harman (Facebook, Inc & University College London), Heather Miller (Carnegie Mellon University)

Copyright © 2022 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Sponsors

SIGSOFT: ACM Special Interest Group on Software Engineering

In-Cooperation

IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ICSE '22

Sponsor:

SIGSOFT

ICSE '22: 44th International Conference on Software Engineering

May 21 - 29, 2022

Pittsburgh, Pennsylvania
