DOI: 10.1145/3579027.3608974
short-paper

Taming the Diversity of Computational Notebooks

Published: 28 August 2023

Abstract

In many applications of Computational Science, and especially Data Science, notebooks are the cornerstone of knowledge and experiment sharing. They vary along several dimensions (problem addressed, input data, algorithm used, overall quality), yet this diversity is never made explicit. Because notebooks are heavily reused through a clone-and-own approach, tailoring an existing notebook to a specific problem is cumbersome, error-prone, and highly uncertain. In this paper, we propose a tooled approach that captures the different dimensions of variability in computational notebooks. It lets users search for an existing notebook that suits their requirements, or generate most parts of a new one.


Published In

SPLC '23: Proceedings of the 27th ACM International Systems and Software Product Line Conference - Volume A
August 2023
305 pages
ISBN:9798400700910
DOI:10.1145/3579027
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. clone-and-own
  2. computational science
  3. software variability

Qualifiers

  • Short-paper
  • Research
  • Refereed limited

Conference

SPLC '23

Acceptance Rates

Overall Acceptance Rate 167 of 463 submissions, 36%

Article Metrics

  • Total Citations: 0
  • Total Downloads: 81
  • Downloads (last 12 months): 81
  • Downloads (last 6 weeks): 1

Reflects downloads up to 30 Aug 2024
