Caching and Reproducibility: Making Data Science experiments faster and FAIRer

Schubotz, Moritz; Satpute, Ankit; Greiner-Petter, Andre; Aizawa, Akiko; Gipp, Bela

doi:10.3389/frma.2022.861944


arXiv:2211.04049 (cs)

[Submitted on 8 Nov 2022 (v1), last revised 9 Nov 2022 (this version, v2)]




Abstract: Small to medium-scale data science experiments often rely on research software developed ad hoc by individual scientists or small teams. Often there is no time to make the research software fast, reusable, and open access. The consequence is twofold. First, subsequent researchers must spend significant work hours building upon the proposed hypotheses or experimental framework. In the worst case, others cannot reproduce the experiment or reuse the findings for subsequent research. Second, if the ad hoc research software fails during long-running, computationally expensive experiments, the overall effort to iteratively improve the software and rerun the experiments creates significant time pressure on the researchers. We suggest making caching an integral part of the research software development process, even before the first line of code is written. This article outlines caching recommendations for developing research software in data science projects. Our recommendations offer a perspective for circumventing common problems such as proprietary dependencies and slow execution. At the same time, caching contributes to the reproducibility of experiments in the open science workflow. Concerning the four guiding principles of Findability, Accessibility, Interoperability, and Reusability (FAIR), we foresee that including the proposed recommendations in research software development will make the data related to that software FAIRer for both machines and humans. We demonstrate the usefulness of some of the proposed recommendations on our recently completed research software project in mathematical information retrieval.
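
To illustrate the general idea of caching expensive experiment steps so that a rerun after a failure does not repeat finished work, here is a minimal Python sketch. It is not taken from the article: the names `disk_cached`, `expensive_step`, and `CACHE_DIR` are hypothetical, and the sketch assumes that function arguments are JSON-serializable and that results can be pickled.

```python
import hashlib
import json
import pickle
import tempfile
from pathlib import Path


def disk_cached(cache_dir):
    """Decorator that stores each result on disk, keyed by function name and arguments."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    def decorator(func):
        def wrapper(*args, **kwargs):
            # Build a stable key from the call (assumes JSON-serializable arguments).
            payload = json.dumps([func.__name__, args, kwargs], sort_keys=True)
            key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
            path = cache_dir / f"{key}.pkl"
            if path.exists():
                # Cache hit: load the stored result instead of recomputing it.
                with path.open("rb") as fh:
                    return pickle.load(fh)
            result = func(*args, **kwargs)
            # Write to a temporary file first so a crash cannot leave a partial entry.
            tmp = path.with_suffix(".tmp")
            with tmp.open("wb") as fh:
                pickle.dump(result, fh)
            tmp.replace(path)
            return result

        return wrapper

    return decorator


# Hypothetical usage: a stand-in for a long-running experiment step.
CACHE_DIR = Path(tempfile.mkdtemp(prefix="experiment_cache_"))
computed = []  # records which inputs were actually computed


@disk_cached(CACHE_DIR)
def expensive_step(n):
    computed.append(n)
    return sum(i * i for i in range(n))


first = expensive_step(1000)
second = expensive_step(1000)  # served from disk; the function body does not run again
```

Because the cache lives on disk rather than in memory, it survives a crashed process, and the stored files can be shared alongside the code so that others can inspect or reuse intermediate results.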

Comments:	8 pages, 1 table
Subjects:	Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2211.04049 [cs.SE]
	(or arXiv:2211.04049v2 [cs.SE] for this version)
	https://doi.org/10.48550/arXiv.2211.04049
Journal reference:	Frontiers in Research Metrics and Analytics, volume 7, 2022
Related DOI:	https://doi.org/10.3389/frma.2022.861944

Submission history

From: Bela Gipp
[v1] Tue, 8 Nov 2022 07:11:02 UTC (584 KB)
[v2] Wed, 9 Nov 2022 14:45:50 UTC (584 KB)
