Abstract
Advances in computing infrastructure and instrumentation have accelerated scientific discovery while also driving explosive growth in data volumes. Unfortunately, the lack of equally advanced data management infrastructure has led to ad hoc practices that diminish scientific productivity and exacerbate the reproducibility crisis. We discuss a system-wide solution that supports data management needs at every stage of the data lifecycle. At the center of this system is DataFed, a general-purpose scientific data management system that addresses these challenges by federating data storage across facilities with central metadata and provenance management, providing simple and uniform data discovery, access, and collaboration capabilities. At the edge is a Data Gateway that captures raw data and context from experiments (even when performed on off-network instruments) into DataFed. DataFed can be integrated into analytics platforms so that users can work with datasets easily, correctly, and reliably, improving the reproducibility of such workloads. We believe that this system can significantly alleviate the burden of data management and improve compliance with the Findable, Accessible, Interoperable, Reusable (FAIR) data principles, thereby improving scientific productivity and rigor.
D. Stansberry et al.—Contributed Equally
This manuscript has been co-authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).
Acknowledgments
This research used resources of the Oak Ridge Leadership Computing Facility (OLCF) and of the Compute and Data Environment for Science (CADES) at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
Copyright information
© 2020 This is a U.S. government work and not under copyright protection in the U.S.; foreign copyright protection may apply
About this paper
Cite this paper
Stansberry, D., Somnath, S., Shutt, G., Shankar, M. (2020). A Systemic Approach to Facilitating Reproducibility via Federated, End-to-End Data Management. In: Nichols, J., Verastegui, B., Maccabe, A., Hernandez, O., Parete-Koon, S., Ahearn, T. (eds) Driving Scientific and Engineering Discoveries Through the Convergence of HPC, Big Data and AI. SMC 2020. Communications in Computer and Information Science, vol 1315. Springer, Cham. https://doi.org/10.1007/978-3-030-63393-6_6
DOI: https://doi.org/10.1007/978-3-030-63393-6_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-63392-9
Online ISBN: 978-3-030-63393-6
eBook Packages: Computer Science, Computer Science (R0)