A Systemic Approach to Facilitating Reproducibility via Federated, End-to-End Data Management

Stansberry, Dale; Somnath, Suhas; Shutt, Gregory; Shankar, Mallikarjun

doi:10.1007/978-3-030-63393-6_6

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1315)

Included in the following conference series:

Smoky Mountains Computational Sciences and Engineering Conference

Abstract

Advances in computing infrastructure and instrumentation have accelerated scientific discovery while sharply increasing data volumes. Unfortunately, the lack of equally advanced data management infrastructure has led to ad hoc practices that diminish scientific productivity and exacerbate the reproducibility crisis. We discuss a system-wide solution that supports management needs at every stage of the data lifecycle. At the center of this system is DataFed, a general-purpose scientific data management system that addresses these challenges by federating data storage across facilities with central metadata and provenance management, providing simple and uniform data discovery, access, and collaboration capabilities. At the edge is a Data Gateway that captures raw data and context from experiments (even when performed on off-network instruments) into DataFed. DataFed can be integrated into analytics platforms so that they work with datasets easily, correctly, and reliably, which improves the reproducibility of such workloads. We believe that this system can significantly alleviate the burden of data management and improve compliance with the Findable, Accessible, Interoperable, Reusable (FAIR) data principles, thereby improving scientific productivity and rigor.

D. Stansberry et al. contributed equally.

This manuscript has been co-authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

Acknowledgments

This research used resources of the Oak Ridge Leadership Computing Facility (OLCF) and of the Compute and Data Environment for Science (CADES) at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

Author information

Authors and Affiliations

Oak Ridge National Laboratory, Oak Ridge, TN, 37831, USA
Dale Stansberry, Suhas Somnath, Gregory Shutt & Mallikarjun Shankar

Corresponding author

Correspondence to Dale Stansberry.

Editor information

Editors and Affiliations

Oak Ridge National Laboratory, Oak Ridge, TN, USA
Jeffrey Nichols
Oak Ridge National Laboratory, Oak Ridge, TN, USA
Becky Verastegui
Oak Ridge National Laboratory, Oak Ridge, TN, USA
Arthur ‘Barney’ Maccabe
Oak Ridge National Laboratory, Oak Ridge, TN, USA
Oscar Hernandez
Oak Ridge National Laboratory, Oak Ridge, TN, USA
Suzanne Parete-Koon
Oak Ridge National Laboratory, Oak Ridge, TN, USA
Theresa Ahearn

Cite this paper

Stansberry, D., Somnath, S., Shutt, G., Shankar, M. (2020). A Systemic Approach to Facilitating Reproducibility via Federated, End-to-End Data Management. In: Nichols, J., Verastegui, B., Maccabe, A., Hernandez, O., Parete-Koon, S., Ahearn, T. (eds) Driving Scientific and Engineering Discoveries Through the Convergence of HPC, Big Data and AI. SMC 2020. Communications in Computer and Information Science, vol 1315. Springer, Cham. https://doi.org/10.1007/978-3-030-63393-6_6

DOI: https://doi.org/10.1007/978-3-030-63393-6_6
Published: 18 December 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-63392-9
Online ISBN: 978-3-030-63393-6
eBook Packages: Computer Science (R0)
