A Systemic Approach to Facilitating Reproducibility via Federated, End-to-End Data Management

  • Conference paper
Driving Scientific and Engineering Discoveries Through the Convergence of HPC, Big Data and AI (SMC 2020)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1315)

Abstract

Advances in computing infrastructure and instrumentation have accelerated scientific discovery while also causing data volumes to explode. Unfortunately, the lack of equally advanced data management infrastructure has led to ad hoc practices that diminish scientific productivity and exacerbate the reproducibility crisis. We discuss a system-wide solution that supports data management needs at every stage of the data lifecycle. At the center of this system is DataFed, a general-purpose scientific data management system that addresses these challenges by federating data storage across facilities while managing metadata and provenance centrally, which provides simple and uniform data discovery, access, and collaboration capabilities. At the edge is a Data Gateway that captures raw data and context from experiments (even when performed on off-network instruments) into DataFed. DataFed can be integrated into analytics platforms so that datasets can be handled easily, correctly, and reliably, improving the reproducibility of such workloads. We believe that this system can significantly alleviate the burden of data management and improve compliance with the Findable, Accessible, Interoperable, Reusable (FAIR) data principles, thereby improving scientific productivity and rigor.

D. Stansberry et al.—Contributed Equally

This manuscript has been co-authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

Acknowledgments

This research used resources of the Oak Ridge Leadership Computing Facility (OLCF) and of the Compute and Data Environment for Science (CADES) at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

Author information

Corresponding author

Correspondence to Dale Stansberry.

Copyright information

© 2020 This is a U.S. government work and not under copyright protection in the U.S.; foreign copyright protection may apply

About this paper

Cite this paper

Stansberry, D., Somnath, S., Shutt, G., Shankar, M. (2020). A Systemic Approach to Facilitating Reproducibility via Federated, End-to-End Data Management. In: Nichols, J., Verastegui, B., Maccabe, A., Hernandez, O., Parete-Koon, S., Ahearn, T. (eds) Driving Scientific and Engineering Discoveries Through the Convergence of HPC, Big Data and AI. SMC 2020. Communications in Computer and Information Science, vol 1315. Springer, Cham. https://doi.org/10.1007/978-3-030-63393-6_6

  • DOI: https://doi.org/10.1007/978-3-030-63393-6_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-63392-9

  • Online ISBN: 978-3-030-63393-6

  • eBook Packages: Computer Science, Computer Science (R0)
