Reproducible big data science: A case study in continuous FAIRness

doi:10.5281/zenodo.1310034

Published July 11, 2018 | Version v1

Other Open

Madduri, Ravi¹

1. Uni

Big biomedical data create exciting opportunities for discovery but make it difficult to capture analyses and outputs in forms that are findable, accessible, interoperable, and reusable (FAIR). In response, we describe tools that make it easy to capture, and assign identifiers to, data and code throughout the data lifecycle. We illustrate the use of these tools via a case study [1] involving a multi-step analysis that creates an atlas of putative transcription factor binding sites from terabytes of ENCODE DNase I hypersensitive sites sequencing data. We show how the tools automate routine but complex tasks, capture analysis algorithms in understandable and reusable forms, and harness fast networks and powerful cloud computers to process data rapidly, all without sacrificing usability or reproducibility, thus ensuring that big data are not hard-to-(re)use data.

In this talk, we will describe the enhancements made to the Galaxy framework to support three capabilities: working with datasets referenced by minids [2], analyzing BagIt-based research objects called BDBags [2], and executing software encapsulated in Docker containers with unique identifiers. We will also describe the tools and services developed to create end-to-end reproducible analysis pipelines that adhere to the FAIR principles [3].
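To make the BagIt idea concrete, the following is a minimal sketch of the basic layout that BDBags build on: a `bagit.txt` declaration, a `data/` payload directory, and a checksum manifest that lets a recipient verify the payload. It uses only the Python standard library and is not the BDBag tooling or the Galaxy integration described in the talk. The function names (`make_bag`, `validate_bag`, `demo`), the example file `peaks.bed`, the choice of SHA-256, and the version string are assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_bag(bag_dir: Path, payload: dict[str, bytes]) -> None:
    """Write a minimal BagIt layout: bagit.txt, data/, manifest-sha256.txt.

    Hypothetical helper for illustration; real bags usually carry more
    metadata (for example bag-info.txt, and fetch.txt for remote files).
    """
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    # The bag declaration; the version string here is an assumption.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    lines = []
    for name, content in sorted(payload.items()):
        target = data_dir / name
        target.write_bytes(content)
        # Each manifest line pairs a checksum with a path relative to the bag.
        lines.append(f"{sha256_of(target)}  data/{name}")
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(lines) + "\n", encoding="utf-8"
    )


def validate_bag(bag_dir: Path) -> bool:
    """Recompute every payload checksum and compare it with the manifest."""
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        expected, rel_path = line.split(maxsplit=1)
        file_path = bag_dir / rel_path
        if not file_path.is_file() or sha256_of(file_path) != expected:
            return False
    return True


def demo() -> tuple[bool, bool]:
    """Build a bag, validate it, tamper with the payload, validate again."""
    with tempfile.TemporaryDirectory() as tmp:
        bag = Path(tmp) / "example_bag"
        make_bag(bag, {"peaks.bed": b"chr1\t100\t200\n"})
        ok_before = validate_bag(bag)
        (bag / "data" / "peaks.bed").write_bytes(b"tampered\n")
        ok_after = validate_bag(bag)
        return ok_before, ok_after
```

Calling `demo()` validates a freshly written bag and then validates it again after the payload file has been overwritten. The second check fails because the recomputed checksum no longer matches the manifest, which is the property that lets a bag be exchanged and verified as a unit.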

[1] Madduri RK, Chard K, D’Arcy M, Jung SC, Rodriguez A, Sulakhe D, Deutsch EW, Funk C, Heavner B, Richards M, Shannon P, Glusman G, Price N, Kesselman C, Foster I. Reproducible big data science: A case study in continuous FAIRness. bioRxiv 268755; doi: https://doi.org/10.1101/268755
[2] Chard K, D’Arcy M, Heavner B, Foster I, Kesselman C, Madduri R, et al. I’ll take that to go: Big data bags and minimal identifiers for exchange of large, complex datasets. In: IEEE International Conference on Big Data; 2016. p. 319–328.
[3] Wilkinson MD, Dumontier M, Aalbersberg IJ, Appleton G, Axton M, Baak A, et al. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data. 2016;3:160018.

Files

IEEE_RO2018_Abstract.pdf (33.7 kB), md5:63ed345f494bb58dc8019b287252f301
