
Kanwal et al. BMC Bioinformatics (2017) 18:337
DOI 10.1186/s12859-017-1747-0

RESEARCH ARTICLE    Open Access

Investigating reproducibility and tracking provenance – A genomic workflow case study

Sehrish Kanwal1*†, Farah Zaib Khan1*†, Andrew Lonie2 and Richard O. Sinnott1

* Correspondence: kanwals@unimelb.edu.au; khanf1@unimelb.edu.au
† Equal contributors
1Department of Computing and Information Systems, The University of Melbourne, Melbourne, VIC 3010, Australia
Full list of author information is available at the end of the article

Abstract

Background: Computational bioinformatics workflows are extensively used to analyse genomics data, with different approaches available to support the implementation and execution of these workflows. Reproducibility is one of the core principles for any scientific workflow, yet it remains a challenge that is not fully addressed. This is due to an incomplete understanding of the reproducibility requirements and of the assumptions implicit in workflow definition approaches. Provenance information should be tracked and used to capture all these requirements, supporting the reusability of existing workflows.

Results: We have implemented a complex but widely deployed bioinformatics workflow using three representative approaches to workflow definition and execution. Through this implementation, we identified assumptions implicit in these approaches that ultimately produce insufficient documentation of workflow requirements, resulting in failed execution of the workflow. This study proposes a set of recommendations that aim to mitigate these assumptions and guide the scientific community towards accomplishing reproducible science, hence addressing the reproducibility crisis.

Conclusions: Reproducing, adapting or even repeating a bioinformatics workflow in any environment requires substantial technical knowledge of the workflow execution environment, resolution of analysis assumptions and rigorous compliance with reproducibility requirements. Towards these goals, we propose conclusive recommendations that, along with an explicit declaration of the workflow specification, would result in enhanced reproducibility of computational genomic analyses.

Keywords: Reproducibility, Provenance, Workflow, Galaxy, Cpipe, Common Workflow Language (CWL)

Background
Recent rapid evolution in the field of genomics, driven by advances in massively parallel DNA sequencing technologies, and the uptake of genomics as a mechanism for clinical genetic testing, have resulted in high expectations from clinicians and the biomedical community at large regarding the reliable, reproducible, effective and timely use of genomic data to realise the vision of personalized medicine and improved understanding of various diseases. There has been a contemporaneous recent upsurge in the number of techniques and platforms developed to support genomic data analysis [1]. Computational bioinformatics workflows are used extensively within these platforms (Fig. 1). Typically, a bioinformatics analysis of genomics data involves processing files through a series of steps and transformations, called a workflow or a pipeline. Usually, these steps are performed by deploying third party GUI or command line based software capable of implementing robust pipelines.

Significant informatics knowledge, resources, tools and expertise are required to design workflows for the analysis and interpretation of sequencing data to ultimately obtain highly specific knowledge that can be translated into clinical settings. Through efforts such as the 1000 Genomes project [1] and aligned approaches for analysis of Next Generation Sequencing (NGS) data [2], a variety of best practices for variant discovery are now available. The resulting knowledge should be unambiguous and consistent, repeatable (defined as a researcher redoing their own experiment/analysis in the same environment
for the same result/outcome [3]), and reproducible (defined as an independent researcher/lab confirming or redoing that experiment/analysis, potentially in a different environment [3]), and eventually translatable into a clinical (healthcare) context. Bioinformatics data analysis and variant discovery - in the clinical domain in particular - requires provenance [4] information to be captured for reproducibility. This provenance capture should store details of each workflow execution, including the software versions, the software parameters, and the information, including data, produced at each workflow step [5].

Fig. 1 Computational bioinformatics workflows are often deployed to deal with the data processing bottleneck. A typical workflow consists of a series of linked steps that transform raw input (e.g. a fastq file produced as a result of NGS data) into meaningful or interpretable output (e.g. variant calls). Typically, these steps are performed by specific tools developed to tackle a particular functional aspect of genomic sequence analysis. Workflows can have a variable number of steps depending on the type of analysis performed, hence can be simple or complex

The reproducibility of scientific research is becoming increasingly important for the scientific community, as validation of scientific claims is a first step for any translational effort. The standards of computational reproducibility are especially relevant in clinical settings following the establishment of Next Generation Sequencing (NGS) approaches. It has become crucial to optimize NGS data processing and analysis to keep pace with the exponentially increasing genomics data production. The ability to determine DNA sequences has outrun the ability to store, transmit and interpret this data. Hence, the major bottleneck in supporting complex experiments involving NGS data is data processing rather than data generation. Computational bioinformatics workflows consisting of various community generated tools [6] and libraries [7, 8] are often deployed to deal with the data processing bottleneck.

Despite the large body of published literature on the use and importance of -omics data, only a few findings have actually been translated into clinical settings [9]. The committee on the review of -omics-based tests for predicting patient outcomes in clinical trials [10] attributed this limited translation to two primary causes: inadequate design of the preclinical studies and weak bioinformatics rigour. The scientific community has paid special attention to benchmarking -omics analysis to establish transparency and reproducibility of bioinformatics studies [11]. Nekrutenko and Taylor [12] discussed important issues of accessibility, interpretation and reproducibility for analysis of NGS data. Only ten out of 299 articles that cited the 1000 Genomes project as their experimental approach used the recommended tools, and only four studies used the full workflow. Out of 50 randomly selected papers that cited BWA [13] for the alignment step, only seven studies provided complete information about the parameter settings and version of the tool. The unavailability of primary data from two cancer studies [14] was a barrier to achieving biological reproducibility of the claimed results. Ioannidis et al. [15] attributed unavailability of data, software and annotation details as reasons for non-reproducibility of microarray gene expression studies. Hothorn et al. [16] found that only 11% of the articles conducting simulation experiments provided access to both data and code. The authors reviewing 100 Bioinformatics journal papers [17] claimed that along with the textual descriptions, availability of valid data and code for analysis is crucial for reproducibility of results. Moreover, the majority of papers that explained the software environment failed to mention version details, which made it difficult to reproduce these studies.

To facilitate genomic data analysis, various Workflow Management Systems (WMS) are specifically designed
and available to meet the challenges associated with such data [18]. Typically, WMS are designed to support the automation of data-driven repetitive tasks in addition to capturing the complex analysis processes associated with data processing steps. Having sufficient provenance information plays a major role in understanding the data processing steps incorporated in a workflow and ensures the consistency of the results with the known (current) best practice [19]. Ludäscher et al. [20] reviewed common requirements of any scientific workflow, most of which (such as data provenance, reliability and fault-tolerance, smart reruns and smart semantic links) are directly linked to provenance capture. In addition to workflow evolution [21], prospective (defined as the specification of the workflow used in an analysis) as well as retrospective (defined as the run time environment of an execution of the workflow in an analysis) provenance [22] was identified as an essential requirement for every computational process in a workflow to achieve reproducibility of a published analysis and ultimately accountability in case of inconsistent results. Several provenance models have been proposed and implemented to support retrospective and prospective provenance [23–25], but these are seldom used by the WMS used in genomic studies. Despite high expectations, various existing WMS [26–30] do not truly preserve all the provenance information necessary to support reproducibility - particularly to the standards that might be expected for clinical genomics.

The inability to reproduce and use exactly the same procedures/workflows means that considerable effort and time is required to reproduce results produced by others [12, 16, 17, 31]. At present, the consolidation of expertise and best practice workflows that support reproducibility is not mature. Most of the time, this is due to the lack of understanding of reproducibility requirements and incomplete provenance capture, which can make it difficult for other researchers to reuse existing work. The sustainability of clinical genomics research requires that reproducibility of results goes hand-in-hand with data production. We, as the scientific community, need to address this gap by proposing and implementing practices that can ensure reproducibility, confirmation and ultimately extension of existing work.

Towards these objectives, this work contributes a classification of the available approaches to workflow definition. Further, we identified assumptions implicit in the investigated representative workflow definition approaches. In a previous study [19] that investigated challenges of large scale biomedical workflows, we proposed that, for reproducibility of science, one of the most important steps is to record a summary of the assumptions followed in the workflow. To address the aforementioned assumptions, we propose a generalised set of recommendations to researchers, which can be used to mitigate the challenges associated with incomplete documentation of an analysis, hence supporting reproducibility.

We have implemented a complex yet widely used exemplar variant calling workflow [32] using three approaches to workflow definition (detailed in section Approaches to workflow definition and implementation) to identify assumptions implicit in these approaches. The intricate underlying details associated with workflow implementation, considered needless to state, lead to various factors often hidden from the user. In this study, we refer to such factors as assumptions and investigate workflow definition approaches to highlight these assumptions, which lead to limited or no understanding of reproducibility requirements due to a lack of documentation and a comprehensive provenance trace. Our study proposes a generalised set of recommendations for bioinformatics researchers to minimise such assumptions, hence supporting reproducibility and the validity of genomic workflow studies.

Methods
We have implemented an end-to-end complex variant calling workflow based on the Genome Analysis Toolkit (GATK) [32] recommended best practices, using three different exemplar approaches to workflow definition: Galaxy [27], Cpipe [33] and CWL [34]. The GATK best practice variant discovery workflow was selected because it provides clear, community advocated, step-by-step recommendations for executing variant discovery analysis with high throughput sequencing data on human germline samples. The next section will broadly discuss the classified approaches typically followed for workflow design and implementation and justify our choices for the systems used in this case study.

Approaches to workflow definition and implementation
In this section, we classify approaches to workflow definition and implementation into three broad categories. Specifically, these categories have been devised on the basis of the current most common practices in computational genomic analysis: pre-built pipelines driven by individual laboratories or groups; pre-configured graphical interface based workbenches; and standardized workflow description implementations. This categorisation provides the basis for the selection of the exemplar workflow systems investigated in this study.

Bioinformatics specific pre-built pipelines
Several automated bioinformatics-specific pipelines such as Cpipe [33], bcbio-nextgen [35] and others [36, 37] have been developed using command line tools to support genomic data analysis. These pipelines are driven
and supported by individual laboratories, which have developed customized pipelines for processing data. This approach has resulted in considerable variability in the methods used for data interpretation and processing. The advantages of these pipelines include editing pipelines on remote servers without requiring access to a GUI, so that they are easily administered through source code management tools [38]. However, the command line based pipeline frameworks such as bpipe [39], Snakemake [38] and Ruffus [40] used to develop these systems are not flexible enough to support integration of new user-defined steps and analysis tools. Working with such systems requires expertise in command-line programming and broad computational knowledge, as these systems extensively use individual scripts to tie together the different components of the pipelines. These scripts control variables, dependencies and conditional logic for the efficient processing of the data and hence are often difficult to reproduce. These systems assume the provision of the same physical or virtualized infrastructure used to run the initial analysis, including scripts, test data, tools, reference data and databases. The implementation overheads of such pipelines include configuration and installation of software packages, parameter setting alteration, debugging and input/output interfacing. In summary, considerable effort and an excessive amount of time is required to create, understand and reproduce a ready-to-use pipeline.

Graphical User Interface (GUI) based integrative workbenches
To tackle some of the challenges of pipelines created using command-line interface based pipeline frameworks, workbenches [20, 27, 30, 41–43] have been developed to allow easy and customised workflow definitions using a GUI. A few of the referenced workbenches assist researchers in specifying the goals, requirements and constraints for workflows using semantic reasoning, hence automating and validating complex data processing tasks [44]. Semantic workflow management systems support setting up an analysis by providing parameter preferences, alternate software tools and relevant datasets built upon the analytic constraints articulated by the user, resulting in access to domain specific expertise for workflow design and configuration [45]. The semantic descriptions expect complex validation rules for input and output data objects, hence they haven't been widely adopted because of the complications involved in modelling systems, the rapid evolution of semantic web services and the majority of existing approaches adopting a non-semantic approach [46]. GUI based workbenches are typically expected to be highly featured and pre-configured with modular tools to offer interactive design to a wide range of audiences with varying degrees of expertise. These often include reference datasets and configuration settings to aid users in designing automated and robust pipelines that provide managed access to a library of systems, with abstraction of the interaction layer, and are equipped with a workflow layer that captures tool versions and parameter information. The GUI workbenches can be easily used with already existing tools, but adding a new tool (plug in) or executable wrapper requires in-depth familiarity with acceptable input file types, parameter settings, exception handling and resource management.

However, these systems do not require any local installation of analysis tools or customisation of the analysis environment, hence they have lower infrastructure maintenance costs. On the other hand, the reliance on external services and customised tool repositories poses a risk to reproducibility, as it will be impossible to reproduce a workflow created using a service which has been changed or is no longer available. Similarly, workflows implemented on one system may not be reproducible when imported into another system due to incompatibility between locally customised environments.

Standardized approach to workflow definition
The heterogeneity in the field of in silico genomic analysis has motivated researchers to work towards standardized workflow description languages such as Common Workflow Language (CWL) and Workflow Description Language (WDL) [47]. A variety of software platforms, from individual workstations to high performance computing platforms (cloud, grid or cluster), can be deployed to implement these systems. Such systems provide a formal specification covering all aspects of a workflow implementation, including tool versions, input data, customizable parameter settings and the workflow runtime environment, that is completely independent of the underlying compute environment. Such approaches provide software specifications that help researchers define and implement portable, easy to use and reproducible workflows. These specifications aim to describe a data and execution model allowing users to have full control over creating and running the workflow by explicitly declaring the relevant environment, resources and other customizable settings in the specification.

Case study
To comprehensively understand and identify assumptions that are implicit in the approaches detailed in section Approaches to workflow definition and implementation, we consider the impact of reproducibility requirements on real-world genomic systems. To this end, we have implemented an end-to-end complex variant calling workflow based on the GATK recommended best practices, using three exemplar workflow definition approaches: Galaxy,
Cpipe and CWL, as major representatives of the existing workflow systems used to analyse genomics data. Galaxy, an example of a graphical user interface based integrative framework, is an open source, web-based platform for accessible, reproducible and transparent genomics research. It supports degrees of workflow provenance, with a focus on assisting the capture of the computational methods that are used. Cpipe, an exemplar of bioinformatics specific prebuilt pipelines adopted by the Melbourne Genomics Health Alliance, uses a programmatic approach which effectively includes everything necessary for reproducing a given genomic analysis, provided the same physical or virtualized infrastructure used to run the initial analysis is available, including scripts, test data, tools, reference data and databases. CWL, an exemplar of the declarative approach to workflow definition, enables full control for users in creating and running the workflow using a specification which is a standard descriptor for the relevant environment, command line tools and other customizable settings, hence making very few internal assumptions about the requirements of the workflow. We have restricted our case study to one representative system from each category defined in section Approaches to workflow definition and implementation.

We used chromosome 21 data for this study. It was extracted from the Genome in a Bottle dataset NA12878, which is widely used as test data because of the pre-existing and extensive analysis done on this sample and the agreed variant call truth set [48] that can be used for comparative evaluation. Other files required for the GATK workflow execution include the human reference genome (hg19.fasta) and the known variant (vcf) reference files (available at https://github.com/skanwal/GATK-CaseStudy), which were obtained from the resource bundle provided by the Broad Institute.1

Workflow enactment using the selected systems
This section details the enactment process of the GATK variant calling workflow using the three exemplar workflow definition approaches. We elaborate the assumptions implicit in each approach while dealing with various workflow features.

Cpipe
Cpipe belongs to the category of bioinformatics specific prebuilt pipelines. It was deployed on the National eResearch Collaboration Tools and Resources (NeCTAR) research cloud.2 The instructions on the official Cpipe GitHub page [49] were followed to set up the pipeline.

• The instance launched for executing Cpipe had 16 cores and 64GB RAM. An automated mechanism to document and convey the compute requirement for a specific customized analysis is not defined. Rather, the prebuilt pipelines presume availability of sufficient compute power to deal with data intensive steps such as sequence alignment.
• To cater for the storage requirement of the pipeline, a 1000GB volume was mounted to the cloud instance. Similar to the compute requirement, there is no automated mechanism for explicitly recording the storage requirement. As genomic sequence analysis involves dealing with huge input and intermediate datasets (including whole genome reference data), the prebuilt pipelines assume availability of sufficient capacity to deal with data storage requirements. (A minimal sketch of how such requirements could be recorded explicitly is shown after this list.)
• The installation script provided with Cpipe compiled tools such as BWA and downloaded databases such as the Variant Effect Predictor (VEP) and human reference sequence files. The prebuilt pipelines connect to online resources to download and compile the tools and reference datasets used in the analysis. FTP clients and SSH transfer tools are used for moving datasets over distributed resources. The availability of high performance networking infrastructure is assumed for moving bulk data over a wide area network (WAN).
• The base software dependencies for underlying programming frameworks such as Java and Python were required to execute tools in Cpipe. The prebuilt pipelines assume that users are responsible for resolving the base software dependencies of the pipeline; otherwise the pipeline will fail to execute.
• Cpipe requires downloading and pre-processing the reference data set to generate secondary files, since the indexing step is not explicitly defined as part of the pipeline but is included in a separate script. The pre-built pipelines expect users to perform pre-processing steps and hence assume that input data files will be made available before execution of the pipeline.
• Cpipe uses a copyrighted tool, ANNOVAR, for annotating variant calls. The prebuilt pipelines deploying copyrighted or proprietary tools, instead of open source software, assume that users will ensure the availability of all such licensed resources.
• Cpipe requires a specific directory structure in order to execute the analysis on any sample. As the prebuilt pipelines are customized to support explicit analysis requirements, they assume the availability of a specific analysis environment with a set directory structure, having tools and datasets appropriately located to support seamless execution of the pipeline. Files and tools are expected to be placed according to a particular file system hierarchy, since paths are hard coded in the scripts.
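To make the first two assumptions concrete, the following minimal sketch (ours, not part of Cpipe or any existing pipeline framework) shows how a pipeline could record the compute and storage context of a run in a machine-readable file. The file name run_requirements.json and the declared values are illustrative, matching the instance used in this case study.

```python
import json
import multiprocessing
import platform
import shutil

def record_compute_context(output_path="run_requirements.json", data_dir="."):
    """Capture the compute and storage context of a pipeline run.

    Writes a machine-readable summary that can be published alongside the
    workflow, so the resources the analysis actually had available are
    documented rather than assumed.
    """
    total, _used, free = shutil.disk_usage(data_dir)
    context = {
        "platform": platform.platform(),
        "python_version": platform.python_version(),
        "cpu_cores": multiprocessing.cpu_count(),
        "disk_total_gb": round(total / 1e9, 1),
        "disk_free_gb": round(free / 1e9, 1),
        # Declared (not measured) requirements; the values are illustrative,
        # taken from the cloud instance used in this case study.
        "declared_requirements": {"cores": 16, "ram_gb": 64, "storage_gb": 1000},
    }
    with open(output_path, "w") as fh:
        json.dump(context, fh, indent=2)
    return context

if __name__ == "__main__":
    print(record_compute_context())
```

A researcher attempting to reproduce the analysis could then compare this record against their own platform before committing compute time.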


Galaxy
Galaxy was selected from the category of GUI based integrative workbenches to implement the GATK variant calling workflow (Fig. 1-Additional file 1).3 The Genome Virtual Laboratory (GVL) [50] was used for launching a pre-configured Galaxy instance on an OpenStack-based cloud environment.

• Specifically, a GVL 8-core cloud instance with 32GB RAM was launched to provision a fully configured Galaxy for the analysis of NA12878. Similar to prebuilt pipelines, GUI based workbenches also assume the availability of sufficient compute power to process data, hence they depend on the user for the provision of these resources. The Galaxy workbench lacks a prior check to ensure availability of sufficient compute resources.
• A 1000GB volume was mounted to the GVL instance launched. GUI based workbenches require the user to provide sufficient storage capacity to deal with the data storage, hence the workflows built using these workbenches have little or no explicit declaration of such requirements.
• Chromosome 21 fastq files, known variant vcf databases and the hg19 reference sequence fasta file (provided in the supplementary material) were uploaded to Galaxy. Galaxy uses inbuilt reference files if not provided by users, but other databases are expected to be provided by users. Even if a complete workflow built on such systems is published, not only is the provision of input data the user's responsibility, but these systems also assume the availability of supporting data (such as reference sequence and variant databases) to generate results.
• During implementation, it was observed that Galaxy automatically performs certain steps without explicitly declaring them, such as indexing the provided reference genome, creating index files for the BAM output file (using Picard [51] mark duplicates), generating a temporary reference sequence dictionary as part of the local realignment steps, and creating a fasta index file for GATK tools (Fig. 2). GUI based workbenches simplify the interface and facilitate the user by hiding the underlying details. This results in an inability to replicate or reproduce the same workflow due to incomplete or implicit documentation.
• In Galaxy, reference sequence indexing, SAM to BAM conversion and sorting of the resulting BAM file are embedded in the alignment step and do not appear in the final workflow diagram. The visual dataflow diagram produced by such systems is assumed to be a complete picture of the processes carried out during a workflow execution. The absence of an entire step of pre-processing, processing or post-processing of data from the workflow details, especially from the visual representation, leads to incomplete workflow knowledge when reproduction is attempted.
• The Galaxy toolshed is populated with tools configured using XML specifications, requiring technical and extensive programming expertise to write XML configuration files for tool versions that are not available in the toolshed. A Galaxy workflow requires availability of uniform toolsheds across Galaxy instances; therefore a workflow created using particular tool versions on one instance will fail to execute on instances with a toolshed supporting different tool versions. This renders it inflexible, static and a challenge to reproducibility. The workflow developers assume uniformity of tool repositories across different instances of a workbench. Hence, workflows created using GUI based workbenches are tied to the specific versions of the tools used to declare the workflows, and the absence of these specific tool versions will result in failed execution of the workflow.

CWL
CWL aims for a standardized approach to workflow definition. It was cloned and installed following the instructions from the GitHub repository [52].

• A reference implementation of CWL designed specifically for Python 2.7 was cloned and installed following the directions from the GitHub repository manual.4 The availability of the specific underlying language and its particular version for the reference implementation (Python in this case) is assumed for successful installation and functioning of the reference implementation.
• Working with CWL was challenging compared to Cpipe and Galaxy because it is an ongoing, constantly developing community effort, and tool wrappers for most of the tools required for this study were not available. Implementing the GATK workflow in CWL required knowledge of Yet Another Markup Language (YAML) and JavaScript Object Notation (JSON) for the development of a number of CWL definition files, including YAML tool wrappers, JSON job files containing the input parameters, and YAML test files for conformance tests (Fig. 2-Additional file 1).5 It is assumed that any user wanting to utilise these definition files along with the workflow definition has a basic understanding of YAML and JSON (an illustrative job file sketch follows this list). In addition, if a newer version or a different tool is required for any step, the user is expected to
develop the definition files, for which in-depth knowledge of the underlying languages is required. Therefore, the standardized approaches on the one hand provide users with the freedom to declare every aspect of the workflow, but on the other hand assume implicit knowledge of the underlying languages, leading to steep learning curves for naïve users.
• The workflow implementation used tools such as BWA, GATK and the Picard Toolkit, which were provided through container-based Docker6 images including all required software packages. This step required installation of Docker, which again was assumed to be available on the system executing the workflow. Although CWL encourages the use of Docker, it also facilitates the local installation of required tools, which should not be preferred as it will lead to localised solutions that fail to execute elsewhere. In both cases, certain assumptions were made regarding the availability of the underlying tools and their link with the tool definition. Hence, the standardized approaches, despite making efforts to explicitly declare every step of the workflow, assume the underlying software availability for the enactment of a workflow, which is not always the case.
• As genomic workflows usually involve working with large datasets, the availability of compute and storage resources is assumed to be managed by users to successfully enact workflows.

Fig. 2 Screenshots of the Galaxy interface showing (a) a temporary sequence dictionary file creation using CreateSequenceDictionary as part of the RealignTargetCreator and IndelRealigner step and (b) the "Map with BWA-MEM" step combining indexing of the reference data, SAM to BAM conversion and sorting of the resultant aligned (BAM) file
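The job files mentioned above can be illustrated with a short, hedged sketch. The example below is ours, not taken from the case study's published definition files: the tool wrapper name bwa-mem.cwl, the input identifiers and the file paths are all hypothetical; only the invocation pattern (cwltool <tool.cwl> <job.json>) reflects the reference implementation's documented usage.

```python
import json
import subprocess

# Hypothetical job object for a single alignment step. In CWL job files,
# input files are typed objects ("class": "File"), not bare path strings.
job = {
    "reads_1": {"class": "File", "path": "chr21_1.fastq"},
    "reads_2": {"class": "File", "path": "chr21_2.fastq"},
    "reference": {"class": "File", "path": "hg19.fasta"},
    "output_name": "aligned.sam",
}

with open("bwa-mem-job.json", "w") as fh:
    json.dump(job, fh, indent=2)

# The reference implementation is invoked with the tool wrapper and the job
# file; "bwa-mem.cwl" is a placeholder for a YAML tool wrapper whose declared
# inputs would have to match the keys of the job object above.
subprocess.run(["cwltool", "bwa-mem.cwl", "bwa-mem-job.json"], check=True)
```

Because every input is declared explicitly in the job object, the same pair of files documents the analysis parameters for anyone attempting to re-enact the step.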


Results and discussion
The expectation for science to be reproducible is considered fundamental but is often not tested. Every new discovery in science is built on already known knowledge; that is, published literature acts as a building block for new findings or discoveries. Using this published literature as a base, the next level of understanding is developed, and hence the cycle continues. Therefore, if we cannot reproduce already existing knowledge from the literature, we are wasting a lot of effort, resources and time doing potentially wrong science [53], resulting in a "reproducibility crisis" [54]. If a researcher claims a novel finding, someone else interested in the study should be able to reproduce it. Reports are accumulating that most scientific claims are not reproducible, hence questioning the reliability of science and rendering the literature questionable [55, 56]. The true reproducibility of experiments in different systems has not been investigated rigorously in a systematic fashion. For computational work like that described in this paper, reproducibility requires an in-depth understanding not only of the science but also of the data, methods, tools and computational infrastructure, making it a non-trivial task. The challenges imposed by large-scale genomics data demand complex computational workflow environments. A key challenge is how we can improve the reproducibility of experiments involving complex software environments and large datasets. Although this question is pertinent to the scientific community as a whole [57], here we have focused on genomic workflows.

Reproducibility of an experiment often requires replication of the precise software environment, including the operating system, the base software dependencies and the configuration settings under which the original analysis was conducted. In addition, detailed provenance information on the required software versions and parameter settings used for the workflow aids the reusability of any workflow. Provenance tracking and reproducibility go hand in hand, as provenance traces contribute to making any research process auditable and its results verifiable [58]. The variant calling workflows (as in our case study) result in genetic variation data that serves to enhance understanding of diseases when translated into a clinical setting, resulting in improved healthcare. Keeping in view the critical application of the data generated, it is safe to state that the entire process leading to such biological comprehensions must be documented systematically to guarantee reproducibility of the research. However, a generalised set of rules and recommendations to achieve this is still a challenge to be met, as workflow implementation, storage, sharing and reuse vary significantly depending on the choice of approach and platform used by the researcher. A phenomenon common to every approach, however, is 'workflow decay' [59], caused by factors such as the evolution of the technical environment used to implement a workflow, updates in the state of external factors such as databases, and the unavailability of third party web resources. Our study contributes to understanding the reproducibility requirements of genomic workflows by investigating a set of assumptions evident from the practical implementation of the case study and providing standardised recommendations for computational genomic workflow studies.

Owing to the production of exceptional amounts of genomics data, a typical human exome sequence analysis (for example the current case study) would require a terabyte of storage and up to 64GB RAM of compute power. As the computational dependencies of workflows have grown complex, from simple batch execution to distributed and parallel processing, researchers should document and provide the amount of storage and compute power required by a workflow to run successfully. Long term reproducibility of scientific results can be hard to achieve if the appropriate resources required to reproduce the workflow are not fully declared. Apart from the declaration of compute and storage resources required to successfully execute a workflow, comprehensive efforts by workflow developers could result in better management of dependencies. A tool or a workflow built on a specific computing platform requires the details of the exact version of the underlying base software to execute successfully. One example is the requirement of a particular version of Java (1.8) to execute tools from the GATK or Picard toolkit used in a workflow. The absence of such information about base software requirements such as Java or Python would result in at least one unsuccessful execution of the workflow. We recommend that workflow developers devise a mechanism (e.g. provide a script) that implements checkpoints to analyse the suitability of the computing platform before the execution attempt. This will ideally guide the researchers trying to reproduce a workflow, who would otherwise waste considerable time tackling 'dependency hell'. The burden obviously shifts to the workflow developers, but in the longer run it would be helpful to declare and document the very basic information, which is often considered too obvious to state.
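As a concrete illustration of such a checkpoint script (our sketch, not an existing feature of Cpipe, Galaxy or CWL), the following verifies base software versions and tool availability before any execution is attempted. The required tool list and the Java version are examples drawn from this case study; a workflow developer would substitute the actual requirements of their pipeline.

```python
import shutil
import subprocess
import sys

# Example requirements from this case study: GATK/Picard need a specific
# Java version, and the aligner must be on PATH. Values are illustrative.
REQUIRED_TOOLS = ["java", "bwa", "samtools"]
REQUIRED_JAVA_PREFIX = "1.8"

def check_platform():
    """Fail fast, before execution, if the platform cannot run the workflow."""
    failures = []
    for tool in REQUIRED_TOOLS:
        if shutil.which(tool) is None:
            failures.append(f"required tool not on PATH: {tool}")
    if shutil.which("java"):
        # 'java -version' prints e.g. 'java version "1.8.0_151"' on stderr.
        out = subprocess.run(["java", "-version"],
                             capture_output=True, text=True).stderr
        version_line = out.splitlines()[0] if out else "unknown"
        if f'"{REQUIRED_JAVA_PREFIX}' not in out:
            failures.append(f"Java {REQUIRED_JAVA_PREFIX} required, got: {version_line}")
    if failures:
        sys.exit("Platform check failed:\n" + "\n".join(failures))
    print("Platform check passed; proceeding with workflow execution.")

if __name__ == "__main__":
    check_platform()
```

Shipping such a script with the workflow turns the otherwise implicit base software assumptions into an explicit, testable declaration.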



Genomic data analysis has grown complex with the increased involvement of customized scripts and online resources needed to carry out difficult tasks, increasing both the technical knowledge required and the chance that something will break. One of the major reasons for non-reproducibility of workflows is the use of volatile third party resources such as databases, tools or websites [59]. Many workflows cannot be run because the third party resources they rely on are no longer available and the results could only be reproduced using a specific version of the software, hence rendering the workflows unusable. These factors can be considered out of the control of the researchers, as every time an analysis is repeated it may assume that the system it is being reproduced on comes preconfigured with all the workflow dependencies. Also, the download of large genomic datasets from third party online resources demands that users ensure the availability of high performance networking infrastructure on their part. Volatile third party resources are an open-ended problem for which several solutions have been proposed, such as alternative resources or a local copy of the resource, to mitigate the consequences [60]. However, we believe that alternative resources might not result in the same output, hence they are a barrier to reproducibility of results [19]. The services hosting third party resources are generally under no agreement to continuously supply these resources. Even the most sophisticated and widely used technologies, such as container based approaches, require connection to the network and online resources at least once for building the required software components.

Third party services such as copyrighted or proprietary resources should be avoided in research involving genomic datasets, as they can result in an inability to access original resources or tools, overshadow the ramifications of the research and halt reproducibility. The possible solutions for reproducing research involving such tools are buying or re-implementing the software, which is often not a realistic expectation. Instead, the community should push forward to work towards open source software and collaborative science [61], which makes it easier to communicate and access scientific knowledge. Efforts such as the Centre for Open Science7 are working towards encouraging openness and reproducibility of scholarly research, hence accelerating scientific progress.

Additionally, explicit requirements for a specific analysis environment, e.g. hard coded paths and names embedded in source code, should be avoided in the pipeline definition. In our case study, creation of an analysis environment with a particular directory and file naming convention was required by Cpipe to execute the workflow successfully [33]. From our experience, we recommend that this should not be a rule, as it adversely affects the portability of the workflow. An extra responsibility on a researcher reproducing someone else's workflow is to define the analysis environment and related parameters. We recommend avoiding hard coding of file names, absolute file paths, host names, user names and IP addresses. Workflow developers should ensure their workflows are independent of a specific analysis environment to allow their workflows to be more readily executable.
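A minimal sketch of this recommendation, assuming a hypothetical config.json supplied by the person reproducing the analysis, shows how environment-specific paths can be kept out of the pipeline source; all keys and defaults here are illustrative, not taken from Cpipe.

```python
import json
from pathlib import Path

def load_config(config_path="config.json"):
    """Resolve all input locations from a user-supplied configuration file.

    Environment-specific values live in the configuration rather than in
    the pipeline source, so the workflow does not depend on one machine's
    directory layout or naming conventions.
    """
    with open(config_path) as fh:
        cfg = json.load(fh)
    # All paths are resolved relative to a single configurable root.
    root = Path(cfg["data_root"])
    return {
        "reference": root / cfg["reference_fasta"],      # e.g. "hg19.fasta"
        "reads": [root / r for r in cfg["fastq_files"]],
        "output_dir": Path(cfg.get("output_dir", "results")),
    }

if __name__ == "__main__":
    paths = load_config()
    paths["output_dir"].mkdir(parents=True, exist_ok=True)
    print("Resolved inputs:", paths)
```

With this design, moving the analysis to a new system requires editing one small configuration file rather than hunting for hard coded paths across the pipeline scripts.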


In principle, workflow management systems such as Galaxy use a publicly shared repository for published tools and workflows. In practice this is a challenge, as there are many ways to set up the analysis to begin with. Galaxy allows users to choose the computing platform, such as the centralised public Galaxy, Galaxy on the cloud or a localised instance. There are more than 80 publicly shared Galaxy servers,8 each containing different toolsets. Workflow developers can create a workflow using their localised instances and later publish these workflows assuming uniformity of tool repositories across different platforms. This can result in static and inflexible solutions that are challenging to reproduce, as it assumes uniformity of repositories across different platform instances. Workflow developers are recommended to ensure the availability of the tools used in workflows implemented on local instances of any workflow management system. These tools should either be shared via repositories associated with a certain workflow system or via open source code sharing solutions, e.g. through a git repository. The repository maintainers should make the process of adding tools to centralized repositories straightforward and easy to implement. This would result in cost effective analysis, encouraging researchers to reuse the resources provided instead of reinventing the wheel.

Input such as sequencing reads in FASTA files and reference datasets plays a major role in enabling reproducibility of genomic workflows and ultimately achieving repeatable results. Even in the case where the user has a comprehensive understanding of the workflow analysis, the absence of input data annotations hinders the successful execution of the workflow. Analysis tools usually require strict adherence to file formats (e.g. the reference sequence should be a single reference sequence in FASTA format, or the names and order of the contigs in the reference used must exactly match those of one of the official reference canonical orderings). This demands providing access to the primary data used in the analysis. However, a major implication of this idea lies in the security and ethical considerations of genomics data. The community needs to address this issue by providing secure, controlled access to sensitive genomic data. Also, the size of genomic datasets can be a problem in sharing the datasets and providing them to workflow specifications. In such cases, where it is not possible to package or share datasets with the workflow, comprehensive annotations will assist researchers to decide on the appropriate datasets for the workflow. Public repositories9,10 and resources can also be used to archive, preserve and share genomic datasets.

With ever evolving repositories, services, tools and data, a workflow specification alone is rarely sufficient to ensure reproducibility and reusability of scientific experiments, resulting in workflow decay. One way to avoid workflow decay is to provide complete provenance capture, including annotations for every process during workflow execution, the parameters, and links to third party resources including data and external software services. This information should be available with the published workflow. The relevant parameter setting for each tool used in an analysis is also essential to ensure reproducibility of results, hence it should be provided with the workflow. Alternatively, workflow developers should package all associated tools when the workflow is published.
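As one illustration of what such per-step provenance capture could look like (our sketch, not a feature of any of the three systems studied), each tool invocation can be wrapped so that the exact command, a timestamp, and content checksums of its inputs and outputs are appended to a log; the bwa command in the usage comment is illustrative only.

```python
import hashlib
import json
import subprocess
from datetime import datetime, timezone

def file_digest(path):
    """Checksum a file so inputs and outputs are identified by content."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def run_step(name, command, inputs, outputs, log_path="provenance.jsonl"):
    """Run one workflow step and append a retrospective provenance record."""
    subprocess.run(command, check=True)
    record = {
        "step": name,
        "command": command,  # the exact tool invocation and parameters
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "inputs": {p: file_digest(p) for p in inputs},
        "outputs": {p: file_digest(p) for p in outputs},
    }
    with open(log_path, "a") as fh:
        fh.write(json.dumps(record) + "\n")

# Hypothetical usage for one alignment step (paths are placeholders):
# run_step("bwa-mem",
#          ["bwa", "mem", "-o", "aligned.sam", "hg19.fasta", "chr21_1.fastq"],
#          inputs=["hg19.fasta", "chr21_1.fastq"], outputs=["aligned.sam"])
```

Publishing such a log alongside the workflow gives later researchers the retrospective provenance (exact versions, parameters and data identities) that the workflow specification alone cannot provide.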


Workflows should be treated as first class data objects [62], and container technologies such as Docker, OpenVZ11 or LXC12 containers should be used to package the environment and configuration together. Approaches such as CWL utilise Docker containers, work on the principle of comprehensive declaration, and make minimal internal assumptions about the precise software environment, base software dependencies, configuration settings, alteration of parameters and software versions. Such approaches aim to build flexible and customized workflows including the intricate details of every process in a workflow, such as requirement declarations for the runtime environment, data and metadata, input and output parameters and the command-line executable. This results in archiving of the entire framework of the software environment, which can be re-established to support reproducibility. However, working with this kind of approach is not an easy task and requires a lot of time, effort and substantial technical support (in our case study this was provided by the CWL team) to first learn the principles of the language and then code the system configuration of a complex genome analysis workflow.

Hence, the details vital to the reproducibility of any computational genomic analysis should be completely documented to ensure capture of critical provenance information. From the experience gained in this study, we posit that workflow developers, along with other mechanisms, should collectively document the important pieces of information through a graphical representation of the workflow, as indicated in Fig. 3. The flowchart in the figure can be used as a model to record a high level representation of the underlying complex workflow. It is a blueprint containing all the artefacts, including tools, input data and intermediate data products, supporting resources, processes and the connections between these artefacts. To re-enact any workflow, users should be directed to explicitly understand and declare all the requirements mentioned in such a workflow representation. The proposed representation of the variant calling workflow (Fig. 3) contains all the necessary artefacts needed to support reproducibility requirements and provenance tracking across platforms. The concept of a visual representation of the workflow is implemented in only a few GUI based workbenches [27, 30, 63], but such high level representation depicts an inadequate illustration of the analysis, as evident from Fig. 4.

Fig. 3 Graphical representation of the GATK workflow representing artefacts and information necessary to be captured as part of workflow execution. The description of the main steps is depicted in the black rectangles, whereas the tools responsible for carrying out the steps are shown in grey ellipses. Input and reference files (brown rounded rectangles) are shown separately and labelled by the dataset name. The primary and secondary output files (if any) are shown in dark and light green snip diagonal corner rectangles respectively. The input and output data flow for each workflow step is demonstrated through red and green dotted arrows respectively. The connection between processes in a workflow is represented by a blue solid arrow. The yellow highlighted parts of the workflow are the pivotal processes not explicitly declared in Galaxy and Cpipe. The red flag highlights the main input and final output for the workflow

During this study we observed that the final Galaxy workflow diagram does not state the utilization of some tools such as BWA Index, SAMtools View, SAMtools Sort, SAMtools Faidx and Picard CreateSequenceDictionary. Therefore the incomplete Galaxy workflow diagram (Fig. 4) is challenging to reproduce on other platforms, as the necessary information about each step is not recorded. Hence, platforms making assumptions about some aspects of a workflow without documenting them as part of the final workflow diagram result in an incomplete understanding of the reproducibility requirements.

Fig. 4 The variant calling workflow representation in Galaxy

The workflows used to implement biomedical data analyses have grown complex [64], making it difficult to understand and reproduce such experiments. A graphical representation (Fig. 3) allows visualization of multiple aspects of workflow definition and implementation, including data manipulation and interpretation. Enabling simplicity by representing complex workflows in human readable formats can significantly reduce the complexity of such analyses through improved understanding. As studies involving complex analysis tasks encompass human judgments, it is important that the research community works in this direction to help researchers transfer their knowledge and expertise using the proposed rich and easy to create representations. Further, the proposed human readable description (along with the machine readable ones) can help identify bottlenecks in the analysis and ultimately accelerate reproducibility of data driven sciences.

Conclusion
Reproducibility of computational genomic studies has been considered a major issue in recent times. In this context, we have characterised workflows on the basis of the approach used for their definition and implementation. To evaluate reproducibility and provenance requirements, we implemented a complex variant discovery workflow using three exemplar workflow definition approaches. We identified numerous implicit assumptions through the practical execution of the workflow, leading to recommendations for reproducibility and provenance, as shown in Table 1.

Workflows are often (typically!) dependent on the replication of complex software environments, necessitating substantial technical support to reproduce the configuration settings required for the analysis. This varies depending on the different approaches taken to workflow design and execution. The assumptions followed in each approach are one of the reasons for this heterogeneity, which subsequently results in incomplete documentation of workflow requirements. Our case study illustrates the variability in workflow implementations based on the platform selected, which can impact crucial requirements for reproducibility and provenance that are currently missing from workflows. Ensuring reproducibility is highly dependent on the efforts of researchers to convey their analysis in a way that is comprehensive and understandable. We posit that adhering to the proposed recommendations along with an
Kanwal et al. BMC Bioinformatics (2017) 18:337 Page 12 of 14

Table 1 Summary of assumptions (detailed in section Workflow enactment using the selected systems) and corresponding
recommendation for reproducibility
Assumptions Recommendations
Availability of sufficient storage and compute resources Workflow developers should provide complete documentation of compute
to deal with processing of big genomics data and storage requirements along with the workflow to achieve long-term
reproducibility of scientific results.
Availability of high performance networking infrastructure Considering the size and volume of genomic data, researchers reproducing
to move bulk genomics data any analysis should ensure that an appropriate networking structure for
data transfer is on hand
The computing platform is preconfigured with the base software Workflow developers should provide a mechanism with check points to
required by the workflow specification ensure compatibility of the computing platform deployed by a researcher
to reproduce the original analysis
Users are responsible to ensure access to copyrighted Community should encourage work leveraging open source software
or proprietary tools and collaborative approaches thereby avoiding use of copyrighted
or proprietary tools
Analysis environment with a particular directory structure and file Workflow developers should avoid hardcoding environmental parameters
naming conventions is setup before executing the workflow such as file names, absolute file paths and directory names that would
otherwise render their workflow dependent on a specific environment
setup and configuration
Appropriate datasets are used as input to the tools incorporated As bioinformatics analysis tools require strict adherence to input or
in the workflow reference file formats, data annotations and controlled access to
primary data can ultimately help reproduce the workflow precisely
Users will have a comprehensive understanding of the analysis and Workflow developers should provide a complete data flow diagram serving
the provided information (in the form of incomplete workflow diagram) as a blue print containing all the artefacts including tools, input data,
is sufficient to convey high level understanding of the workflow intermediate data products, supporting resources, processes and the
connection between these artefacts
Availability of specific tool versions and setting relevant Tools should either be packaged along with the workflow or made available
parameter space via public repositories to ensure accessibility to the exact same versions and
parameter settings as used in the analysis being reproduced, hence
supporting flexible and customizable workflows.
Users to have proficient knowledge of the specific reference This factor might be considered out of control of the workflow developers
implementation but detailed documentation of the underlying framework used and
community support can help in overcoming the associated learning curve

7
explicit declaration of workflow specification would https://cos.io/
8
result in enhanced reproducibility of computational wiki.galaxyproject.org/PublicGalaxyServers
9
genomic analyses. The graphical representation pro- http://www.data.cam.ac.uk/funders
10
posed in this study can potentially be translated using http://www.nature.com/sdata/policies/repositories
11
the available community accepted standards for prov- https://openvz.org/Main_Page
12
enance [65] and tested across different platforms to https://linuxcontainers.org/
generalise it for further extension to other workflows.
In future, it would be interesting to extend this case study with other workflow systems such as Wings, Kepler, WDL, VisTrails and Taverna to analyse the reproducibility and provenance requirements, hence potentially updating the recommendations if any assumption specific to these systems is identified.

Endnotes
1. www.broadinstitute.org/gatk/guide/article.php?id=1213
2. https://nectar.org.au/research-cloud/
3. https://github.com/skanwal/GATK-CaseStudy/blob/master/SupplementaryMaterial/AdditionalFile1.pdf
4. https://github.com/common-workflow-language/cwltool
5. https://github.com/skanwal/GATK-CaseStudy/blob/master/SupplementaryMaterial/AdditionalFile1.pdf
6. https://www.docker.com/
7. https://cos.io/
8. wiki.galaxyproject.org/PublicGalaxyServers
9. http://www.data.cam.ac.uk/funders
10. http://www.nature.com/sdata/policies/repositories
11. https://openvz.org/Main_Page
12. https://linuxcontainers.org/

Additional file
Additional file 1: This file contains details about the enactment of the GATK variant calling workflow using Galaxy, Cpipe and CWL. (PDF 585 kb)

Abbreviations
CWL: Common Workflow Language; GATK: Genome Analysis Toolkit; GUI: Graphical User Interface; WMS: Workflow Management System

Acknowledgements
We gratefully acknowledge the support of the Common Workflow Language (CWL), Melbourne Bioinformatics and Melbourne eResearch working groups for providing insightful suggestions for implementing the variant calling workflow across different platforms.

Funding
This work was supported by MIRS and MIFRS Scholarships from The University of Melbourne.


Availability of data and materials
The information for data, workflows and scripts used in the paper is available at https://github.com/skanwal/GATK-CaseStudy.

Authors' contributions
SK and FZK contributed equally to this work. Conceived and designed the case study: SK, FZK, AL, RS. Performed the experiments: FZK, SK. Analysed the data: SK, FZK. Wrote the paper: SK, FZK, RS, AL. Work supervised by AL and RS. All authors read and approved the final manuscript.

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Competing interests
The authors declare that they have no competing interests.

Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author details
1Department of Computing and Information Systems, The University of Melbourne, Melbourne, VIC 3010, Australia. 2Melbourne Bioinformatics, The University of Melbourne, Melbourne, VIC 3010, Australia.

Received: 23 March 2017 Accepted: 4 July 2017

References
1. Siva N. 1000 Genomes project. Nat Biotechnol. 2008;26(3):256.
2. Bell CJ, et al. Carrier testing for severe childhood recessive diseases by next-generation sequencing. Sci Transl Med. 2011;3(65):65ra4.
3. Vitek J, Kalibera T. Repeatability, reproducibility, and rigor in systems research. In: Proceedings of the ninth ACM international conference on embedded software. ACM; 2011.
4. Merriam-webster.com. Definition of PROVENANCE. Available from: https://www.merriam-webster.com/dictionary/provenance. Accessed 24 Jul 2015.
5. Davidson SB, Freire J. Provenance and scientific workflows: challenges and opportunities. In: Proceedings of the 2008 ACM SIGMOD international conference on management of data. Vancouver: ACM; 2008. p. 1345–50.
6. Rice P, Longden I, Bleasby A. EMBOSS: the European Molecular Biology Open Software Suite. Trends in Genetics. 2000;16(6):276–7.
7. Stajich JE, et al. The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 2002;12(10):1611–8.
8. Cock PJ, et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009;25(11):1422–3.
9. Ransohoff DF. Promises and limitations of biomarkers. In: Cancer Prevention II. Springer; 2009. p. 55–9.
10. Omenn G, Micheel CM. Evolution of Translational Omics: Lessons Learned and the Path Forward. 2012. Available from: http://www.nationalacademies.org/hmd/Reports/2012/Evolution-of-Translational-Omics.aspx. Accessed 21 Aug 2014.
11. Zheng CL, et al. Use of semantic workflows to enhance transparency and reproducibility in clinical omics. Genome Medicine. 2015;7(1):73.
12. Nekrutenko A, Taylor J. Next-generation sequencing data interpretation: enhancing reproducibility and accessibility. Nat Rev Genet. 2012;13(9):667–72.
13. Li H, Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009;25(14):1754–60.
14. Stransky N, et al. The mutational landscape of head and neck squamous cell carcinoma. Science. 2011;333(6046):1157–60.
15. Ioannidis JP, et al. Repeatability of published microarray gene expression analyses. Nat Genet. 2009;41(2):149–55.
16. Hothorn T, Held L, Friede T. Biometrical journal and reproducible research. Biom J. 2009;51(4):553–5.
17. Hothorn T, Leisch F. Case studies in reproducibility. Brief Bioinform. 2011;12(3):288–300.
18. Leipzig J. A review of bioinformatic pipeline frameworks. Brief Bioinform. 2017;18(3):530–6.
19. Kanwal S, et al. Challenges of large-scale biomedical workflows on the cloud – a case study on the need for reproducibility of results. In: 28th IEEE International Conference on Computer Based Medical Systems. Sao Paulo; 2015.
20. Ludäscher B, et al. Scientific workflow management and the Kepler system. Concurrency and Computation: Practice and Experience. 2006;18(10):1039–65.
21. Casati F, et al. Workflow evolution. Data Knowl Eng. 1998;24(3):211–38.
22. Zhao Y, Wilde M, Foster I. Applying the virtual data provenance model. In: International Provenance and Annotation Workshop. Springer; 2006.
23. Joglekar GS, Giridhar A, Reklaitis G. A workflow modeling system for capturing data provenance. Comput Chem Eng. 2014;67:148–58.
24. Missier P, et al. D-PROV: extending the PROV provenance model with workflow structure. In: TaPP; 2013.
25. Missier P, Goble C. Workflows to open provenance graphs, round-trip. Futur Gener Comput Syst. 2011;27(6):812–9.
26. Bartocci E, et al. BioWMS: a web-based Workflow Management System for bioinformatics. BMC Bioinformatics. 2007;8 Suppl 1:S2.
27. Goecks J, Nekrutenko A, Taylor J. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 2010;11(8):R86.
28. Hoon S, et al. Biopipe: a flexible framework for protocol-based bioinformatics analysis. Genome Res. 2003;13(8):1904–15.
29. Neron B, et al. Mobyle: a new full web bioinformatics framework. Bioinformatics. 2009;25(22):3005–11.
30. Wolstencroft K, et al. The Taverna workflow suite: designing and executing workflows of Web Services on the desktop, web or in the cloud. Nucleic Acids Res. 2013;41(W1):W557–61.
31. Baggerly KA, Coombes KR. Deriving chemosensitivity from cell lines: forensic bioinformatics and reproducible research in high-throughput biology. The Annals of Applied Statistics. 2009:1309–34.
32. McKenna A, et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20(9):1297–303.
33. Sadedin SP, et al. Cpipe: a shared variant detection pipeline designed for diagnostic settings. Genome Medicine. 2015;7(1):68.
34. Amstutz P, Andeer R, Chapman B, Chilton J, Crusoe MR, Guimerà RV, Carrasco Hernandez G, Ivkovic S, Kartashov A, Kern J, Leehr D, Ménager H, Mikheev M, Pierce T, Randall J, Soiland-Reyes S, Stojanovic L, Tijanić N. Common Workflow Language, draft 3. figshare; March 2016.
35. Guimera RV. bcbio-nextgen: automated, distributed next-gen sequencing pipeline. EMBnet Journal. 2012;17(B):30.
36. Fisch KM, et al. Omics Pipe: a community-based framework for reproducible multi-omics data analysis. Bioinformatics. 2015;31(11):1724–8.
37. Golosova O, et al. Unipro UGENE NGS pipelines and components for variant calling, RNA-seq and ChIP-seq data analyses. PeerJ. 2014;2:e644.
38. Köster J, Rahmann S. Snakemake—a scalable bioinformatics workflow engine. Bioinformatics. 2012;28(19):2520–2.
39. Sadedin SP, Pope B, Oshlack A. Bpipe: a tool for running and managing bioinformatics pipelines. Bioinformatics. 2012;28(11):1525–6.
40. Goodstadt L. Ruffus: a lightweight Python library for computational pipelines. Bioinformatics. 2010;26(21):2778–9.
41. Callahan SP, et al. VisTrails: visualization meets data management. In: Proceedings of the 2006 ACM SIGMOD international conference on management of data. ACM; 2006.
42. Gil Y, et al. Wings: intelligent workflow-based design of computational experiments. IEEE Intell Syst. 2011;26(1):62–72.
43. KNIME. Available from: http://www.knime.com/. Accessed 2017.
44. Sethi RJ, Gil Y. Reproducibility in computer vision: towards open publication of image analysis experiments as semantic workflows. In: 2016 IEEE 12th International Conference on e-Science. IEEE; 2016.
45. Hauder M, et al. Making data analysis expertise broadly accessible through workflows. In: Proceedings of the 6th workshop on workflows in support of large-scale science. ACM; 2011.
46. Zhao Z, Paschke A. A survey on semantic scientific workflow. Semantic Web Journal. IOS Press; 2012. p. 1–5.


47. Microsoft Azure. Workflow Definition Language. Available from: https://docs.microsoft.com/en-us/rest/api/logic/definition-language. Accessed 2017.
48. Zook J. Want to better understand the accuracy of your human genome sequencing? 2013. Available from: http://www.nist.gov/mml/bbd/ppgenomeinabottle2.cfm. Accessed Dec 2015.
49. Sadedin S. Melbourne Genomics Cpipe. 2016. Available from: https://github.com/MelbourneGenomics/cpipe. Accessed 28 Mar 2016.
50. Afgan E, et al. Genomics Virtual Laboratory: a practical bioinformatics workbench for the cloud. PLoS One. 2015;10(10):e0140829.
51. Picard. Available from: http://broadinstitute.github.io/picard/. Accessed 28 Aug 2014.
52. Common Workflow Language. 2015. Available from: https://github.com/common-workflow-language. Accessed 15 Aug 2015.
53. Rehman J. Cancer research in crisis: are the drugs we count on based on bad science? 2013. Available from: http://www.salon.com/2013/09/01/is_cancer_research_facing_a_crisis/. Accessed 14 Aug 2014.
54. Baker M. 1,500 scientists lift the lid on reproducibility. Nature. 2016;533(7604):452–4.
55. Freedman DH. Lies, damned lies, and medical science. 2010. Available from: https://www.theatlantic.com/magazine/archive/2010/11/lies-damned-lies-and-medical-science/308269/.
56. The Economist. Unreliable research – trouble at the lab. 2013. Available from: http://www.economist.com/news/briefing/21588057-scientists-think-science-self-correcting-alarming-degree-it-not-trouble.
57. Begley CG, Ioannidis JP. Reproducibility in science. Circ Res. 2015;116(1):116–26.
58. Curcin V, et al. Implementing interoperable provenance in biomedical research. Futur Gener Comput Syst. 2014;34:1–16.
59. De Roure D, et al. Towards the preservation of scientific workflows. In: Proceedings of the 8th International Conference on Preservation of Digital Objects (iPRES 2011). ACM; 2011.
60. Why workflows break: understanding and combating decay in Taverna workflows. 2012.
61. Stodden V, et al. Enhancing reproducibility for computational methods. Science. 2016;354(6317):1240–1.
62. Corcho O, et al. Workflow-centric research objects: first class citizens in scholarly discourse. 2012.
63. Freire J, Silva CT. Making computations and publications reproducible with VisTrails. Computing in Science & Engineering. 2012;14(4):18–25.
64. Metzker ML. Sequencing technologies—the next generation. Nat Rev Genet. 2009;11(1):31–46.
65. Missier P, Belhajjame K, Cheney J. The W3C PROV family of specifications for modelling provenance metadata. In: Proceedings of the 16th International Conference on Extending Database Technology. ACM; 2013.


