Software Tool Article

HDCytoData: Collection of high-dimensional cytometry benchmark datasets in Bioconductor object formats

[version 1; peer review: 2 approved with reservations]
PUBLISHED 19 Aug 2019

This article is included in the RPackage gateway.

This article is included in the Bioconductor gateway.

Abstract

Benchmarking is a crucial step during computational analysis and method development. Recently, a number of new methods have been developed for analyzing high-dimensional cytometry data. However, it can be difficult for analysts and developers to find and access well-characterized benchmark datasets. Here, we present HDCytoData, a Bioconductor package providing streamlined access to several publicly available high-dimensional cytometry benchmark datasets. The package is designed to be extensible, allowing new datasets to be contributed by ourselves or other researchers in the future. Currently, the package includes a set of experimental and semi-simulated datasets, which have been used in our previous work to evaluate methods for clustering and differential analyses. Datasets are formatted into standard SummarizedExperiment and flowSet Bioconductor object formats, which include complete metadata within the objects. Access is provided through Bioconductor's ExperimentHub interface. The package is freely available from http://bioconductor.org/packages/HDCytoData.

Keywords

benchmarking, high-dimensional cytometry, Bioconductor, ExperimentHub, clustering, differential analyses

Introduction

Benchmarking analyses are frequently used to evaluate and compare the performance of computational methods, for example by users interested in selecting a suitable method, or by developers to demonstrate performance improvements of a newly developed method. A critical part of any benchmark is the selection of appropriate benchmark datasets1,2. In some cases, suitable publicly available datasets may be found in the literature. Alternatively, new experimental or simulated datasets containing a known ground truth may be created by the authors of the benchmark1,2.

High-dimensional cytometry refers to a set of recently developed technologies that enable measurement of expression levels of up to dozens of proteins in hundreds to thousands of cells per second, using targeted antibodies labeled with various types of reporter tags. This includes multi-color flow cytometry, mass cytometry (or CyTOF), and sequence-based cytometry (or genomic cytometry). Due to the large size and high dimensionality of the resulting data, numerous computational methods have been developed for analyzing these datasets3. Many of these methods are based on the fundamental concept of analyzing cells in terms of cell populations, for example using clustering to define cell populations, or detecting differential cell populations between conditions.

In our previous work, we have collected a number of benchmark datasets to evaluate methods for clustering4 and differential analyses5 in high-dimensional cytometry data. This includes publicly available datasets previously published by other groups or our experimental collaborators, as well as new semi-simulated datasets that we generated. In these previous publications, we recorded links to original data sources and made all data available via FlowRepository6. FlowRepository is a widely used resource in the cytometry community, which has also been used by other authors to distribute benchmark datasets (e.g., 7,8). However, downloading and loading the data from these sources for further analysis in R requires customized code and matching of metadata (e.g., sample information), which can hinder accessibility and reproducibility.

Here, we introduce the HDCytoData package, which provides a resource for re-distributing high-dimensional cytometry benchmark datasets through Bioconductor’s ExperimentHub9, in order to improve accessibility. ExperimentHub provides a flexible platform for hosting datasets in the form of R/Bioconductor objects, which can be directly loaded within an R session. HDCytoData provides datasets in the form of standard SummarizedExperiment and flowSet Bioconductor object formats10–12, which include all required metadata within the objects and facilitate interoperability with R/Bioconductor-based workflows. We envisage that these datasets will be useful for future benchmarking studies, as well as other activities such as teaching, examples, and tutorials. The package is extensible, allowing new datasets to be contributed by ourselves or other researchers in the future. The package is freely available from http://bioconductor.org/packages/HDCytoData.

Methods

Implementation

The benchmark datasets currently included in the HDCytoData package consist of experimental and semi-simulated data, and can be grouped into datasets useful for benchmarking algorithms for (i) clustering and (ii) differential analyses. Table 1 and Table 2 provide an overview of the datasets.

Table 1. Summary of benchmark datasets for evaluating clustering algorithms.

For more details on these datasets, see Table 2 in ref. 4, or the HDCytoData help files.

Dataset | ExperimentHub ID | Number of cells | Number of dimensions | Number of reference cell populations | Type of ground truth | FlowRepository ID | Original reference
--- | --- | --- | --- | --- | --- | --- | ---
Levine_32dim | EH2240 – EH2241 | 265,627 | 32 | 14 | Manual gating | FR-FCM-ZZPH | 16
Levine_13dim | EH2242 – EH2243 | 167,044 | 13 | 24 | Manual gating | FR-FCM-ZZPH | 16
Samusik_01 | EH2244 – EH2245 | 86,864 | 39 | 24 | Manual gating | FR-FCM-ZZPH | 17
Samusik_all | EH2246 – EH2247 | 841,644 | 39 | 24 | Manual gating | FR-FCM-ZZPH | 17
Nilsson_rare | EH2248 – EH2249 | 44,140 | 13 | 1 (rare population) | Manual gating | FR-FCM-ZZPH | 18
Mosmann_rare | EH2250 – EH2251 | 396,460 | 14 | 1 (rare population) | Manual gating | FR-FCM-ZZPH | 19

Table 2. Summary of benchmark datasets for evaluating methods for differential analyses.

For more details on these datasets, see Supplementary Note 1 in ref. 5, or the HDCytoData help files.

Dataset | ExperimentHub ID | Type of data | Number of cells | Number of dimensions | Type of ground truth | Type of differential analysis | FlowRepository ID | Original reference
--- | --- | --- | --- | --- | --- | --- | --- | ---
Krieg_Anti_PD_1 | EH2252 – EH2253 | Experimental | 85,715 | 24 (cell type) | Qualitative | Differential abundance | FR-FCM-ZYL8 | 20
Bodenmiller_BCR_XL | EH2254 – EH2255 | Experimental | 172,791 | 24 (10 cell type; 14 cell state) | Qualitative | Differential states | FR-FCM-ZYL8 | 21
Weber_AML_sim | EH3025 – EH3046 | Semi-simulated (multiple simulation scenarios) | 157,593 (excluding spike-in) | 16 (cell type) | Spike-in cell labels | Differential abundance | FR-FCM-ZYL8 | 5
Weber_BCR_XL_sim | EH3047 – EH3064 | Semi-simulated (multiple simulation scenarios) | 85,331 (main simulation; excluding spike-in) | 24 (10 cell type; 14 cell state) | Spike-in cell labels | Differential states | FR-FCM-ZYL8 | 5

The raw datasets were collected from various sources (Table 1 and Table 2), and have been extensively reformatted and documented for inclusion in the HDCytoData package. Each dataset is stored in both SummarizedExperiment and flowSet formats, since these are the most commonly used R/Bioconductor data structures for high-dimensional cytometry data. The objects each contain one or more tables of expression values, as well as all required metadata. Following standard conventions used for cytometry data13, rows contain cells, and columns contain protein markers. Row metadata includes sample IDs, group IDs, patient IDs, reference cell population labels (where available), and labels identifying ‘spiked in’ cells (where available). Column metadata includes channel names, protein marker names, and protein marker classes (cell type or cell state). Note that raw expression values should be transformed prior to performing any downstream analyses. Standard transformations include the inverse hyperbolic sine (asinh) with cofactor parameter equal to 5 for mass cytometry or 150 for flow cytometry data (14, Supplementary Figure S2); several other alternatives also exist15.
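
As a minimal sketch of the asinh transformation described above, the following base R code applies it to a small simulated matrix with cells in rows and markers in columns. The matrix, the marker names, and the helper `asinh_transform()` are hypothetical and used for illustration only; they are not part of the HDCytoData package.

```r
# Toy expression matrix: rows are cells, columns are protein markers
# (simulated values, not taken from any HDCytoData dataset)
set.seed(123)
raw <- matrix(
  rexp(5 * 3, rate = 0.05),
  nrow = 5,
  dimnames = list(NULL, c("marker_A", "marker_B", "marker_C"))
)

# Hypothetical helper: asinh transformation with a cofactor, asinh(x / cofactor)
asinh_transform <- function(x, cofactor = 5) {
  asinh(x / cofactor)
}

# Cofactor 5 is the value mentioned for mass cytometry, 150 for flow cytometry
transformed_cytof <- asinh_transform(raw, cofactor = 5)
transformed_flow  <- asinh_transform(raw, cofactor = 150)

# The transformation is element-wise, so dimensions and marker names are kept
dim(transformed_cytof)
colnames(transformed_cytof)
```

Because asinh(x) = log(x + sqrt(x^2 + 1)), the transformed scale is approximately linear near zero and approximately logarithmic for large values; the cofactor controls where this transition occurs.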

Most of these datasets include a known ground truth, enabling the calculation of statistical performance metrics. The ground truth information consists of reference cell population labels for the clustering datasets, and labels identifying computationally ‘spiked in’ cells for the differential analysis datasets. The datasets without a ground truth instead consist of experimental datasets that contain a known biological signal, which can be used to evaluate methods in qualitative terms; i.e., whether methods can reproduce the known biological result.

Extensive documentation is available via the help files for each dataset — including descriptions of the datasets, details on accessor functions required to access the expression tables and metadata, and links to original sources. In addition, reproducible R scripts demonstrating how the formatted SummarizedExperiment and flowSet objects were generated from the original raw data files are included within the source code of the package. New datasets may be contributed by ourselves or other authors by providing (i) formatted SummarizedExperiment and flowSet objects containing the data as well as all necessary metadata, (ii) reproducible R scripts showing how the formatted objects were generated from the original raw data files, and (iii) comprehensive documentation.

Operation

The HDCytoData package can be installed by following standard Bioconductor package installation procedures. All datasets listed in Table 1 and Table 2 are available in Bioconductor version 3.10 and above. Minimum system requirements include a recent version of R (3.6 or later; this paper was prepared using R version 3.6.1), on a Mac, Windows, or Linux system. Example installation code is shown below.

# install BiocManager
install.packages("BiocManager")

# install HDCytoData package
BiocManager::install("HDCytoData")

Once the HDCytoData package is installed, the datasets can be downloaded from ExperimentHub and loaded directly into an R session using only a few lines of R code. This can be done by either (i) referring to named functions for each dataset, or (ii) creating an ExperimentHub instance and referring to the dataset IDs. Example code for each option for one of the datasets is shown below. Note that each dataset is available in both SummarizedExperiment and flowSet formats. After an object has been downloaded, the ExperimentHub client stores it in a local cache for faster retrieval. For more details on accessing ExperimentHub resources, refer to the ExperimentHub vignette available from Bioconductor.

# load HDCytoData package
library(HDCytoData)

# option 1: load datasets using named functions
d_SE <- Bodenmiller_BCR_XL_SE()
d_flowSet <- Bodenmiller_BCR_XL_flowSet()

# option 2: load datasets by creating ExperimentHub instance
ehub <- ExperimentHub()
query(ehub, "HDCytoData")
d_SE <- ehub[["EH2254"]]
d_flowSet <- ehub[["EH2255"]]

Once the datasets have been downloaded and loaded, they are available to the user as R objects within the R session. They can then be inspected and manipulated using standard accessor and subsetting functions (for either the SummarizedExperiment or flowSet object class). Example code to inspect a SummarizedExperiment is displayed below. For more details on how to load and inspect datasets, including the expected output from each function shown here, refer to the HDCytoData vignette available from Bioconductor.

# inspect SummarizedExperiment object
d_SE
assays(d_SE)
rowData(d_SE)
colData(d_SE)
metadata(d_SE)
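
The same accessors can be tried offline on a small toy object. The sketch below builds a SummarizedExperiment that follows the layout described earlier (cells in rows, markers in columns) and subsets it by marker class and by sample. All names and values here (`toy_SE`, the marker names, the `marker_class` levels, the sample IDs) are assumptions for illustration; the column names used in the actual datasets should be checked in the HDCytoData help files.

```r
library(SummarizedExperiment)

# Toy expression table: 6 cells (rows) x 3 markers (columns), simulated values
set.seed(1)
exprs <- matrix(
  rexp(6 * 3, rate = 0.1),
  nrow = 6,
  dimnames = list(NULL, c("CD3", "CD45", "pS6"))
)

# Row metadata describes cells; column metadata describes markers
row_data <- S4Vectors::DataFrame(
  sample_id = factor(rep(c("s1", "s2"), each = 3)),
  group_id  = factor(rep(c("ref", "stim"), each = 3))
)
col_data <- S4Vectors::DataFrame(
  marker_name  = colnames(exprs),
  marker_class = factor(c("type", "type", "state")),
  row.names    = colnames(exprs)
)

toy_SE <- SummarizedExperiment(
  assays  = list(exprs = exprs),
  rowData = row_data,
  colData = col_data
)

# Expression values for the cell type markers only
type_cols  <- colData(toy_SE)$marker_class == "type"
type_exprs <- assay(toy_SE, "exprs")[, type_cols]

# All cells belonging to one sample
sample1_SE <- toy_SE[rowData(toy_SE)$sample_id == "s1", ]
```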

Documentation describing each dataset is available in the help files for the objects, which can be accessed using the standard R help interface, as shown below.

# display documentation (help files)
?Bodenmiller_BCR_XL
help(Bodenmiller_BCR_XL)

Use cases

The datasets currently included in the HDCytoData package (Table 1 and Table 2) can be used to benchmark methods for either (i) clustering or (ii) differential analyses. In addition, these datasets may be useful for other activities such as teaching, examples, and tutorials (e.g., demonstrating how to use a new computational tool).

For benchmarks using the clustering datasets (Table 1), performance can be evaluated by calculating metrics such as the mean F1 score or adjusted Rand index, which measure the similarity between two sets of cell labels (i.e., the cluster labels and the ground truth reference cell population labels)1. For examples (including reproducible R code), see the evaluations in our previous study4. An additional visual example is displayed in Figure 1, which compares the performance of three different dimensionality reduction algorithms (principal component analysis [PCA], t-distributed stochastic neighbor embedding [tSNE]22,23, and uniform manifold approximation and projection [UMAP]24,25) in visually separating the known cell populations in the Levine_32dim dataset (see Table 1). R code to reproduce Figure 1 using data downloaded from the HDCytoData package is available at http://github.com/lmweber/HDCytoData-example.
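
To make these metrics concrete, the following base R sketch computes the adjusted Rand index and a mean F1 score on toy labels. The functions `adjusted_rand_index()` and `mean_f1()` and the label vectors are hypothetical. Note the simplification: each reference population is matched to the cluster that gives its highest F1 score, which is not necessarily the matching procedure used in the previous study and may assign one cluster to several populations.

```r
# Adjusted Rand index from the contingency table of two label vectors
adjusted_rand_index <- function(x, y) {
  tab <- table(x, y)
  sum_cells <- sum(choose(tab, 2))           # pairs together in both labelings
  sum_rows  <- sum(choose(rowSums(tab), 2))  # pairs together in x
  sum_cols  <- sum(choose(colSums(tab), 2))  # pairs together in y
  expected  <- sum_rows * sum_cols / choose(sum(tab), 2)
  max_index <- (sum_rows + sum_cols) / 2
  # Undefined (NaN) if both labelings place all cells in a single group
  (sum_cells - expected) / (max_index - expected)
}

# Mean F1 score: for each reference population, take the best-matching cluster
mean_f1 <- function(clusters, reference) {
  tab <- unclass(table(clusters, reference))       # rows: clusters, columns: populations
  precision <- tab / rowSums(tab)                  # share of each cluster in a population
  recall    <- sweep(tab, 2, colSums(tab), "/")    # share of each population in a cluster
  f1 <- 2 * precision * recall / (precision + recall)
  f1[is.nan(f1)] <- 0                              # no overlap gives 0 / 0
  mean(apply(f1, 2, max))
}

# Toy labels: 10 cells, 3 reference populations, 3 clusters
reference <- c("A", "A", "A", "A", "B", "B", "B", "B", "C", "C")
clusters  <- c(1, 1, 1, 2, 2, 2, 2, 2, 3, 3)

ari <- adjusted_rand_index(clusters, reference)
f1  <- mean_f1(clusters, reference)
```

Both metrics depend only on how cells are grouped, not on the label values, so cluster labels do not need to be renamed to match the reference populations.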


Figure 1. Example of use case for benchmark datasets in the HDCytoData package.

This example compares performance (in visual terms) of three dimensionality reduction algorithms — principal component analysis (PCA), t-distributed stochastic neighbor embedding (tSNE), and uniform manifold approximation and projection (UMAP) — for representing known cell populations in the Levine_32dim dataset (Table 1).

For benchmarks using the differential analysis datasets (Table 2), methods can be evaluated by their ability to recover the known differential signals, either in quantitative terms using the ground truth spike-in cell labels (for the semi-simulated datasets), or in qualitative terms (for the experimental datasets). The differential signals consist of either differential abundance of cell populations, or differential states within cell populations (i.e., differential expression of additional functional markers within cell populations), providing conceptually distinct differential analysis tasks. For examples (including reproducible R code), see the evaluations in our previous study5.
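
One possible way to turn spike-in cell labels into quantitative metrics is sketched below in base R: cluster-level adjusted p-values are assigned to the cells in each cluster, and the resulting cell-level calls are compared with the spike-in labels. All values here (cluster assignments, spike-in labels, adjusted p-values, and the 0.1 threshold) are made up for illustration and do not come from any HDCytoData dataset or from the evaluations in the previous study.

```r
# Toy setup: 1,000 cells in 10 clusters of 100 cells each
cluster_id <- rep(1:10, each = 100)

# Hypothetical ground truth: 80% of the cells in cluster 3 are spike-in cells
spikein <- cluster_id == 3 & rep(c(TRUE, TRUE, TRUE, TRUE, FALSE), times = 200)

# Hypothetical cluster-level adjusted p-values from some differential test
p_adj <- c(0.9, 0.6, 0.001, 0.4, 0.03, 0.7, 0.8, 0.5, 0.95, 0.2)

# Assign each cell the adjusted p-value of its cluster
cell_p_adj <- p_adj[cluster_id]

# Call cells as detected at a chosen threshold and compare with the labels
detected <- cell_p_adj < 0.1
tpr <- sum(detected & spikein) / sum(spikein)             # true positive rate
fdr <- sum(detected & !spikein) / max(sum(detected), 1)   # false discovery rate
```

In this toy example, clusters 3 and 5 fall below the threshold: every spike-in cell is detected, but the detected set also contains the non-spike-in cells of both clusters, which is reflected in the false discovery rate.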

Summary

The HDCytoData package is an extensible resource providing streamlined access to a number of publicly available benchmark datasets used in our previous work on high-dimensional cytometry data analysis. Datasets are provided in standard Bioconductor object formats, and are hosted on Bioconductor’s ExperimentHub platform. By facilitating access to these datasets, we hope to support other researchers interested in designing rigorous benchmarks for method development or other computational analyses, as well as other activities such as teaching, examples, and tutorials.

Data availability

All data underlying the results are available as part of the article and no additional source data are required.

Software availability

Software available from: http://bioconductor.org/packages/HDCytoData

Source code available from: https://github.com/lmweber/HDCytoData

Archived source code at time of publication: https://doi.org/10.5281/zenodo.3362847 (ref. 26)

Licence: MIT License

how to cite this article
Weber LM and Soneson C. HDCytoData: Collection of high-dimensional cytometry benchmark datasets in Bioconductor object formats [version 1; peer review: 2 approved with reservations] F1000Research 2019, 8:1459 (https://doi.org/10.12688/f1000research.20210.1)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.

Open Peer Review

Reviewer Report 02 Sep 2019
Laurent Gatto, De Duve Institute, University of Louvain (UCLouvain), Brussels, Belgium 
Approved with Reservations
Weber and Soneson present HDCytoData, a Bioconductor data package providing pre-formatted high-dimensional cytometry data. The preparation of the datasets as SummarizedExperiment and flowSet objects makes these amenable for benchmarking, a crucial step when developing new methods.

Gatto L. Reviewer Report For: HDCytoData: Collection of high-dimensional cytometry benchmark datasets in Bioconductor object formats [version 1; peer review: 2 approved with reservations]. F1000Research 2019, 8:1459 (https://doi.org/10.5256/f1000research.22200.r52679)
  • Author Response 04 Dec 2019
    Lukas Weber, SIB Swiss Institute of Bioinformatics, Zurich, 8057, Switzerland
    04 Dec 2019
    Author Response
    Thank you for your comments and suggestions. As suggested, we have provided significant additional material on the procedure for contributing new datasets. We have expanded the section in the text ... Continue reading
Reviewer Report 28 Aug 2019
Shila Ghazanfar, Cancer Research UK Cambridge Institute (CRUK CI), Cambridge, UK 
Approved with Reservations
Weber and Soneson have written a software article presenting HDCytoData (currently version 1.4.0), a Bioconductor package aimed at making multiple high-dimensional cytometry (HDC) datasets available in a consistent R friendly format as either SummarizedExperiment or flowSet objects. The authors have ... Continue reading
Ghazanfar S. Reviewer Report For: HDCytoData: Collection of high-dimensional cytometry benchmark datasets in Bioconductor object formats [version 1; peer review: 2 approved with reservations]. F1000Research 2019, 8:1459 (https://doi.org/10.5256/f1000research.22200.r52681)
  • Author Response 04 Dec 2019
    Lukas Weber, SIB Swiss Institute of Bioinformatics, Zurich, 8057, Switzerland
    04 Dec 2019
    Author Response
    Thank you for your comments and suggestions. We have updated the text, vignettes, and help files to clarify each of the issues raised above. Below are also responses to the ... Continue reading