Keywords
benchmarking, high-dimensional cytometry, Bioconductor, ExperimentHub, clustering, differential analyses
This article is included in the RPackage gateway.
This article is included in the Bioconductor gateway.
Benchmarking analyses are frequently used to evaluate and compare the performance of computational methods, for example by users selecting a suitable method, or by developers demonstrating the performance improvements of a newly developed method. A critical part of any benchmark is the selection of appropriate benchmark datasets [1,2]. In some cases, suitable publicly available datasets may be found in the literature. Alternatively, the authors of the benchmark may create new experimental or simulated datasets containing a known ground truth [1,2].
High-dimensional cytometry refers to a set of recently developed technologies that enable measurement of expression levels of up to dozens of proteins in hundreds to thousands of cells per second, using targeted antibodies labeled with various types of reporter tags. These include multi-color flow cytometry, mass cytometry (or CyTOF), and sequence-based cytometry (or genomic cytometry). Due to the large size and high dimensionality of the resulting data, numerous computational methods have been developed for analyzing these datasets [3]. Many of these methods are based on the fundamental concept of analyzing cells in terms of cell populations, for example using clustering to define cell populations, or detecting differential cell populations between conditions.
In our previous work, we collected a number of benchmark datasets to evaluate methods for clustering [4] and differential analyses [5] in high-dimensional cytometry data. These include publicly available datasets previously published by other groups or our experimental collaborators, as well as new semi-simulated datasets that we generated. In these previous publications, we recorded links to original data sources and made all data available via FlowRepository [6]. FlowRepository is a widely used resource in the cytometry community, which has also been used by other authors to distribute benchmark datasets (e.g., [7,8]). However, downloading and loading the data from these sources for further analysis in R requires customized code and matching of metadata (e.g., sample information), which can hinder accessibility and reproducibility.
Here, we introduce the HDCytoData package, which provides a resource for re-distributing high-dimensional cytometry benchmark datasets through Bioconductor’s ExperimentHub [9], in order to improve accessibility. ExperimentHub provides a flexible platform for hosting datasets in the form of R/Bioconductor objects, which can be directly loaded within an R session. HDCytoData provides datasets in the form of standard SummarizedExperiment and flowSet Bioconductor object formats [10–12], which include all required metadata within the objects and facilitate interoperability with R/Bioconductor-based workflows. We envisage that these datasets will be useful for future benchmarking studies, as well as other activities such as teaching, examples, and tutorials. The package is extensible, allowing new datasets to be contributed by ourselves or other researchers in the future. The package is freely available from http://bioconductor.org/packages/HDCytoData.
The benchmark datasets currently included in the HDCytoData package consist of experimental and semi-simulated data, and can be grouped into datasets useful for benchmarking algorithms for (i) clustering and (ii) differential analyses. Table 1 and Table 2 provide an overview of the datasets.
Table 1. Benchmark datasets for clustering.
Dataset | ExperimentHub ID | Number of cells | Number of dimensions | Number of reference cell populations | Type of ground truth | FlowRepository ID | Original reference |
---|---|---|---|---|---|---|---|
Levine_32dim | EH2240 – EH2241 | 265,627 | 32 | 14 | Manual gating | FR-FCM-ZZPH | 16 |
Levine_13dim | EH2242 – EH2243 | 167,044 | 13 | 24 | Manual gating | FR-FCM-ZZPH | 16 |
Samusik_01 | EH2244 – EH2245 | 86,864 | 39 | 24 | Manual gating | FR-FCM-ZZPH | 17 |
Samusik_all | EH2246 – EH2247 | 841,644 | 39 | 24 | Manual gating | FR-FCM-ZZPH | 17 |
Nilsson_rare | EH2248 – EH2249 | 44,140 | 13 | 1 (rare population) | Manual gating | FR-FCM-ZZPH | 18 |
Mosmann_rare | EH2250 – EH2251 | 396,460 | 14 | 1 (rare population) | Manual gating | FR-FCM-ZZPH | 19 |
Table 2. Benchmark datasets for differential analyses.
Dataset | ExperimentHub ID | Type of data | Number of cells | Number of dimensions | Type of ground truth | Type of differential analysis | FlowRepository ID | Original reference |
---|---|---|---|---|---|---|---|---|
Krieg_Anti_PD_1 | EH2252 – EH2253 | Experimental | 85,715 | 24 (cell type) | Qualitative | Differential abundance | FR-FCM-ZYL8 | 20 |
Bodenmiller_BCR_XL | EH2254 – EH2255 | Experimental | 172,791 | 24 (10 cell type; 14 cell state) | Qualitative | Differential states | FR-FCM-ZYL8 | 21 |
Weber_AML_sim | EH3025 – EH3046 | Semi-simulated (multiple simulation scenarios) | 157,593 (excluding spike-in) | 16 (cell type) | Spike-in cell labels | Differential abundance | FR-FCM-ZYL8 | 5 |
Weber_BCR_XL_sim | EH3047 – EH3064 | Semi-simulated (multiple simulation scenarios) | 85,331 (main simulation; excluding spike-in) | 24 (10 cell type; 14 cell state) | Spike-in cell labels | Differential states | FR-FCM-ZYL8 | 5 |
The raw datasets were collected from various sources (Table 1 and Table 2), and have been extensively reformatted and documented for inclusion in the HDCytoData package. Each dataset is stored in both SummarizedExperiment and flowSet formats, since these are the most commonly used R/Bioconductor data structures for high-dimensional cytometry data. The objects each contain one or more tables of expression values, as well as all required metadata. Following standard conventions used for cytometry data [13], rows contain cells, and columns contain protein markers. Row metadata includes sample IDs, group IDs, patient IDs, reference cell population labels (where available), and labels identifying ‘spiked in’ cells (where available). Column metadata includes channel names, protein marker names, and protein marker classes (cell type or cell state). Note that raw expression values should be transformed prior to performing any downstream analyses. Standard transformations include the inverse hyperbolic sine (asinh) with cofactor parameter equal to 5 for mass cytometry or 150 for flow cytometry data ([14], Supplementary Figure S2); several other alternatives also exist [15].
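As a minimal sketch of the asinh transformation mentioned above, the base R snippet below applies it to a small made-up matrix laid out with cells in rows and markers in columns. The matrix values, the marker names, and the helper name `transform_asinh` are illustrative assumptions and are not part of the HDCytoData package.

```r
# Toy expression matrix: rows = cells, columns = protein markers.
# Values and marker names are made up for illustration only.
exprs_raw <- matrix(
  c(0, 5, 50, 500,
    10, 0, 25, 250),
  ncol = 2,
  dimnames = list(NULL, c("CD3", "CD4"))
)

# Hypothetical helper: asinh transformation with a cofactor
# (5 is the value quoted for mass cytometry, 150 for flow cytometry).
transform_asinh <- function(x, cofactor = 5) {
  asinh(x / cofactor)
}

exprs_mass <- transform_asinh(exprs_raw, cofactor = 5)
exprs_flow <- transform_asinh(exprs_raw, cofactor = 150)

# Zero values stay at zero; large values are compressed.
exprs_mass
```

With one of the package datasets, the same helper would typically be applied only to the columns of the expression table that correspond to protein markers, since the choice of cofactor depends on the technology used to generate the data.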
Most of these datasets include a known ground truth, enabling the calculation of statistical performance metrics. The ground truth information consists of reference cell population labels for the clustering datasets, and labels identifying computationally ‘spiked in’ cells for the differential analysis datasets. The datasets without a ground truth instead consist of experimental datasets that contain a known biological signal, which can be used to evaluate methods in qualitative terms; i.e., whether methods can reproduce the known biological result.
Extensive documentation is available via the help files for each dataset, including descriptions of the datasets, details on accessor functions required to access the expression tables and metadata, and links to original sources. In addition, reproducible R scripts demonstrating how the formatted SummarizedExperiment and flowSet objects were generated from the original raw data files are included within the source code of the package. New datasets may be contributed by ourselves or other authors by providing (i) formatted SummarizedExperiment and flowSet objects containing the data as well as all necessary metadata, (ii) reproducible R scripts showing how the formatted objects were generated from the original raw data files, and (iii) comprehensive documentation.
The HDCytoData package can be installed by following standard Bioconductor package installation procedures. All datasets listed in Table 1 and Table 2 are available in Bioconductor version 3.10 and above. Minimum system requirements include a recent version of R (3.6 or later; this paper was prepared using R version 3.6.1), on a Mac, Windows, or Linux system. Example installation code is shown below.
# install BiocManager (only if it is not already installed)
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
# install HDCytoData package
BiocManager::install("HDCytoData")
Once the HDCytoData package is installed, the datasets can be downloaded from ExperimentHub and loaded directly into an R session using only a few lines of R code. This can be done by either (i) referring to named functions for each dataset, or (ii) creating an ExperimentHub instance and referring to the dataset IDs. Example code for each option for one of the datasets is shown below. Note that each dataset is available in both SummarizedExperiment and flowSet formats. After an object has been downloaded, the ExperimentHub client stores it in a local cache for faster retrieval. For more details on accessing ExperimentHub resources, refer to the ExperimentHub vignette available from Bioconductor.
# load HDCytoData package
library(HDCytoData)
# option 1: load datasets using named functions
d_SE <- Bodenmiller_BCR_XL_SE()
d_flowSet <- Bodenmiller_BCR_XL_flowSet()
# option 2: load datasets by creating ExperimentHub instance
ehub <- ExperimentHub()
query(ehub, "HDCytoData")
d_SE <- ehub[["EH2254"]]
d_flowSet <- ehub[["EH2255"]]
Once the datasets have been downloaded and loaded, they are available to the user as R objects within the R session. They can then be inspected and manipulated using standard accessor and subsetting functions (for either the SummarizedExperiment or flowSet object class). Example code to inspect a SummarizedExperiment is displayed below. For more details on how to load and inspect datasets, including the expected output from each function shown here, refer to the HDCytoData vignette available from Bioconductor.
# inspect SummarizedExperiment object
d_SE
assays(d_SE)
rowData(d_SE)
colData(d_SE)
metadata(d_SE)
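To illustrate the kind of subsetting these accessors allow without downloading any data, the sketch below builds a small toy SummarizedExperiment with the layout described earlier (cells in rows, markers in columns) and extracts one sample and the cell type markers. The field names `sample_id`, `population_id`, `marker_name`, and `marker_class`, and all values, are assumptions chosen for illustration; the help file for each dataset lists the fields it actually contains.

```r
library(SummarizedExperiment)

# Toy data: 6 cells (rows) x 3 markers (columns); all values are made up.
exprs <- matrix(
  1:18, nrow = 6,
  dimnames = list(NULL, c("CD3", "CD4", "pS6"))
)

# Row metadata describes cells; column metadata describes markers.
row_data <- DataFrame(
  sample_id = rep(c("sample1", "sample2"), each = 3),
  population_id = c("T", "T", "B", "B", "T", "B")
)
col_data <- DataFrame(
  marker_name = colnames(exprs),
  marker_class = c("type", "type", "state"),
  row.names = colnames(exprs)
)

toy_SE <- SummarizedExperiment(
  assays = list(exprs = exprs),
  rowData = row_data,
  colData = col_data
)

# Subset to cells from one sample and to cell type markers only.
keep_cells <- rowData(toy_SE)$sample_id == "sample1"
keep_markers <- colData(toy_SE)$marker_class == "type"
sub_SE <- toy_SE[keep_cells, keep_markers]

dim(sub_SE)                           # 3 cells, 2 markers
table(rowData(sub_SE)$population_id)  # cells per reference population
head(assay(sub_SE, "exprs"))          # expression values as a plain matrix
```

Subsetting the object rather than the extracted matrix keeps the row and column metadata aligned with the expression values.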
Documentation describing each dataset is available in the help files for the objects, which can be accessed using the standard R help interface, as shown below.
# display documentation (help files)
?Bodenmiller_BCR_XL
help(Bodenmiller_BCR_XL)
The datasets currently included in the HDCytoData package (Table 1 and Table 2) can be used to benchmark methods for either (i) clustering or (ii) differential analyses. In addition, these datasets may be useful for other activities such as teaching, examples, and tutorials (e.g., demonstrating how to use a new computational tool).
For benchmarks using the clustering datasets (Table 1), performance can be evaluated by calculating metrics such as the mean F1 score or adjusted Rand index, which measure the similarity between two sets of cell labels (i.e., the cluster labels and the ground truth reference cell population labels) [1]. For examples (including reproducible R code), see the evaluations in our previous study [4]. An additional visual example is displayed in Figure 1, which compares the performance of three different dimensionality reduction algorithms (principal component analysis [PCA], t-distributed stochastic neighbor embedding [tSNE] [22,23], and uniform manifold approximation and projection [UMAP] [24,25]) in visually separating the known cell populations in the Levine_32dim dataset (see Table 1). R code to reproduce Figure 1 using data downloaded from the HDCytoData package is available at http://github.com/lmweber/HDCytoData-example.
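The two metrics named above can be sketched in base R from the contingency table of reference labels against cluster labels. The function names and toy labels below are assumptions for illustration. The F1 calculation simply assigns each reference population its best-scoring cluster, which is a simplification of this sketch and may differ from the matching procedure used in the referenced evaluations.

```r
# Adjusted Rand index computed from the contingency table of two labelings.
adjusted_rand_index <- function(truth, cluster) {
  tab <- table(truth, cluster)
  sum_cells <- sum(choose(tab, 2))
  sum_rows <- sum(choose(rowSums(tab), 2))
  sum_cols <- sum(choose(colSums(tab), 2))
  expected <- sum_rows * sum_cols / choose(sum(tab), 2)
  (sum_cells - expected) / (0.5 * (sum_rows + sum_cols) - expected)
}

# Mean F1 score: for each reference population, take the cluster with the
# highest F1 score, then average over populations (simplified matching).
mean_f1 <- function(truth, cluster) {
  tab <- table(truth, cluster)
  precision <- sweep(tab, 2, colSums(tab), "/")
  recall <- sweep(tab, 1, rowSums(tab), "/")
  f1 <- 2 * precision * recall / (precision + recall)
  f1[is.nan(f1)] <- 0  # population and cluster share no cells
  mean(apply(f1, 1, max))
}

# Toy labels (made up): reference populations and two candidate clusterings.
truth <- c("A", "A", "A", "B", "B", "B")
perfect <- c(2, 2, 2, 1, 1, 1)    # same partition, different label names
imperfect <- c(1, 1, 2, 2, 2, 2)  # one cell of population A misassigned

adjusted_rand_index(truth, perfect)
mean_f1(truth, perfect)
adjusted_rand_index(truth, imperfect)
mean_f1(truth, imperfect)
```

Both metrics depend only on how cells are partitioned, not on the names given to clusters, so a clustering that reproduces the reference populations under different labels still scores 1.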
For benchmarks using the differential analysis datasets (Table 2), methods can be evaluated by their ability to recover the known differential signals, either in quantitative terms using the ground truth spike-in cell labels (for the semi-simulated datasets), or in qualitative terms (for the experimental datasets). The differential signals consist of either differential abundance of cell populations, or differential states within cell populations (i.e., differential expression of additional functional markers within cell populations), providing conceptually distinct differential analysis tasks. For examples (including reproducible R code), see the evaluations in our previous study [5].
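As one possible way to make the quantitative evaluation concrete, the base R sketch below computes the true positive rate and false discovery rate at the cell level from spike-in labels. It assumes that each cell has already been assigned the adjusted p-value of the cluster it belongs to. The input vectors, the threshold, and the function name `evaluate_detection` are hypothetical and are not taken from the package or from the referenced study.

```r
# Hypothetical per-cell inputs (all values made up for illustration):
# - spikein: TRUE for computationally spiked-in cells (ground truth)
# - p_adj: adjusted p-value of the cluster each cell belongs to
spikein <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)
p_adj <- c(0.01, 0.02, 0.40, 0.03, 0.50, 0.80, 0.90, 0.95)

# Cells in clusters with p_adj at or below the threshold count as detected.
evaluate_detection <- function(spikein, p_adj, threshold = 0.1) {
  detected <- p_adj <= threshold
  tp <- sum(detected & spikein)
  fp <- sum(detected & !spikein)
  c(
    TPR = tp / sum(spikein),
    FDR = if (tp + fp > 0) fp / (tp + fp) else 0
  )
}

evaluate_detection(spikein, p_adj, threshold = 0.1)
```

Repeating the calculation over a range of thresholds gives the points needed for a performance curve comparing several methods on the same dataset.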
The HDCytoData package is an extensible resource providing streamlined access to a number of publicly available benchmark datasets used in our previous work on high-dimensional cytometry data analysis. Datasets are provided in standard Bioconductor object formats, and are hosted on Bioconductor’s ExperimentHub platform. By facilitating access to these datasets, we hope they will be useful for other researchers interested in designing rigorous benchmarks for method development or other computational analyses, as well as other activities such as teaching, examples, and tutorials.
All data underlying the results are available as part of the article and no additional source data are required.
Software available from: http://bioconductor.org/packages/HDCytoData
Source code available from: https://github.com/lmweber/HDCytoData
Archived source code at time of publication: https://doi.org/10.5281/zenodo.3362847 [26]
Licence: MIT License
LMW was supported by a Forschungskredit (Candoc) grant from the University of Zurich [FK-17-100].
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
The authors thank Mark D. Robinson (University of Zurich, Switzerland) for supervising the projects where these datasets were previously used for benchmarking, and for feedback on the manuscript; and Lori Shepherd (Bioconductor Core Team and Roswell Park Cancer Institute, Buffalo, NY, USA) for assistance in making the datasets available through Bioconductor’s ExperimentHub.
Is the rationale for developing the new software tool clearly explained?
Yes
Is the description of the software tool technically sound?
Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?
Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?
Partly
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?
Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Computational biology, method development, research software engineering.
Is the rationale for developing the new software tool clearly explained?
Partly
Is the description of the software tool technically sound?
Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?
Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?
Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?
Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: statistics, high throughput genomics, transcriptomics, R software, high-dimensional data analysis
Invited reviewers: 2
Version 2 (revision): 04 Dec 19
Version 1: 19 Aug 19