Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
bioRxiv preprint doi: https://doi.org/10.1101/2022.03.22.485261; this version posted March 23, 2022. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. scMAGS: Marker gene selection from scRNA-seq data for spatial transcriptomics studies Yusuf Baran, Berat Doğan* Department of Biomedical Engineering, Inonu University, Malatya, TURKEY *berat.dogan@inonu.edu.tr Abstract Single-Cell RNA sequencing (scRNA-seq) has provided unprecedented opportunities for exploring gene expression and thus uncovering regulatory relationships between genes at the single cell level. However, scRNA-seq relies on isolating cells from tissues. Thus, the spatial context of the regulatory processes is lost. A recent technological innovation, spatial transcriptomics, allows to measure gene expression while preserving spatial information. A first step in the spatial transcriptomic analysis is to identify the cell type which requires a careful selection of cell-specific marker genes. For this purpose, currently scRNA-seq data is used to select limited number of marker genes from among all genes that distinguish cell types from each other. This study proposes scMAGS (single-cell MArker Gene Selection), a new approach for marker gene selection from scRNA-seq data for spatial transcriptomics studies. scMAGS uses a filtering step in which the candidate genes are extracted prior to the marker gene selection step. For the selection of marker genes, cluster validity indices, Silhouette index or Calinski-Harabasz index (for large datasets) are utilized. Experimental results showed that, in comparison to the existing methods, scMAGS is scalable, fast and accurate. Even for the large datasets with millions of cells, scMAGS could find the required number of marker genes in a reasonable amount of time with less memory requirements. Keywords: Marker gene selection, scRNA-seq, spatial transcriptomics Introduction Organ systems consist of different cellular subpopulations. The location of these cellular subpopulations within the tissue is closely related to the function of the cells [1]. In recent years, scRNA-seq technology has allowed us to measure RNA levels (transcriptomics) of single cells which provides valuable insights into our understanding of cell types and functions. However, scRNA-seq experiments require isolation of single cells from tissue and this necessity leads to the loss of the spatial position of the cells in the tissue and their spatial proximity to each other. Recent studies show that cell-cell communication (CCC) regulated by biochemical signals is an important aspect of tissue structure and function, regulates individual cellular processes bioRxiv preprint doi: https://doi.org/10.1101/2022.03.22.485261; this version posted March 23, 2022. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. and is effective in intercellular interaction [2]. Knowing the spatial positions of cells in the tissue is critical to understanding normal development and disease pathology. Spatial transcriptomics offers a solution to this problem by providing information about the location of cell types identified by the mRNA readouts in tissue. By combining spatial information with scRNA-seq data, several methods have been developed to measure gene activity in a tissue sample and to map where the activity is occurring [3-10]. Current spatial transcriptomics technologies are primarily categorized as sequencing-based and imaging-based methods [11-17]. Fundamental aspects of these spatial transcriptomics technologies can vary widely, both in terms of the number of genes that can be probed and the size of tissue that can be assayed [18]. For instance, the number of genes that can be probed in single-molecule fluorescence in situ hybridization (smFISH), one of the imagingbased methods, is experimentally constrained by the product of the number of fluorescent channels and the hybridization cycles [19]. Currently, state-of-the-art smFISH methods use around 40 genes [20]. Thus, using labeled scRNA-seq data to optimally select a limited number of marker genes that distinguish cell types from among all genes is a combinatorially difficult problem. In literature, recently a number methods have been proposed for selection of marker genes from scRNA-seq data to distinguish cell types from each other. One of these methods is the COMET [21] in which k marker genes are selected and sorted by one-vs-all methodology. However, the complexity of this method is Gk, where G represents the number of genes. Thus, for even k = 4, it becomes quite infeasible to select markers. Indeed, the COMET provides at most 4 marker genes. In addition, one-vs-all approach cannot offer a solution to the problem when the number of cell types is greater than the number of markers [22]. In a recent study, authors proposed the scGeneFit for optimal selection of marker genes [20]. scGeneFit selects gene markers that jointly optimize cell label recovery using label-aware compressive classification methods. However, this method only focuses on the selection of markers that correctly separates cells with different labels from each other. This may lead a high classification accuracy with the selected subset of markers but not necessarily provide the ideal set of marker genes for spatial transcriptomics studies. An ideal marker gene must be highly expressed in a certain cell type and lowly expressed (ideally zero expression) in all other cell types. A recent study [23] showed that, scGeneFit selects markers some of which are highly expressed in a number of cell types which is not suitable for spatial transcriptomics studies. Selecting the proper marker genes require a careful filtering process. In [23], authors filter out the lowly and highly expressed genes so that only the genes those are expressed in greater than 30% of the classes of interest and in less than 75% of cells with more than 50% of the classes of interest are retained. The filtered genes are then ranked according to an bioRxiv preprint doi: https://doi.org/10.1101/2022.03.22.485261; this version posted March 23, 2022. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. ensemble learning model or a deep neural network, generating a final set of markers for each cell type. The overall framework is called as SMaSH (Scalable Marker (gene) Signal Hunter) which is publicly available as a python package. Although, SMaSH could select appropriate marker genes for some datasets, it has still some problems for most of the datasets. SMaSH rely on supervised selection of marker genes by a number of classification methods. Therefore, in cases where the number of samples is not sufficient for a certain type of cell, the model used in the training phase of SMaSH could fail to provide the appropriate markers. Additionally, as in the scGeneFit, a high classification accuracy does not necessarily mean the algorithm provides the appropriate markers for the spatial transcriptomics studies. In cases where a specific gene is differentially expressed among different cell types is not a challenge for nonlinear classification algorithms to provide a high classification accuracy. Indeed, the gene filtering approach used in the SMaSH lacks foresight for such scenarios. Therefore, for a number of datasets, SMaSH selects marker genes that are expressed in a number of cell types which is not appropriate for spatial transcriptomics studies. Another shortcoming of the SMaSH package is that it cannot work with sparse matrices. Recently, scRNA-seq experiments are conducted for millions of cells and the generated massive data need to be stored in more efficient formats such as the sparse matrix format. This shortcoming limits the usability of the method for large-scale scRNA-seq data. In a more recent study, authors proposed the COSine similarity-based marker Gene identification (COSG), to select the cell specific marker genes from scRNA-seq, scATAC-seq and spatially resolved transcriptomics data [24]. In comparison to the previous studies, COSG is fast and scalable for ultra-large datasets of million-scale cells. However, it has a significant memory requirement and for some cases the cell-specific marker genes found by the COSG are shown to be expressed not only in the target cell type but also in other cell types. This study proposes scMAGS (single-cell MArker Gene Selection), a new approach for marker gene selection for spatial transcriptomics studies. scMAGS uses a filtering step in which the candidate genes are extracted prior to the marker gene selection step. For the selection of marker genes, cluster validity indices, Silhouette index or Calinski-Harabasz index (for large datasets) are utilized. Comparison with the existing methods indicated that, scMAGS selects markers genes that are exclusive to each cell type such that, the corresponding markers genes are highly expressed in a specific cell type while lowly expressed (either zero expression) in other cell types which is desired for the spatial transcriptomics studies. Moreover, experiments showed that the scMAGS is computationally efficient and requires less memory and execution time. Even for the large datasets with millions of cells, scMAGS could find the marker genes in a reasonable amount of time with less memory requirements. bioRxiv preprint doi: https://doi.org/10.1101/2022.03.22.485261; this version posted March 23, 2022. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. Materials and Methods Datasets The input of the scMAGS is a scRNA-seq expression data with cells in rows and genes in columns. Expression data in a sparse matrix form is also allowed. The cell labels must be provided as a separate input. Table S1 lists the details of nineteen datasets used to evaluate the performance of the method. All datasets were obtained from publicly available databases. Filtering candidate genes from the log-transformed scRNA-seq data Gene filtering is one of the most important features that distinguishes the scMAGS from other methods. When selecting marker genes for each cell type, it is not necessary to search for all genes. For the large scale scRNA-seq datasets, the execution of marker selection with all genes can cause computational difficulties and requires high memory. scMAGS identifies candidate marker genes for all cell types with a cluster-specific gene filtering step, and marker selection is carried out on the identified candidate genes. This reduces the memory usage and thus the computational cost. The following criteria should be considered when filtering the marker genes for spatial transcriptomic probes: i) High expression rate in the specific cell type ii) Low (ideally zero) expression rate in the other cell types iii) High average expression value in the specific cell type iv) Low average expression value (ideally zero) in other cell types. A classic scRNA-seq experiment usually contains more than 20,000 genes. Elimination of genes that do not differentiate between cell types is crucial for the efficiency and scalability of the algorithm. A typical drop-based (drop-seq) scRNA-seq data may contain up to 90% zero values in the expression matrix [25]⁠. The high number of genes can be greatly reduced by filtering out genes that are not expressed in more than a few cells or are expressed at the same levels in all cells, thus do not provide information about cellular heterogeneity [26]⁠. scMAGS first calculates the expression rates of all genes in all cell types. As a preprocessing step prior to filtering, it removes genes with less than 20% expression in all cell types. After eliminating genes that do not inform about cellular heterogeneity, the next step is normalization. Count values are converted to expression levels by the normalization process. The count matrix should be normalized to reduce the bias of experimental effects [27]. The number of reads for a gene in each cell is expected to be proportional to the gene-specific expression level and cell-specific scaling factors [28]. However, even if the cells are identical, the count depths may be different for each cell. Difficulty in preparing libraries from cells bioRxiv preprint doi: https://doi.org/10.1101/2022.03.22.485261; this version posted March 23, 2022. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. containing minimal mRNA material, technical bias in cDNA capture or PCR amplification can cause this problem. Figure 1. A step-by-step description of the gene filtering process providing the candidate marker genes for gene selection step. Each four steps are repeated for each cluster to find the cluster-specific candidate marker genes. The normalization step scales the count data to obtain accurate gene expression values across cells, in other words, to bring all cells under the same conditions [28]⁠. Without normalization, incorrect marker selection can be made due to technical biases. Therefore, before gene filtering, log(1+x) transform is applied to the entire count matrix. log(1+x) transform reduces the skewness in data. Let 𝑋 ∈ 𝑅 , 𝑋 = {𝑋 , 𝑋 , … , 𝑋 } be the resulting preprocessed and log(1+x) transformed expression matrix with n cells and m genes. 𝑋 could be considered as a combination of 𝑘 matrices where 𝑘 represents the number of different cell types or states. Each cell type or state bioRxiv preprint doi: https://doi.org/10.1101/2022.03.22.485261; this version posted March 23, 2022. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. could be considered as a separate cluster. For computational efficiency, gene filtering is performed for each cluster separately and then the marker gene selection process is carried out on these candidate genes for each cluster. For cluster-specific gene filtering, firstly, withincluster expression rates and within-cluster average expression values of all genes are computed over the 𝑋 matrix. Let 𝐴 ∈ 𝑅 be the expression rate matrix computed from 𝑋. For each gene at each cluster, the expression rate is computed by dividing the number of cells with non-zero expression to the total number of cells within that particular cluster. Let 𝐵 ∈ 𝑅 be the average expression matrix computed from 𝑋. For each gene at each cluster, the average expression value is computed by averaging the expression values of that particular gene at that particular cluster. With the help of 𝐴 and 𝐵 matrices, cluster-specific candidate marker genes are found by following the below steps for each cluster 𝑘. A step-by-step description of the candidate marker gene filtering process is also shown in Figure 1. 1) By using the matrix 𝐴, for cluster 𝑘, find the genes those have an expression rate above a threshold. For each cluster 𝑘, the threshold is calculated with Eq.1. In Eq.1, 𝑄 represents the 75th percentile and 𝐼𝑄𝑅 represents the interquartile range (difference between 75th and 25th percentiles). (1) 𝑡ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑 = 0.5 × (2 × 𝑄 + 1.5 × 𝐼𝑄𝑅) 2) For cluster k, form the matrix 𝐶 ∈ 𝑅 from matrix 𝐴 and form the matrix 𝐷 ∈ 𝑅 from matrix 𝐵, by using the genes those have an expression rate above threshold. Here, s represents the number of genes that are above the threshold. 3) From matrix 𝐷, find the genes with maximum average expression in cluster 𝑘, and form the matrix 𝐹 ∈ 𝑅 . Here, 𝑙 represents the number of genes with maximum average expression in cluster k. By using the same genes, form the matrix 𝐸 ∈ 𝑅 from matrix 𝐶. 4) For each gene 𝑔, compute a dispersion metric. To do this, by using matrices 𝐸 and 𝐹 find the following constants. a: average expression of gene 𝑔 in cluster 𝑘 (from matrix 𝐹) b: maximum of average expression values of gene 𝑔 in other clusters (from matrix 𝐹) c: expression rate of gene 𝑔 in cluster 𝑘 (from matrix 𝐸) d: average expression rate of gene 𝑔 in other clusters (from matrix 𝐸) e: constant that adjusts the importance of expression rate for gene 𝑔 By using the above defined constant values, for each gene 𝑔 the dispersion metric is computed as in Eq.2. Then, genes are sorted according to their dispersion values and top 𝑡 genes with the highest dispersion values are selected as the candidate marker genes of cluster 𝑘. 𝑑𝑖𝑠𝑝 = (𝑎 − 𝑏) ∗ 𝑐 𝑑 (2) bioRxiv preprint doi: https://doi.org/10.1101/2022.03.22.485261; this version posted March 23, 2022. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. Selecting marker genes from the set of filtered candidate genes To select the marker genes from the cluster-specific candidate marker gene set, either Silhouette index or for large datasets (>100.000 cells), Calinski-Harabasz index is used. The Silhouette index is a measure of how similar a sample is to its own cluster compared to other clusters and it has been previously shown to be the most efficient method to evaluate the clustering validity among thirty different validity indices [29]. The Silhouette index ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. On the other hand, the Calinski-Harabasz index, is the ratio of between-cluster and within-cluster distribution. The score is higher when the clusters are dense and well separated from each other. Because its computationally efficient, it is preferred over Silhouette index in large datasets. Moreover, it was shown to be the second most efficient method to evaluate the clustering validity [29]. Considering the marker gene selection problem, the set of genes that leads to the optimal partition of cells is expected to maximize both the Silhouette or Calinski-Harabasz scores of overall data. Silhouette index To obtain the silhouette score for an object 𝑜 , first the within-cluster average distance 𝑎(𝑖) in cluster 𝐶 is computed as in Eq.3. In Eq.3 𝑛 between 𝑜 ∈ 𝐶 and all other objects 𝑜 represents the number of objects in cluster 𝑘 and 𝑑(𝑜 , 𝑜 ) represents the Euclidean distance between the object 𝑜 and 𝑜 . 𝑎(𝑖) = Then, the average distances 𝛿(𝑜 , 𝐶 1 𝑛 −1 ∈ (3) ) between 𝑜 and all other objects in other clusters 𝐶 is computed as in Eq.4. 𝛿(𝑜 , 𝐶 𝑑(𝑜 , 𝑜 ) )= 1 𝑛 ∈ 𝑑(𝑜 , 𝑜 ) Let 𝑏(𝑖) be the smallest of these average distances: ) (5) 𝑏(𝑖) − 𝑎(𝑖) 𝑚𝑎𝑥(𝑎(𝑖), 𝑏(𝑖)) (6) 𝑏(𝑖) = min 𝛿(𝑜 , 𝐶 For each object 𝑜 , the Silhouette score 𝑠(𝑖) is then computed as in Eq.6. 𝑠(𝑖) = (4) The average of Silhouette score for a given cluster 𝐶 is called as the cluster mean Silhouette and is computed as in Eq.7. 𝜎 = 1 𝑛 ∈ 𝑠(𝑖) (7) bioRxiv preprint doi: https://doi.org/10.1101/2022.03.22.485261; this version posted March 23, 2022. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. Finally, the global Silhouette index of the data is computed by averaging the mean Silhouette score of all clusters as in Eq.8. 𝑆= Calinski-Harabasz index 1 𝐾 (8) 𝜎 To obtain the Calinski-Harabasz score of a labeled dataset containing 𝑘 clusters, first the between-cluster scattering matrix 𝐵 and within-cluster scattering matrix 𝑊 should be computed. The matrices 𝐵 and 𝑊 can be computed with Eqs. 9-10. 𝐵 = 𝑊 = 𝑛 𝜇 −𝜇 𝜇 −𝜇 𝑥−𝜇 𝑥−𝜇 (9) (10) In Eqs.9-10, 𝑘 represents the number of clusters, 𝑛 represents the number of samples in cluster 𝑞, 𝜇 represents the center of the cluster 𝑞 and 𝜇 represents the center of overall data. Having computed the matrices 𝐵 and 𝑊, the Calinski-Harabasz score of a dataset can be computed as in Eq.11. 𝑆= 𝑡𝑟(𝐵 ) 𝑛−𝑘 × 𝑡𝑟(𝑊 ) 𝑘−1 (11) In Eq.11, 𝑛 represents the number of samples in the dataset. Results The performance of the scMAGS is compared to the existing methods scGeneFit, SMaSH and COSG by using the datasets listed in Table S1. The methods scGeneFit and SMaSH were only run for some of the datasets either because the method required more computational resources or it failed to run because of an algorithmic problem. All experiments were carried out on a Dell Precision T7820 workstation equipped with two Intel Xeon® Silver 4210R 2.40GHz 20 core processors and 128 GB RAM. Peak memory requirements and computation time To evaluate memory usage and computation time, all methods were run with their default parameters to select three marker genes for each cell type. The scGeneFit and SMaSH are not able to run with sparce matrices. Therefore, for some of the datasets, the sparce matrices were converted into dense matrices for these two methods. However, for the large datasets the conversion process requires extensive memory requirements (> 400GB RAM) and therefore, it was not possible to evaluate the performance of these methods for the large bioRxiv preprint doi: https://doi.org/10.1101/2022.03.22.485261; this version posted March 23, 2022. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. datasets. In addition, the SMaSH uses classification algorithms which need partitioning of a dataset into training and test sets. However, for some of the datasets, this leads the method to fail during the execution when the size of the test set for a certain cell type is not sufficient to keep the algorithm running. Therefore, the method was not able to run for most of the datasets listed in Table S1. Figure 2. Confusion matrices of the k-Nearest Neighbor (k-NN) classification results obtained by using marker genes found for the Zeisel dataset (a) scMAGS (b) SMaSH (c) scGeneFit (d) COSG On the other hand, because of the high memory requirements, scGeneFit was not able to run with datasets with high number of genes, independent of number of cells. Even if it worked with a dataset, it exhibited the worst performance in terms of computational time. In Table S2, the computation time required to run each method is listed for each dataset. As shown in Table S2, scMAGS and COSG were able to run with all datasets in a quite reasonable time. Although, bioRxiv preprint doi: https://doi.org/10.1101/2022.03.22.485261; this version posted March 23, 2022. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. the COSG is much faster for most of the datasets, the results provided by scMAGS is comparable and it takes only 4.8 minutes to run Bhaduri dataset with 1.3M cells, for which COSG needs 4.6 minutes to find the marker genes. On the other hand, considering the peak memory requirements of the methods (Table S3), the scMAGS performs best and overcomes the COSG with a peak memory requirement of 46949 MB for the Bhaduri dataset which is much smaller than the one required by COSG, 112249 MB. Figure 3. Dotplots of the marker genes found for the Zeisel dataset. Marker genes found by the scGeneFit and SMaSH methods are highly expressed in more than one cell type which make them poor candidates to be considered as marker genes. Expression profiles of selected marker genes Selected marker genes by each method are evaluated based on the criteria provided earlier. A marker gene must have a high expression rate within a specific cell type with an average high expression value. Although, the differential expression of a gene in different cell types provides a high classification accuracy, this does not necessarily mean it is a good candidate to be considered as a marker gene. This fact is shown from the confusion matrices (Figure 2) and dotplots [30] (Figure 3) obtained for each method with the Zeisel dataset, which is one of the datasets that the four methods were able to run commonly. As shown in Figure 2, the kNearest Neighbor (k-NN) classification results obtained by using the selected marker genes for each method do not exhibit a significant difference. However, the dotplots shown in Figure 3 suggests that the genes selected by the scGeneFit and SMaSH methods are highly expressed in more than one cell type which make them poor candidates to be considered as marker genes. On the other hand, the genes selected by the scMAGS and COSG have high bioRxiv preprint doi: https://doi.org/10.1101/2022.03.22.485261; this version posted March 23, 2022. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. expression rate and high average expression values in a specific cell type with low expression rates and low average expression values in other cell types which makes them ideal candidates to be considered as marker genes. This fact was further verified with another dataset that was commonly used to evaluate the performance of the four methods. In Figure 4, the dotplots obtained for the Kleshchevnikov dataset are shown. From this figure, it is obvious that, the scGeneFit and SMaSH methods again fails to find the proper marker genes. Although, some of the genes selected by scMAGS and COSG have also expression in other cell types, in general the selected genes meet the criteria to be considered as marker genes. Figure 4. Dotplots of the marker genes found for the Kleshchevnikov dataset. Marker genes found by the scGeneFit and SMaSH methods are highly expressed in more than one cell type which make them poor candidates to be considered as marker genes. A comparative analysis of the scMAGS and COSG for the other datasets are also provided in Figures S1-S5. As shown from these dotplots, for most of the datasets cell specific marker genes selected by COSG are expressed in a number of cell types which is not desired. The dotplot is one of the standard plots used in different studies to compare the expression rates and average expression values [23-24]. However, because it scales the expression rate and average expression of a gene to [0,1] interval among the different cell types, it might be misleading to solely rely on this plot. Therefore, for each dataset, distribution of the expression rates for the marker genes selected by the scMAGS and COSG methods are also compared to see whether there is a statistically significant difference between two groups. bioRxiv preprint doi: https://doi.org/10.1101/2022.03.22.485261; this version posted March 23, 2022. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. Figure 5. Marker genes selected by the scMAGS and COSG methods are evaluated by statistical comparison of their within-cell expression rates and expression rates in other cell types. A Mann-Whitney U test along with the box-whisker plots of the expression rates validated that, the marker genes selected by the COSG method have high expression rates in other cell types which is not desired. bioRxiv preprint doi: https://doi.org/10.1101/2022.03.22.485261; this version posted March 23, 2022. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. In Figure 5, it is clear that, the marker genes selected by scMAGS have high within-cell expression rates and have low expression rates in other cell types. Note that, as in the scMAGS, for some of the datasets the marker genes selected by COSG also have high withincell expression rates. On the other hand, for a number of datasets the interquartile range of boxplots is quite large, which means COSG selects marker genes not only with high withincell expression rates, but genes with low within-cell expression rates are also selected. Nonetheless, for within-cell expression rates of the marker genes, in general, there is not any statistically significant difference (p > 0.05) between the results obtained by the scMAGS. However, the expression rates of the marker genes selected by COSG are also high in other cell types and, in comparison to scMAGS, for this case the difference is statistically significant (p < 0.05) for most of the datasets. This problem is more obvious and could be better investigated from the t-SNE plots and heatmaps shown in Figure 6 for one of the datasets (Baron Human 2 dataset). In this figure, even the best marker genes (first column of the heatmap for each cell type) selected by COSG for alpha and beta cells have expression in almost all other cell types. Discussion Understanding the spatial distribution of gene expression has become possible with the recently developed technologies. This has contributed to the solution of many fundamental problems of developmental biology. A major step in determining the spatial distribution of the gene expression across various tissues is the selection of accurate marker genes. Here, we present the scMAGS as a scalable, fast and accurate method for the marker gene selection from scRNA-seq data. scMAGS is implemented in Python and is made publicly available at GitHub. The main difference of the scMAGS from the previous studies is its accurate gene filtering step by which the optimal set of candidate genes are extracted from the large-scale scRNA-seq data. The filtering step positively affects the memory requirement of the method and makes it computationally lightweight in comparison to the available methods. Next, by utilizing cluster validity indices (Silhouette and Calinski-Harabasz), scMAGS selects the ideal marker genes from the set of filtered candidate genes. The recently proposed COSG method is fast in comparison to the scMAGS. However, it requires more memory and for most of the datasets it selects genes that are not only expressed in the target cell type but also in other cell types which is not desired for spatial transcriptomics studies. Moreover, for a number of datasets, some of the marker genes selected by the COSG have low within-cell expression rates. bioRxiv preprint doi: https://doi.org/10.1101/2022.03.22.485261; this version posted March 23, 2022. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. Figure 6. (a) Two-dimensional t-SNE plot of the Baron Human 2 dataset. (b) The expression of the best marker gene selected by the scMAGS method at each cell type (c) A heatmap of the marker genes selected by the scMAGS method (d) The expression of the best marker gene selected by the COSG method at each cell type (e) A heatmap of the marker genes selected by the COSG method By default, the scMAGS selects three marker genes. However, this number can be changed based on the user requirements. Selected marker genes can be visualized by a number of plots such as the dotplot, heatmap, t-SNE distribution of marker genes in 2D and the confusion matrix obtained by the k-NN classification result. In conclusion, the unique features of the scMAGS method in comparison to the previously published studies makes it a good alternative for marker gene selection problem. With its outstanding performance, it can serve as a general tool not only for spatial transcriptomics studies but also in other biological studies that require the accurate and fast identification of marker genes. bioRxiv preprint doi: https://doi.org/10.1101/2022.03.22.485261; this version posted March 23, 2022. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. Funding This study was supported by The Scientific and Technological Research Council of Turkey (TÜBİTAK) [120C152 to B.D.]. Code availability The scMAGS is made freely available as a Python package. A CLI (Command Line Interface) has been built internally in the package, so it can be run from the terminal environment without the need for any IDE (Integrated Development Environment). The source code of the scMAGS package is freely available at https://github.com/doganlab/scmags and can be downloaded from the Python Package Directory (PyPI) software repository with the command pip install scmags. References 1. Longo SK, Guo MG, Ji AL, Khavari PA. Integrating single-cell and spatial transcriptomics to elucidate intercellular tissue dynamics. Nat Rev Genet. 2021;22(10):627-644. Doi:10.1038/s41576-021-00370-8 2. Almet AA, Cang Z, Jin S, Nie Q. The landscape of cell-cell communication through singlecell transcriptomics. 15urrO pin Syst Biol. 2021;26:12-23. Doi:10.1016/j.coisb.2021.03.007 3. Zhu Q, Shah S, Dries R, Cai L, Yuan GC. Identification of spatially associated subpopulations by combining scRNAseq and sequential fluorescence in situ hybridization data. Nat Biotechnol. 2018;36(12):1183-1190. Doi:10.1038/nbt.4260 4. Karaiskos N, Wahle P, Alles J, et al. The Drosophila embryo at single-cell transcriptome resolution. Science. 2017;358(6360):194-199. Doi:10.1126/science.aan3235 5. Bageritz J, Willnow P, Valentini E, Leible S, Boutros M, Teleman AA. Gene expression atlas of a developing tissue by single cell expression correlation analysis. Nat Methods. 2019;16(8):750-756. Doi:10.1038/s41592-019-0492-x 6. Achim K, Pettit JB, Saraiva LR, et al. High-throughput spatial mapping of single-cell RNAseq data to tissue of origin. Nat Biotechnol. 2015;33(5):503-509. Doi:10.1038/nbt.3209 7. Satija R, Farrell JA, Gennert D, Schier AF, Regev A. Spatial reconstruction of single-cell gene expression data. Nat Biotechnol. 2015;33(5):495-502. Doi:10.1038/nbt.3192 8. Halpern KB, Shenhav R, Matcovitch-Natan O, et al. Single-cell spatial reconstruction reveals global division of labour in the mammalian liver. Nature. 2017;542(7641):352356. Doi:10.1038/nature21065 9. Butler A, Hoffman P, Smibert P, Papalexi E, Satija R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat Biotechnol. 2018;36(5):411-420. Doi:10.1038/nbt.4096 bioRxiv preprint doi: https://doi.org/10.1101/2022.03.22.485261; this version posted March 23, 2022. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. 10. Stuart T, Butler A, Hoffman P, et al. Comprehensive integration of single-cell data. Cell. 2019;177(7):1888-1902.e21. doi:10.1016/j.cell.2019.05.031 11. Zhuang X. Spatially resolved single-cell genomics and transcriptomics by imaging. Nat Methods. 2021;18(1):18-22. Doi:10.1038/s41592-020-01037-8 12. Larsson L, Frisén J, Lundeberg J. Spatially resolved transcriptomics adds a new dimension to genomics. Nat Methods. 2021;18(1):15-18. Doi:10.1038/s41592-02001038-7 13. Crosetto N, Bienko M, van Oudenaarden A. Spatially resolved transcriptomics and beyond. Nat Rev Genet. 2015;16(1):57-66. Doi:10.1038/nrg3832 14. Moor AE, Itzkovitz S. Spatial transcriptomics: paving the way for tissue-level systems biology. 16urrO pin Biotechnol. 2017;46:126-133. Doi:10.1016/j.copbio.2017.02.004 15. Asp M, Bergenstråhle J, Lundeberg J. Spatially resolved transcriptomes-next generation tools for tissue exploration. Bioessays. 2020;42(10):e1900221. Doi:10.1002/bies.201900221 16. Waylen LN, Nim HT, Martelotto LG, Ramialison M. From whole-mount to single-cell spatial assessment of gene expression in 3D. Commun Biol. 2020;3(1):602. Doi:10.1038/s42003-020-01341-1 17. Teves JM, Won KJ. Mapping cellular coordinates through advances in spatial transcriptomics technology. Mol Cells. 2020;43(7):591-599. Doi:10.14348/molcells.2020.0020 18. Rao A, Barkley D, França GS, Yanai I. Exploring tissue architecture using spatial transcriptomics. Nature. 2021;596(7871):211-220. Doi:10.1038/s41586-021-03634-9 19. Codeluppi S, Borm LE, Zeisel A, et al. Spatial organization of the somatosensory cortex revealed by cyclic smFISH. bioRxiv. Published online 2018. Doi:10.1101/276097 20. Dumitrascu B, Villar S, Mixon DG, Engelhardt BE. Optimal marker gene selection for cell type discrimination in single cell analyses. Nat Commun. 2021;12(1):1186. Doi:10.1038/s41467-021-21453-4 21. Delaney C, Schnell A, Cammarata LV, et al. Combinatorial prediction of marker panels from single-cell transcriptomic data. Mol Syst Biol. 2019;15(10):e9005. Doi:10.15252/msb.20199005 22. Reboredo H, Renna F, Calderbank R, Rodrigues MRD. Bounds on the number of measurements for reliable compressive classification. IEEE Trans Signal Process. 2016;64(22):5778-5793. Doi:10.1109/tsp.2016.2599496 23. Nelson ME, Riva SG, Cvejic A. SMaSH: A scalable, general marker gene identification framework for single-cell RNA sequencing and Spatial Transcriptomics. bioRxiv. Published online 2021. Doi:10.1101/2021.04.08.438978 24. Dai M, Pei X, Wang XJ. Accurate and fast cell marker gene identification with COSG. Brief Bioinform. Published online 2022. Doi:10.1093/bib/bbab579 25. Klein AM, Mazutis L, Akartuna I, et al. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell. 2015;161(5):1187-1201. Doi:10.1016/j.cell.2015.04.044 bioRxiv preprint doi: https://doi.org/10.1101/2022.03.22.485261; this version posted March 23, 2022. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. 26. Zheng GXY, Terry JM, Belgrader P, et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun. 2017;8(1):14049. Doi:10.1038/ncomms14049 27. Luecken MD, Theis FJ. Current best practices in single-cell RNA-seq analysis: a tutorial. Mol Syst Biol. 2019;15(6):e8746. Doi:10.15252/msb.20188746 28. Cao J, Spielmann M, Qiu X, et al. The single-cell transcriptional landscape of mammalian organogenesis. Nature. 2019;566(7745):496-502. Doi:10.1038/s41586-019-0969-x 29. Arbelaitz O, Gurrutxaga I, Muguerza J, Pérez JM, Perona I. An extensive comparative study of cluster validity indices. Pattern Recognit. 2013;46(1):243-256. Doi:10.1016/j.patcog.2012.07.021 30. Wolf FA, Angerer P, Theis FJ. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 2018;19(1). doi:10.1186/s13059-017-1382-0