Abstract
Single-cell RNA sequencing (scRNA-seq) technology has enabled the biological research community to explore gene expression at single-cell resolution. By studying differences in gene expression, it is possible to differentiate cell clusters and cell types within tissues. One of the major challenges in an scRNA-seq study is feature selection in high-dimensional data. Several statistical and machine learning algorithms are available to solve this problem, but their performance across data sets has not been systematically compared. In this research, we benchmark different penalized regression methods that are suitable for scRNA-seq data. Results from four different scRNA-seq data sets show that Sparse Group Lasso (SGL), as implemented by the SGL package in R, performs better than the other methods in terms of area under the receiver operating characteristic curve (AUC). The computation time of the different algorithms varies between data sets, with SGL having the lowest average computation time. Based on our findings, we propose a new method that applies SGL to a smaller, pre-selected subset of genes to select the differentially expressed genes in scRNA-seq data. Reducing the number of genes before running SGL reduces the memory requirement from 32 GB RAM to 8 GB RAM. The proposed method also demonstrates a consistent improvement in AUC over SGL.
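To make the sparse group lasso penalty concrete, the following is a minimal Python sketch of the penalty and its proximal (shrinkage) step, following the general formulation of Simon et al. (2013). It is illustrative only and is not the code of the SGL package in R or the authors' pipeline. The names sgl_penalty, sgl_prox, lam, alpha and step are assumptions introduced here, and the grouping of genes is supplied by the caller.

```python
import math


def sgl_penalty(beta, groups, lam, alpha):
    """Sparse group lasso penalty for coefficients `beta`.

    (1 - alpha) * lam * sum_g sqrt(p_g) * ||beta_g||_2 + alpha * lam * ||beta||_1
    `groups` is a list of index lists, one per gene group (assumed, non-overlapping).
    """
    l1_term = sum(abs(b) for b in beta)
    group_term = 0.0
    for idx in groups:
        group_norm = math.sqrt(sum(beta[j] ** 2 for j in idx))
        group_term += math.sqrt(len(idx)) * group_norm
    return (1 - alpha) * lam * group_term + alpha * lam * l1_term


def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become zero."""
    return math.copysign(max(abs(z) - t, 0.0), z)


def sgl_prox(beta, groups, lam, alpha, step=1.0):
    """Proximal step of the penalty: lasso shrinkage, then group shrinkage."""
    out = [0.0] * len(beta)
    for idx in groups:
        # Within-group sparsity: element-wise soft-thresholding.
        shrunk = [soft_threshold(beta[j], step * alpha * lam) for j in idx]
        norm = math.sqrt(sum(s * s for s in shrunk))
        # Group-level sparsity: scale the whole group, or zero it out entirely.
        thresh = step * (1 - alpha) * lam * math.sqrt(len(idx))
        scale = max(0.0, 1.0 - thresh / norm) if norm > 0 else 0.0
        for j, s in zip(idx, shrunk):
            out[j] = scale * s
    return out


if __name__ == "__main__":
    beta = [3.0, -4.0, 0.1, -0.2]   # toy coefficients for four genes
    groups = [[0, 1], [2, 3]]       # two hypothetical gene groups
    print(sgl_penalty(beta, groups, lam=1.0, alpha=0.5))
    print(sgl_prox(beta, groups, lam=1.0, alpha=0.5))
```

In this toy call the second group has only small coefficients, so the proximal step sets the whole group to zero, while the first group is shrunk but retained. This is the mechanism by which the penalty can discard entire groups of genes and still select individual genes within the groups it keeps.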
Acknowledgements
The authors would like to acknowledge the funding for this research from:
1. TRU Internal Research Fund (IRF) awarded to Dr. Jabed Tomal, Department of Mathematics and Statistics, Thompson Rivers University and Dr. Yan Yan, Department of Computing Science, Thompson Rivers University.
2. Natural Sciences and Engineering Research Council of Canada (NSERC) awarded to Dr. Jabed Tomal, Department of Mathematics and Statistics, Thompson Rivers University and Dr. Yan Yan, Department of Computing Science, Thompson Rivers University.
The authors also acknowledge Compute Canada for hosting the 32 GB Linux remote server that was used for computation in this research.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Puliparambil, B.S., Tomal, J., Yan, Y. (2022). Benchmarking Penalized Regression Methods in Machine Learning for Single Cell RNA Sequencing Data. In: Jin, L., Durand, D. (eds) Comparative Genomics. RECOMB-CG 2022. Lecture Notes in Computer Science, vol 13234. Springer, Cham. https://doi.org/10.1007/978-3-031-06220-9_17
DOI: https://doi.org/10.1007/978-3-031-06220-9_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-06219-3
Online ISBN: 978-3-031-06220-9
eBook Packages: Computer Science, Computer Science (R0)