Benchmarking Penalized Regression Methods in Machine Learning for Single Cell RNA Sequencing Data

  • Conference paper
Comparative Genomics (RECOMB-CG 2022)

Abstract

Single Cell RNA Sequencing (scRNA-seq) technology has enabled the biological research community to explore gene expression at single-cell resolution. By studying differences in gene expression, it is possible to differentiate cell clusters and cell types within tissues. One of the major challenges in a scRNA-seq study is feature selection in high-dimensional data. Several statistical and machine learning algorithms are available to solve this problem, but their performance across data sets has not been systematically compared. In this research, we benchmark different penalized regression methods that are suitable for scRNA-seq data. Results from four different scRNA-seq data sets show that Sparse Group Lasso (SGL), as implemented by the SGL package in R, performs better than the other methods in terms of area under the receiver operating characteristic curve (AUC). The computation time of the algorithms varies between data sets, with SGL having the lowest average computation time. Based on our findings, we propose a new method that applies SGL to a smaller pre-selected subset of genes to select the differentially expressed genes in scRNA-seq data. Reducing the number of genes before running SGL reduces the hardware requirement from 32 GB RAM to 8 GB RAM. The proposed method also demonstrates a consistent improvement in AUC over SGL.
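The abstract does not spell out the SGL objective, so the following is only a minimal sketch of the penalty term in the standard sparse-group lasso formulation (Simon et al.), not the authors' implementation. It shows how one penalty combines sparsity across gene groups with sparsity across individual genes. The function name `sgl_penalty`, the toy coefficients, and the grouping of genes are hypothetical and chosen purely for illustration.

```python
import math


def sgl_penalty(beta, groups, lam, alpha):
    """Sparse-group lasso penalty for a coefficient vector `beta`.

    Assumed (standard) form, not taken from the paper:
        lam * ( alpha * ||beta||_1
                + (1 - alpha) * sum_g sqrt(p_g) * ||beta_g||_2 )

    beta   : list of floats, one coefficient per gene
    groups : list of index lists that partition the genes (hypothetical grouping)
    lam    : overall penalty strength (>= 0)
    alpha  : mixing weight in [0, 1]; 1 gives the lasso, 0 gives the group lasso
    """
    # Gene-level sparsity: sum of absolute coefficients.
    l1_term = sum(abs(b) for b in beta)

    # Group-level sparsity: Euclidean norm of each group, weighted by sqrt(group size).
    group_term = 0.0
    for idx in groups:
        norm = math.sqrt(sum(beta[i] ** 2 for i in idx))
        group_term += math.sqrt(len(idx)) * norm

    return lam * (alpha * l1_term + (1.0 - alpha) * group_term)


# Toy example: four genes in two groups; the second group is entirely zero.
beta = [3.0, 4.0, 0.0, 0.0]
groups = [[0, 1], [2, 3]]

print(sgl_penalty(beta, groups, lam=1.0, alpha=1.0))  # 7.0 (pure lasso: |3| + |4|)
print(sgl_penalty(beta, groups, lam=1.0, alpha=0.0))  # about 7.07 (pure group lasso: sqrt(2) * 5)
```

A group whose coefficients are all zero contributes nothing to either term, which is why the penalty can discard whole groups of genes while still zeroing individual genes inside the groups it keeps.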

Acknowledgements

The authors acknowledge funding for this research from:

1. TRU Internal Research Fund (IRF) awarded to Dr. Jabed Tomal, Department of Mathematics and Statistics, Thompson Rivers University and Dr. Yan Yan, Department of Computing Science, Thompson Rivers University.

2. Natural Sciences and Engineering Research Council of Canada (NSERC) awarded to Dr. Jabed Tomal, Department of Mathematics and Statistics, Thompson Rivers University and Dr. Yan Yan, Department of Computing Science, Thompson Rivers University.

The authors also acknowledge Compute Canada for hosting the 32 GB Linux remote server that was used for computation in this research.

Author information

Corresponding author

Correspondence to Bhavithry Sen Puliparambil.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Puliparambil, B.S., Tomal, J., Yan, Y. (2022). Benchmarking Penalized Regression Methods in Machine Learning for Single Cell RNA Sequencing Data. In: Jin, L., Durand, D. (eds) Comparative Genomics. RECOMB-CG 2022. Lecture Notes in Computer Science, vol 13234. Springer, Cham. https://doi.org/10.1007/978-3-031-06220-9_17

  • DOI: https://doi.org/10.1007/978-3-031-06220-9_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-06219-3

  • Online ISBN: 978-3-031-06220-9

  • eBook Packages: Computer Science, Computer Science (R0)
