Abstract
The single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) technology provides insight into gene regulation and epigenetic heterogeneity at single-cell resolution, but cell annotation from scATAC-seq remains challenging due to the high dimensionality and extreme sparsity of the data. Existing cell annotation methods mostly focus on the cell-by-peak matrix without fully utilizing the underlying genomic sequence. Here we propose a method, SANGO, for accurate single-cell annotation by integrating genome sequences around the accessibility peaks in scATAC-seq data. The genome sequences of peaks are encoded into low-dimensional embeddings, and then iteratively used to reconstruct the peak statistics of cells through a fully connected network. The learned weights are considered as regulatory modes to represent cells, and utilized to align the query cells and the annotated cells in the reference data through a graph transformer network for cell annotations. SANGO was demonstrated to consistently outperform competing methods on 55 paired scATAC-seq datasets across samples, platforms and tissues. SANGO was also shown to be able to detect unknown tumor cells through attention edge weights learned by the graph transformer. Moreover, from the annotated cells, we found cell-type-specific peaks that provide functional insights and biological signals through expression enrichment analysis, cis-regulatory chromatin interaction analysis and motif enrichment analysis.
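The idea that the weights of a fully connected layer can serve as cell representations can be illustrated with a minimal NumPy sketch. It is not the SANGO implementation: the random `peak_emb` matrix stands in for the sequence encoder's output, the accessibility matrix is synthetic, and all names, sizes and the learning rate are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_peaks, n_cells, emb_dim = 200, 30, 8  # toy sizes (assumed)

# Stand-in for the sequence encoder: random vectors play the role of the
# low-dimensional peak embeddings and are kept fixed in this sketch.
peak_emb = rng.normal(size=(n_peaks, emb_dim))

# Synthetic binarized peak-by-cell accessibility matrix (assumed data).
access = (rng.random((n_peaks, n_cells)) < 0.1).astype(float)

# Fully connected layer: one weight column and one bias per cell.
W = np.zeros((emb_dim, n_cells))
b = np.zeros(n_cells)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def bce(p, y, eps=1e-9):
    """Mean binary cross-entropy between predictions p and labels y."""
    return float(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))


loss_start = bce(sigmoid(peak_emb @ W + b), access)

lr = 0.5  # assumed step size for plain gradient descent
for _ in range(200):
    p = sigmoid(peak_emb @ W + b)
    # Gradient of each cell's mean cross-entropy with respect to its logits.
    grad = (p - access) / n_peaks
    W -= lr * peak_emb.T @ grad
    b -= lr * grad.sum(axis=0)

loss_end = bce(sigmoid(peak_emb @ W + b), access)

# Each cell is represented by its learned weight column.
cell_repr = W.T  # shape: (n_cells, emb_dim)
```

In this sketch the rows of `cell_repr` are the per-cell vectors that a downstream model, such as a graph transformer, could consume to align query and reference cells.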
Data availability
We downloaded the raw scATAC-seq matrix data directly from the sources listed below and followed previous works5,47,54 to binarize the matrix. (1) The datasets BoneMarrowA, BoneMarrowB, LungA, LungB, Kidney, Liver, Heart, LargeIntestineA, LargeIntestineB, SmallIntestine, WholeBrainA, WholeBrainB, Cerebellum and PreFrontalCortex are derived from the adult mouse atlas data55 and can be downloaded from either GEO accession number GSE111586 or the website http://atlas.gs.washington.edu/mouse-atac/data/. These datasets were sequenced using the sciATAC-seq technology56 and annotated using the mm9 reference genome. (2) The anterior datasets (MosA1, MosA2), middle datasets (MosM1, MosM2) and posterior datasets (MosP1, MosP2) are from different sections of the secondary motor cortex in the mouse brain57 and can be accessed through GEO accession number GSE126724. These datasets were sequenced using the snATAC-seq technology58 and annotated using the GRCm38 reference genome. (3) The Mouse Brain (10x) dataset and the normal cortex dataset were sequenced using the 10x sequencing technology and annotated using the mm10 reference genome. These two datasets can be downloaded from https://support.10xgenomics.com/single-cell-atac/datasets/1.1.0/atac_v1_adult_brain_fresh_5k and https://www.10xgenomics.com/resources/datasets/fresh-cortex-from-adult-mouse-brain-p-50-1-standard-1-2-0, respectively. (4) The forebrain dataset can be downloaded through GEO accession number GSE100033; it was sequenced using snATAC-seq and annotated using the mm9 reference genome. (5) The PBMC atlas data, BCC-TIL and the basal cell carcinoma sample data were obtained from refs. 3,15. These datasets are annotated using the ENCODE hg19 reference genome and can be accessed through GEO accession number GSE129785 or the download website https://www.synapse.org/#!Synapse:syn52559388/files/.
(6) The PBMC (10x) data were obtained from the official 10x website (https://support.10xgenomics.com/single-cell-multiome-atac-gex/datasets/1.0.0/pbmc_granulocyte_sorted_10k) and are annotated using the GRCh38 reference genome. (7) The raw HHLA data can be obtained from GEO accession number GSE184462 and the processed data can be downloaded from the website https://www.synapse.org/#!Synapse:syn52559388/files/. All of these datasets were preprocessed as described in Dataset preprocessing. Source data are provided with this paper.
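The binarization step mentioned above can be sketched as follows. This is a minimal illustration on a small dense NumPy array with made-up counts; the function name `binarize` is hypothetical, and real scATAC-seq matrices are usually stored in sparse form, which this sketch does not handle.

```python
import numpy as np


def binarize(counts: np.ndarray) -> np.ndarray:
    """Return a 0/1 matrix marking every entry with at least one read."""
    return (counts > 0).astype(np.uint8)


# Toy cell-by-peak count matrix (2 cells x 3 peaks), values are illustrative.
counts = np.array([[0, 2, 0],
                   [1, 0, 5]])
binary = binarize(counts)
```

The input array is left unchanged, so the raw counts remain available for any later step that needs them.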
Code availability
All source code used in our experiments has been deposited at https://github.com/cquzys/SANGO. A Zenodo version is also available at https://doi.org/10.5281/zenodo.10826453 (ref. 59).
References
Buenrostro, J. D. et al. Single-cell chromatin accessibility reveals principles of regulatory variation. Nature 523, 486–490 (2015).
Chen, H. et al. Single-cell trajectories reconstruction, exploration and mapping of omics data with STREAM. Nat. Commun. 10, 1903 (2019).
Satpathy, A. T. et al. Massively parallel single-cell chromatin landscapes of human immune cell development and intratumoral T cell exhaustion. Nat. Biotechnol. 37, 925–936 (2019).
Xiong, L. et al. SCALE method for single-cell ATAC-seq analysis via latent feature extraction. Nat. Commun. 10, 4576 (2019).
Luecken, M. D. et al. Benchmarking atlas-level data integration in single-cell genomics. Nat. Methods 19, 41–50 (2022).
Chen, H. et al. Assessment of computational methods for the analysis of single-cell ATAC-seq data. Genome Biol. 20, 241 (2019).
Granja, J. M. et al. ArchR is a scalable software package for integrative single-cell chromatin accessibility analysis. Nat. Genet. 53, 403–411 (2021).
Pliner, H. A. et al. Cicero predicts cis-regulatory DNA interactions from single-cell chromatin accessibility data. Mol. Cell 71, 858–871 (2018).
Satija, R., Farrell, J. A., Gennert, D., Schier, A. F. & Regev, A. Spatial reconstruction of single-cell gene expression data. Nat. Biotechnol. 33, 495–502 (2015).
Aran, D. et al. Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage. Nat. Immunol. 20, 163–172 (2019).
Tan, Y. & Cahan, P. SingleCellNet: a computational tool to classify single cell RNA-Seq data across platforms and across species. Cell Syst. 9, 207–213 (2019).
Kimmel, J. C. & Kelley, D. R. Semisupervised adversarial neural networks for single-cell classification. Genome Res. 31, 1781–1793 (2021).
Ma, W., Lu, J. & Wu, H. Cellcano: supervised cell type identification for single cell ATAC-seq data. Nat. Commun. 14, 1864 (2023).
Chen, X. et al. Cell type annotation of single-cell chromatin accessibility data via supervised Bayesian embedding. Nat. Mach. Intell. 4, 116–126 (2022).
Jiang, Y. et al. scATAnno: automated cell type annotation for single-cell ATAC sequencing data. Preprint at bioRxiv https://doi.org/10.1101/2023.06.01.543296 (2024).
Srivastava, D. & Mahony, S. Sequence and chromatin determinants of transcription factor binding and the establishment of cell type-specific binding patterns. Biochim. Biophys. Acta 1863, 194443 (2020).
Schwessinger, R., Deasy, J., Woodruff, R. T., Young, S. & Branson, K. M. Single-cell gene expression prediction from DNA sequence at large contexts. Preprint at bioRxiv https://doi.org/10.1101/2023.07.26.550634 (2023).
Yuan, H. & Kelley, D. R. scBasset: sequence-based modeling of single-cell ATAC-seq using convolutional neural networks. Nat. Methods 19, 1088–1096 (2022).
Tayyebi, Z., Pine, A. R. & Leslie, C. S. Scalable sequence-informed embedding of single-cell ATAC-seq data with CellSpace. Preprint at bioRxiv https://doi.org/10.1101/2022.05.02.490310 (2023).
Chen, K., Zhao, H. & Yang, Y. Capturing large genomic contexts for accurately predicting enhancer–promoter interactions. Brief. Bioinform. 23, bbab577 (2022).
O’Shea, K. & Nash, R. An introduction to convolutional neural networks. Preprint at arXiv https://doi.org/10.48550/arXiv.1511.08458 (2015).
Tran, H. T. N. et al. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol. 21, 12 (2020).
Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902.e21 (2019).
Domínguez Conde, C. et al. Cross-tissue immune cell analysis reveals tissue-specific features in humans. Science 376, eabl5197 (2022).
Mackay, M. et al. Selective dysregulation of the FcγIIB receptor on memory B cells in SLE. J. Exp. Med. 203, 2157–2164 (2006).
Sundell, T. et al. Single-cell RNA sequencing analyses: interference by the genes that encode the B-cell and T-cell receptors. Brief. Funct. Genom. 22, 263–273 (2023).
Loo, L. et al. Single-cell transcriptomic analysis of mouse neocortical development. Nat. Commun. 10, 134 (2019).
Ruan, C. & Elyaman, W. A new understanding of TMEM119 as a marker of microglia. Front. Cell. Neurosci. 16, 902372 (2022).
Stuart, T., Srivastava, A., Madad, S., Lareau, C. A. & Satija, R. Single-cell chromatin state analysis with Signac. Nat. Methods 18, 1333–1341 (2021).
Hu, J. et al. SpaGCN: integrating gene expression, spatial location and histology to identify spatial domains and spatially variable genes by graph convolutional network. Nat. Methods 18, 1342–1351 (2021).
Xu, C. et al. Automatic cell type harmonization and integration across Human Cell Atlas datasets. Cell 186, 5876–5891 (2023).
Theodoris, C. V. et al. Transfer learning enables predictions in network biology. Nature 618, 616–624 (2023).
Hao, Z.-Z. et al. Single-cell transcriptomics of adult macaque hippocampus reveals neural precursor cell populations. Nat. Neurosci. 25, 805–817 (2022).
Zappia, L. & Theis, F. J. Over 1000 tools reveal trends in the single-cell RNA-seq analysis landscape. Genome Biol. 22, 301 (2021).
Chen, S., Zhang, B., Chen, X., Zhang, X. & Jiang, R. stPlus: a reference-based method for the accurate enhancement of spatial transcriptomics. Bioinformatics 37, i299–i307 (2021).
Song, Q., Su, J. & Zhang, W. scGCN is a graph convolutional networks algorithm for knowledge transfer in single cell omics. Nat. Commun. 12, 3826 (2021).
Wang, Q. et al. ECA-Net: efficient channel attention for deep convolutional neural networks. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 11534–11542 (IEEE, 2020).
Wu, Q., Zhao, W., Li, Z., Wipf, D. P. & Yan, J. Nodeformer: a scalable graph structure learning transformer for node classification. Adv. Neural Inf. Process. Syst. 35, 27387–27401 (2022).
Rahimi, A. & Recht, B. Random features for large-scale kernel machines. Adv. Neural Inf. Process. Syst. 20, 1177–1184 (2007).
Jang, E., Gu, S. & Poole, B. Categorical reparameterization with Gumbel-Softmax. Preprint at arXiv https://doi.org/10.48550/arXiv.1611.01144 (2016).
Kingma, D. P., Salimans, T. & Welling, M. Variational dropout and the local reparameterization trick. Adv. Neural Inf. Process. Syst. 28, 2575–2583 (2015).
Maddison, C. J., Mnih, A. & Teh, Y. W. The concrete distribution: a continuous relaxation of discrete random variables. Preprint at arXiv https://doi.org/10.48550/arXiv.1611.00712 (2016).
Zeng, Y., Zhou, X., Rao, J., Lu, Y. & Yang, Y. Accurately clustering single-cell RNA-seq data by capturing structural relations between cells through graph convolutional network. In Proc. 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 519–522 (IEEE, 2020).
Zeng, Y., Wei, Z., Pan, Z., Lu, Y. & Yang, Y. A robust and scalable graph neural network for accurate single-cell classification. Brief. Bioinform. 23, bbab570 (2022).
Slowikowski, K., Hu, X. & Raychaudhuri, S. SNPsea: an algorithm to identify cell types, tissues and pathways affected by risk loci. Bioinformatics 30, 2496–2497 (2014).
Su, A. I. et al. A gene atlas of the mouse and human protein-encoding transcriptomes. Proc. Natl Acad. Sci. USA 101, 6062–6067 (2004).
Ma, A. et al. Single-cell biological network inference using a heterogeneous graph transformer. Nat. Commun. 14, 964 (2023).
Zamanighomi, M. et al. Unsupervised clustering and epigenetic classification of single cells. Nat. Commun. 9, 2410 (2018).
Zhang, K. et al. A single-cell atlas of chromatin accessibility in the human genome. Cell 184, 5985–6001 (2021).
Schneider, V. A. et al. Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Res. 27, 849–864 (2017).
Zhao, H. et al. CrossMap: a versatile tool for coordinate conversion between genome assemblies. Bioinformatics 30, 1006–1007 (2014).
Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
Yang, F. et al. scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data. Nat. Mach. Intell. 4, 852–866 (2022).
Wu, K. E., Yost, K. E., Chang, H. Y. & Zou, J. BABEL enables cross-modality translation between multiomic profiles at single-cell resolution. Proc. Natl Acad. Sci. USA 118, e2023070118 (2021).
Cusanovich, D. A. et al. A single-cell atlas of in vivo mammalian chromatin accessibility. Cell 174, 1309–1324 (2018).
Cusanovich, D. A. et al. Multiplex single-cell profiling of chromatin accessibility by combinatorial cellular indexing. Science 348, 910–914 (2015).
Fang, R. et al. Comprehensive analysis of single cell ATAC-seq data with SnapATAC. Nat. Commun. 12, 1337 (2021).
Preissl, S. et al. Single-nucleus analysis of accessible chromatin in developing mouse forebrain reveals cell-type-specific transcriptional regulation. Nat. Neurosci. 21, 432–439 (2018).
Zeng, Y. et al. Deciphering cell types by integrating scATAC-seq data with genome sequences. Zenodo https://doi.org/10.5281/zenodo.10826453 (2024).
Acknowledgements
This study has been supported by the National Key R&D Program of China (2022YFF1203100), the National Natural Science Foundation of China (T2394502), the Research and Development Project of Pazhou Lab (Huangpu) (2023K0606) and the Postdoctoral Fellowship Program of CPSF (GZC20233321).
Author information
Contributions
Y.Y. conceived and supervised the project. Y.Z., M.L. and N.S. developed and implemented the SANGO algorithm. Y.Y., W.Y. and Y.Z. validated the methods and wrote the paper. P.S., J.F. and J.X. conducted the biological analysis. Y.L. and K.C. discussed and performed the rebuttal experiments. All authors read and approved the final paper.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Computational Science thanks Andrea Tangherloni, Guangyu Wang and Hao Wu for their contribution to the peer review of this work. Primary Handling Editor: Kaitlin McCardle, in collaboration with the Nature Computational Science team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Figs. 1–25, Note 1 and References.
Source data
Source Data Fig. 2
A single file containing all source data for Fig. 2.
Source Data Fig. 3
A single file containing all source data for Fig. 3.
Source Data Fig. 4
A single file containing all source data for Fig. 4.
Source Data Fig. 5
A single file containing all source data for Fig. 5.
Source Data Fig. 6
A single file containing all source data for Fig. 6.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zeng, Y., Luo, M., Shangguan, N. et al. Deciphering cell types by integrating scATAC-seq data with genome sequences. Nat Comput Sci 4, 285–298 (2024). https://doi.org/10.1038/s43588-024-00622-7
DOI: https://doi.org/10.1038/s43588-024-00622-7