Discrete latent embedding of single-cell chromatin accessibility sequencing data for uncovering cell heterogeneity


Single-cell epigenomic data are accumulating at an unprecedented pace, but their high dimensionality and sparsity pose substantial challenges to downstream analysis. Although deep learning models, especially variational autoencoders, have been widely used to capture low-dimensional feature embeddings, the prevalent Gaussian assumption agrees poorly with real data, and these models tend to struggle to incorporate reference information from abundant cell atlases. Here we propose CASTLE, a deep generative model based on the vector-quantized variational autoencoder framework, to extract discrete latent embeddings that interpretably characterize single-cell chromatin accessibility sequencing data. We validate the performance and robustness of CASTLE for accurate cell-type identification and reasonable visualization in comparison with state-of-the-art methods. We demonstrate the advantages of CASTLE for effectively incorporating existing massive reference datasets in a weakly supervised or supervised manner. We further demonstrate CASTLE’s capacity for intuitively distilling cell-type-specific feature spectra that quantitatively unveil cell heterogeneity and its biological implications.
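To make the idea of a discrete latent embedding concrete, the following is a minimal sketch of the vector-quantization step that the vector-quantized variational autoencoder framework relies on: each continuous latent vector is replaced by the index of its nearest entry in a codebook. It is an illustration only, not CASTLE's implementation. The function names (`nearest_code`, `quantize_cell`, `code_usage`), the toy codebook and the toy latent vectors are all hypothetical, and summarizing a cell by a histogram of code usage is an assumption made here for illustration.

```python
from collections import Counter


def nearest_code(z, codebook):
    """Index of the codebook vector closest to z (squared Euclidean distance)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sqdist(z, codebook[k]))


def quantize_cell(latents, codebook):
    """Map each continuous latent vector of one cell to its nearest code index."""
    return [nearest_code(z, codebook) for z in latents]


def code_usage(indices, n_codes):
    """Histogram of code usage: a discrete, countable summary of one cell."""
    counts = Counter(indices)
    return [counts.get(k, 0) for k in range(n_codes)]


# Toy codebook with three 2-D code vectors (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
# Toy encoder outputs for one cell: four latent vectors (hypothetical values).
latents = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7], [1.1, 0.8]]

indices = quantize_cell(latents, codebook)
usage = code_usage(indices, len(codebook))
print(indices, usage)
```

Because every latent vector collapses to one of a finite set of codes, cells can be compared by which codes they use, which is what makes a discrete embedding easier to inspect than a continuous Gaussian one.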

Fig. 1: Overview of CASTLE framework.
Fig. 2: Evaluation of CASTLE compared with baseline methods.
Fig. 3: Performance of batch correction for CASTLE compared with baseline methods.
Fig. 4: Robustness analysis for CASTLE compared with baseline methods.
Fig. 5: Performance of reference incorporation for CASTLE compared with other baseline methods.
Fig. 6: Feature spectrum analysis and biological implications of the cell-type-specific peaks identified by CASTLE.

Data availability

The splenocyte dataset, with peaks in the GRCm38/mm10 genome, was downloaded from ArrayExpress under accession no. E-MTAB-6714 (ref. 66). The InSilico dataset, with peaks in the GRCh37/hg19 genome, was constructed by computationally combining six scCAS experiments that were performed individually on different cell lines, and was downloaded from the Gene Expression Omnibus (GEO) under accession no. GSE65360 (ref. 1). The droplet dataset, with peaks in the GRCh37/hg19 genome, measures chromatin accessibility across 136,463 resting and stimulated human bone marrow-derived cells and was downloaded from GEO under accession no. GSE123580 (ref. 25). The mouse chromatin accessibility atlas datasets, with peaks in the GRCm37/mm9 genome, were downloaded from http://atlas.gs.washington.edu/mouse-atac (ref. 37). The human cell atlas of fetal chromatin accessibility, with peaks in the GRCh37/hg19 genome, was downloaded from GEO under accession no. GSE149683 (ref. 26). The brain dataset, with peaks in the GRCm38/mm10 genome, consists of two batches, one assayed by single-nucleus assay for transposase-accessible chromatin using sequencing (snATAC) and one by 10X (refs. 12,32); it was downloaded from GEO under accession no. GSE126724 and from https://support.10xgenomics.com/single-cell-atac/datasets/1.1.0/atac_v1_adult_brain_fresh_5k. The immune dataset, with peaks in the GRCh37/hg19 genome, is derived from human peripheral blood and bone marrow and was downloaded from GEO under accession no. GSE129785 (ref. 56). Source Data are provided with this paper.

Code availability

The CASTLE software, including detailed documentation and a tutorial, is freely available on GitHub (https://github.com/cuixj19/CASTLE). The source code is also available via Zenodo at https://doi.org/10.5281/zenodo.10906304 (ref. 67).


Acknowledgements

This work was supported by the National Key Research and Development Program of China (grant nos. 2021YFF1200902 and 2023YFF1204802 to R.J.), the National Natural Science Foundation of China (grant nos. 62273194 to R.J. and 62203236 to S.C.), the Fundamental Research Funds for the Central Universities, Nankai University (grant no. 63231137 to S.C.), and the Young Elite Scientists Sponsorship Program by CAST (grant no. 2023QNRC001 to S.C.).

Author contributions

R.J. and S.C. conceived the study and supervised the project. X.C. and S.C. designed, implemented and validated CASTLE. Z.L. and Z.G. helped analyze the results. X.C., S.C. and X.C. wrote the paper, with input from all of the authors.

