

CANCER RESEARCH | GENOME AND EPIGENOME

Computational Identification of Preneoplastic Cells Displaying High Stemness and Risk of Cancer Progression
Tianyuan Liu1, Xuan Zhao1, Yuan Lin2,3, Qi Luo4, Shaosen Zhang1, Yiyi Xi1, Yamei Chen1, Lin Lin1, Wenyi Fan1,
Jie Yang1, Yuling Ma1, Alok K. Maity4, Yanyi Huang2,3, Jianbin Wang5, Jiang Chang6, Dongxin Lin1,7,
Andrew E. Teschendorff4,8, and Chen Wu1,7,9,10

ABSTRACT

Evidence points toward the differentiation state of cells as a marker of cancer risk and progression. Measuring the differentiation state of single cells in a preneoplastic population could thus enable novel strategies for early detection and risk prediction. Recent maps of somatic mutagenesis in normal tissues from young healthy individuals have revealed cancer driver mutations, indicating that these do not correlate well with differentiation state and that other molecular events also contribute to cancer development. We hypothesized that the differentiation state of single cells can be measured by estimating the regulatory activity of the transcription factors (TF) that control differentiation within that cell lineage. To this end, we present a novel computational method called CancerStemID that estimates a stemness index of cells from single-cell RNA sequencing data. CancerStemID is validated in two human esophageal squamous cell carcinoma (ESCC) cohorts, demonstrating how it can identify undifferentiated preneoplastic cells whose transcriptomic state is overrepresented in invasive cancer. Spatial transcriptomics and whole-genome bisulfite sequencing demonstrated that differentiation activity of tissue-specific TFs was decreased in cancer cells compared with the basal cell-of-origin layer and established that differentiation state correlated with differential DNA methylation at the promoters of these TFs, independently of underlying NOTCH1 and TP53 mutations. The findings were replicated in a mouse model of ESCC development, and the broad applicability of CancerStemID to other cancer-types was demonstrated. In summary, these data support an epigenetic stem-cell model of oncogenesis and highlight a novel computational strategy to identify stem-like preneoplastic cells that undergo positive selection.

Significance: This study develops a computational strategy to dissect the heterogeneity of differentiation states within a preneoplastic cell population, allowing identification of stem-like cells that may drive cancer progression.

Downloaded from http://aacrjournals.org/cancerres/article-pdf/82/14/2520/3180238/2520.pdf by guest on 25 July 2024

Introduction
A long-held view of oncogenesis is that cancer cells arise from an aberrant dedifferentiated stem-like state (1–3). Such a model is well supported in the context of both pediatric (e.g., Wilms tumors; ref. 4) and adult cancers (1, 5–8), where aberrant or dedifferentiated states like metaplasia or dysplasia often precede tumor development. In addition, there is mounting evidence that the aberrant stem-like state is often associated with irreversible silencing of tissue-specific transcription factors (TF) that are important for specifying and maintaining the normal differentiation state. For instance, the alveolar differentiation factor NKX2-1 in lung cancer (9) or the goblet differentiation factor KLF4 in colon cancer (10) represent tumor suppressors, and in general tissue-specific TFs have been observed to be preferentially silenced in the corresponding cancer-type, suggesting that these non-classical tumor suppressor events may be causally implicated (11, 12). Unlike classical tumor suppressors such as TP53, RB1, or CDKN2A, these tissue-specific TFs do not in general represent hotspots of somatic mutations and genomic deletions in cancer or normal tissue (13–18), with most

1Department of Etiology and Carcinogenesis, National Cancer Center/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China. 2Biomedical Pioneering Innovation Center (BIOPIC), School of Life Sciences, Peking University (PKU), Beijing, China. 3Beijing Advanced Innovation Center for Genomics (ICG), Peking University, Beijing, China. 4CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China. 5School of Life Sciences, Tsinghua-Peking Center for Life Sciences, Tsinghua University, Beijing, China. 6Department of Health Toxicology, Key Laboratory for Environment and Health, School of Public Health, Tongji Medical College, Huazhong University of Sciences and Technology, Wuhan, Hubei, China. 7Collaborative Innovation Center for Cancer Personalized Medicine, Nanjing Medical University, Nanjing, China. 8UCL Cancer Institute, University College London, London, United Kingdom. 9CAMS Oxford Institute (COI), Chinese Academy of Medical Sciences, Beijing, China. 10CAMS key Laboratory of Cancer Genomic Biology, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China.

Note: Supplementary data for this article are available at Cancer Research Online (http://cancerres.aacrjournals.org/).

T. Liu, X. Zhao, Y. Lin, Q. Luo, and S. Zhang contributed equally to this article.

Corresponding Authors: Chen Wu, Department of Etiology and Carcinogenesis, National Cancer Center/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100021, China. Phone: 8601-0877-87395; E-mail: chenwu@cicams.ac.cn; Andrew E. Teschendorff, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China. Phone: 8618-3170-47442; E-mail: andrew@picb.ac.cn; Dongxin Lin, Department of Etiology and Carcinogenesis, National Cancer Center/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100021, China. Phone: 8601-0877-88491; E-mail: lindx@cicams.ac.cn; and Jiang Chang, Department of Health Toxicology, Key Laboratory for Environment and Health, School of Public Health, Tongji Medical College, Huazhong University of Sciences and Technology, Wuhan 430030, Hubei, China. Phone: 8618-6940-68151; E-mail: changjiang815@hust.edu.cn

Cancer Res 2022;82:2520–37
doi: 10.1158/0008-5472.CAN-22-0668
2022 American Association for Cancer Research

AACRJournals.org | 2520
Preneoplastic Cells of High Stemness and Cancer Risk

evidence pointing toward an epigenetic silencing mechanism (11, 19, 20). However, the precise role and timing of these putative silencing/inactivation events in carcinogenesis remain unclear, and have not yet been explored at single-cell resolution.

With single-cell technology (21), it is in principle now possible to explore the heterogeneity of differentiation states within a cell population, including preneoplastic and cancer cells, a critically important task that could help identify the least differentiated and more stem-like cells that are believed to underpin cancer risk and drive cancer progression, thus paving the way for novel cancer risk and early detection strategies. Inferring the differentiation state of individual preneoplastic or cancer cells from single-cell omic data is, however, challenging since traditional differentiation markers may no longer be valid (22). While a number of computational methods for measuring stemness and differentiation state from single-cell RNA sequencing (scRNA-seq) data have been proposed (23–25), each of these methods is based on a measure of global transcriptional entropy that does not directly model the differentiation state in terms of the activity of tissue-specific TFs. Because tissue-specific TFs are the key players controlling the differentiation state of a cell, it seems natural to develop computational methods that can estimate this state from the differentiation activity patterns of these TFs.

Here we present a novel single-cell algorithm called CancerStemID, to explore the hypothesis that preneoplastic cells undergoing positive selection during cancer progression may be identifiable by measuring the differentiation activity of tissue-specific TFs. Specifically, we posited that the number of tissue-specific TFs displaying low differentiation activity in a given preneoplastic cell may be a marker of stemness and cancer risk, reflecting not only the progenitor cell-of-origin, but potentially also an epithelial reprogramming relative to the cell-of-origin. We extensively validate CancerStemID in the context of esophageal squamous cell carcinoma (ESCC). This cancer is the sixth leading cause of cancer-related deaths worldwide and represents a canonical paradigm for stepwise oncogenesis, with well-identifiable precancerous lesions that include low and high-grade intraepithelial neoplasia, collectively known as squamous dysplasia (26–28), making this an ideal model system in which to explore our hypothesis.

Materials and Methods
Human biospecimen and clinical data
This study was conducted in accordance with recognized ethical guidelines. It was approved by the Institutional Review Boards of Cancer Hospital, Chinese Academy of Medical Sciences (20/069–2265). Informed written consent was obtained from each patient, and clinical information was collected from medical records. Human biospecimens were obtained from 14 patients with ESCC recruited between August and October of 2020 at the Linzhou Cancer Hospital and Linzhou Esophageal Cancer Hospital in Henan, China. ESCC tumors, dysplasia (≤2 cm to tumor margin), nontumor tissues (≥5 cm from tumor), and peripheral blood samples were collected during surgical resection with written consent and approval from Institutional Review Boards of Cancer Hospital, Chinese Academy of Medical Sciences (20/069–2265). None of these patients received chemotherapy or radiotherapy before surgery. The pathologic grading of squamous dysplasia and staging of ESCC were independently confirmed by three pathologists according to the World Health Organization Classification of Tumors of the Digestive System, Fifth Edition and the American Joint Committee on Cancer 8th edition. A total of 47 samples were collected, including 8 normal, 10 inflammatory, 6 low-grade intraepithelial neoplasias (LGIN), 9 high-grade intraepithelial neoplasias (HGIN), and 14 invasive cancers (ICA). Medical records were reviewed to collect clinical data from each patient, including age, gender, smoking, and drinking behavior.

Sample handling and tissue processing
Tissue samples were placed in RPMI1640 medium (Corning, catalog no. 10–040-CV) with 20% FBS (Cell Signaling Technologies, catalog no. 30070.03) immediately after surgical resection. Tissue was processed for scRNA-seq following a previously described protocol (26, 29) with a portion being cryosectioned and hematoxylin and eosin stained to confirm the histologic staging. Briefly, tissues were rinsed with cold 10% FBS PBS, cut into small pieces on ice, and digested in RPMI1640 medium containing 2 mg/mL collagenase IV (Gibco, catalog no. 17104–019) and 0.5 mg/mL hyaluronidase (Sigma Aldrich, catalog no. 7326–33–3) for 1 hour at 37 °C. The digested cell suspension was subsequently filtered through a 70-μm cell strainer (Falcon, catalog no. 352350) before centrifuging at 560 × g for 6 minutes at 4 °C. Cells were treated with 2 mL of 1× red blood cell lysis buffer (BD Biosciences, catalog no. 555899) for 5 minutes following centrifuging with the same parameters. The remaining cells were suspended in 50 mL of 1% FBS PBS after washing once with the same medium. The single-cell suspension was stained with 4′,6-diamidino-2-phenylindole (DAPI, Solarbio, catalog no. C0065) prior to FACS on a BD FACSAria II flow cytometer (BD Biosciences) to remove dead cells and debris.

Single-cell RNA sequencing
The number and viability of cells were examined using the cell pellet by Trypan blue staining (20 μL mix of 10 μL suspension and 10 μL 0.4% Trypan solution, Thermo Fisher Scientific, catalog no. 15250061), following centrifuging at 500 × g for 5 minutes at 4 °C immediately after FACS. We targeted approximately 7,000 recovered cells from each channel. scRNA-seq libraries were prepared using Chromium Single Cell 5′ Reagent Kits (V1, 10× Genomics, catalog no. PN-1000006, PN-1000020) and sequencing was accomplished with an Illumina NovaSeq 6000 (Illumina, Inc.) in 2 × 150 bp paired-end mode. Raw sequencing data were processed using the cellranger pipeline (version 2.1.0, 10× Genomics) with default parameters and mapped to the GRCh38 reference genome to generate matrices of gene counts by cell barcodes.

Data preprocessing for cell annotation
Gene count matrices were analyzed with the Seurat package (version 3.1.5; ref. 30) in R (version 3.6.3, The R Foundation). The following quality control criteria were used: nonepithelial cells had to express a minimum of 200 genes with a mitochondrial fraction less than 10%; epithelial cells had to express a minimum of 200 genes with a mitochondrial fraction less than 20%. Suspected doublets were annotated using the DoubletFinder package (version 2.0.3) and removed. We removed ribosomal genes and retained genes that were expressed in at least 0.1% of all cells. Raw unique molecular identifier (UMI) counts were normalized using the SCTransform function with default parameters. A total of 115,930 cells passed quality control and were included in downstream analysis. Batch effect was adjusted by implementing the Harmony package (version 1.0; ref. 31). Dimension reduction was performed using principal-component analysis (PCA) and the optimal number of principal components (PC) was selected using the ElbowPlot function. The same PCs were also applied in cell cluster identification with modularity optimization, using the kNN graph as input. Cell clusters were visualized using the UMAP algorithm (32). With a resolution of 0.3, we obtained nine distinct cell

AACRJournals.org Cancer Res; 82(14) July 15, 2022 2521


Liu et al.

clusters. These clusters were annotated on the basis of the expression of known markers. For epithelial cells, the marker genes included EPCAM, SFN, KRT5, and KRT14, resulting in 5,070 epithelial cells, including 215 from tissue of normal or inflammatory esophageal epithelium, 44 from LGIN, 1,456 from HGIN, and 3,355 from ICA (657, 1,540, 1,137, and 21 for stage I, II, III, and IV, respectively). The mean number of genes detected in each epithelial cell was 2,023 and the average UMI count per cell was 8,202. The mean mitochondrial gene content was 4.3% of all UMI counts.

Processing of epithelial cells from 14 ESCC patients (Cohort 1)
The 5,070 epithelial cells were then rerun through a Seurat analysis at a higher level of stringency, where we only retained cells (i) expressing at least 200 genes, (ii) expressing fewer than 6,000 genes, (iii) with a DoubletScore < 1 using the doubletCells function from the scran R-package (33), and (iv) with a mitochondrial percentage < 5%. This resulted in 3,178 cells: 95 from normal/inflammatory state, 28 from LGIN, 1,053 from HGIN, and 2,002 from cancer. After log-normalization with a scale factor of 10^4, we selected variable features using variance-stabilization. After PCA, we estimated five significant components using the ElbowPlot function. A nearest neighbor graph was constructed using the top five components and clusters were identified at a resolution parameter of 0.1, resulting in 6 epithelial subclusters.

Processing of scRNA-seq data from 60 patients with ESCC (Cohort 2)
Full details of sample collection, tissue dissociation, FACS and scRNA-seq processing of this cohort can be found elsewhere (34). Briefly, for the cell annotation analysis, we removed cells with fewer than 500 detected genes or more than 20% mitochondrial RNA content, and removed genes detected in less than 0.1% of all cells. Out of a total of 208,659 cells, 97,631 CD45− cells passed quality control. After clustering for major CD45− cell types and marker gene detection, 44,730 epithelial cells were identified. This comprised 183 normal epithelial cells and 44,547 cells from patients with ESCC, including 13,041, 14,241, and 17,265 from stage I, II, and III ESCC. On average, the number of genes detected in a single epithelial cell was 3,446, and sequencing depth was 16,442 reads per cell. The average mitochondrial content proportion was 5.9%. For subsequent analyses on the epithelial cells, these were rerun with Seurat at a higher level of stringency using the same parameter choices as for Cohort 1 (notably using a mitochondrial percentage threshold of 5%), resulting in a total of 20,470 epithelial cells (37 normal, 6,362 stage I, 7,000 stage II, and 7,071 stage III).

10× Visium spatial transcriptomic sequencing (Cohort 1)
Esophageal tissues of three patients from Cohort 1 were selected for 10× spatial transcriptomic (ST) sequencing (n = 12). The tissue samples derived from the same patients were embedded in OCT sectioning media in a cryomold on dry ice at −80 °C. Each ST sample was processed into sectioning blocks with corresponding pathologic stages confirmed with hematoxylin and eosin staining. The tissue blocks were cryosectioned into 10-μm thickness and placed onto the 6.5 mm × 6.5 mm capture area of Visium Spatial slides (10× Genomics, PN-2000233, Spatial 3′ v1). The RNA quality of each sample passed quality control with RNA integrity number > 7.3, and a tissue optimization experiment identified 24 minutes as the optimum permeabilization time for human esophageal tissue. The spatial gene expression detection experiment was performed following the manufacturer’s instructions. Three slides were sequenced at recommended depths with an Illumina NovaSeq 6000 (Illumina, Inc.). Tissue spots were visually inspected and annotated by aligning the scanned histologic images using Loupe Browser (version 4.1.0). Raw ST sequences were mapped to the hg38 genome using Spaceranger (version 1.0.0), and reached an average of 202,743 reads per tissue-covered spot (mean reads of 231,137, 186,996, and 190,097 for LZE7, LZE8, and LZE22 tissue blocks, respectively). The ST sequencing data encompassed a total of 8,679 spots with an average of 3,322 genes detected. A total of 4,208 epithelium/carcinoma spots (Epi spots) were manually selected with Loupe Browser, including 477 NOR, 945 INF, 243 LGIN, 527 HGIN, and 2,016 ICA Epi spots. Specifically, NOR/INF basal Epi spots were recognized as located in basal layers of epithelium or near papillae based on histologic characteristics (n = 621). Each Epi spot covered an area of 55 μm diameter encompassing 10–20 epithelial cells. ST data were analyzed with Seurat following standard procedure with the same quality control, standardization, and clustering parameters, as mentioned above. Briefly, raw data were imported into R using the Seurat Load10X_Spatial function. Low-quality Epi spots were removed if the number of detected genes was fewer than 200 and mitochondrial content was more than 10%. The mean number of genes detected in each Epi spot was 4,238 and the average UMI count per cell was 16,780. The mean mitochondrial gene content was 2.13% of all UMI counts. After batch removal using RunHarmony and gene expression normalization using SCTransform, Epi spots were clustered into nine clusters at a resolution of 1.2. The top genes of NOR/INF basal clusters again confirmed the robust annotation by histology, including canonical esophageal basal epithelial cell markers such as ADH7, KRT15, and ALDH3A1.

scRNA-seq data of epithelial cells from the multistage ESCC mouse model
This 10× scRNA-seq dataset was first described and presented in Yao and colleagues (26). Briefly, processed gene UMI count matrices and cell annotations of esophageal epithelial cells were obtained from the previous publication. Among 1,760 epithelial cells, there were 20 normal epithelial cells (NOR; before 4NQO-treatment), 372 of inflammatory state (INF), 383 of hyperplasia (HYP), 187 of dysplasia (DYS), 163 of carcinoma in situ (CIS), and 635 from ICA. The normalization, dimension reduction, and clustering procedure was reproduced following the methods described in Yao and colleagues (26). The mean UMI count is 22,344 and the mean number of genes is 3,994, with an average of 3.2% mitochondrial genes. For the epithelial-specific analyses, we selected the epithelial cells as annotated by Yao and colleagues and reran the Seurat pipeline with the same parameter choices as in Yao and colleagues. Following PCA, we used the ElbowPlot function to identify 7 significant PCs, which were used as input for the FindNeighbors and FindClusters functions at a resolution of 0.4, which resulted in six clusters. The RunTSNE function with the top 7 PCs was then implemented for visualization.

Construction of the esophageal-specific regulatory network
The procedure for constructing a tissue-specific regulatory network is described in detail in our previous publications (12, 35). Briefly, the algorithm called SCIRA derives, for a given tissue-type, a number of tissue-specific TFs and associated TF-regulons. The TFs are identified by overexpression analysis comparing the given tissue-type to all other tissue-types, using the large multi-tissue RNA-seq expression dataset from the Genotype-Tissue Expression (GTEX) project, encompassing 8,555 samples from 29 tissue types. To avoid confounding by immune-and-endothelial cell infiltration in these


bulk-tissue samples, the overexpression analysis is carried out again by comparing the given tissue to blood and separately again to blood vessels, which we found to be a very effective procedure (35). Tissue-specific TFs are then defined as those overexpressed in the given tissue (in our case esophagus, n = 686 samples) compared with all other tissue types (n = 7,869) as well as when compared with blood (n = 511) and blood vessels (n = 689). Independently from this, SCIRA also applies a greedy 2-step partial correlation framework to the same GTEX dataset to infer regulons for these TFs. To generate the full esophageal network, we ran the following commands:

reg.o <- sciraInfReg(data.m, tfEID.v, sdth = 0.25, sigth = 1e-6, spTH = 0.01, pcorth = 0.2, minNtgts = 10, ncores = 4)
net.o <- sciraSelReg(reg.o, tissue.v, toi = c("Esophagus"), cft = c("Blood","Blood Vessel"), degth.v = rep(0.05,3), lfcth.v = c(log2(1.5), log2(1), log2(1)));

The meaning of the parameters and the parameter choices are described in the scira R-package (https://github.com/aet21/scira; refs. 12, 35). Briefly, the function sciraInfReg generates the full set of regulons for all human TFs. The function sciraSelReg identifies the tissue-specific TFs (in our case esophageal), as described above, and then extracts the regulons for these esophageal-specific TFs, resulting in the esophageal regulatory network.

Definition of differentiation activity (TFA) and validation of the esophageal-specific regulatory network
The differentiation activity of a given TF (the TFA value) is obtained by linear regression of a sample’s expression profile (be it bulk RNA-seq or scRNA-seq) against the binding regulon profile of the TF, where positive and negative targets are encoded as +1 and −1, respectively, and with all non-targets set to 0. Specifically, we define the TFA as the estimated t-statistic of this regression. For a given data matrix of samples, the pseudocode is:

tfa <- sciraEstRegAct(data, norm = c("z"), regnet.m = net.o$netTOI);

The esophageal-specific regulatory network was validated in two independent multi bulk tissue expression datasets: one is an RNA-seq dataset from the ProteinAtlas project (36) and the other is an Affymetrix microarray set from Roth and colleagues (37). Specifically, we used the TF regulons to estimate differentiation activity of the 43 esophageal TFs in each of these two datasets, comparing the activity estimates for esophageal tissue against all other tissue-types. In addition, we also downloaded chromatin immunoprecipitation (ChIP-seq) profiles from the ChIP-seq atlas (http://chip-atlas.org; ref. 38) and checked if the binding intensity of the predicted regulon genes was higher than for non-regulon genes using a Wilcoxon rank sum test. This analysis was performed for TFs with available ChIP-seq data (EHF, ELF3, ELK3, FOXA1, FOXQ1, GRHL2, HDAC1, KLF3, KLF5, MYC, RCOR1, RREB1, SOX2, TEAD1, TFAP2A, TFAP2C, TP63, ZNF219). Because of absence or low numbers of ChIP-seq data from normal esophagus, binding intensities were averaged over all available ChIP-seq samples, excluding embryonic samples, hESCs, and predictions from the STRING database. A third validation was performed in the scRNA-seq (10×) human esophagus dataset from ref. 39. Here, we estimated regulatory activity for all 43 esophageal TFs in over 50,000 cells encompassing 19 cell types. We compared the TFA values in the esophageal epithelial cells to the surrounding stromal cells using Wilcoxon rank sum tests.

Power calculation
The calculation of SCIRA’s sensitivity (SE) to detect highly expressed cell type–specific TFs in a given tissue type from the bulk-tissue GTEX dataset is described in detail in ref. 35. Briefly, the main parameters affecting the power estimate include the relative sample sizes of the two groups being compared (n1 and n2) and the average expression effect size e (in effect the average expression fold-change) of the cell type–specific TFs compared with all other cell types, which will depend on the proportion of the cell type (w) within the tissue of interest. Assuming that a given TF is more highly expressed in a cell type that makes up only a proportion w of the cells in the tissue of interest, then $e = \log_2[\mathrm{FC} \times w + 1 \times (1 - w)]/s$, where FC is the average fold change and s is a pooled SD. To estimate the average expression fold-change FC for top DEGs between single-cell types in a tissue, we analyzed expression data from purified FACS-sorted luminal and basal cells from the mammary epithelium (40), as described in detail in ref. 35. Because FACS-sorted cell populations are still heterogeneous, we thus expect the resulting fold change estimates to be conservative. Using limma (41), we estimated FC to be 8 for the highest ranked DEGs, and approximately 6 for the top 200–300 DEGs. We note that these estimates are for a scaled basis where s = 1. Sensitivity was computed using the OCplus R-package.

The CancerStemID framework
Calculation of the transcription factor inactivation load
The main hypothesis underlying CancerStemID is that the number of tissue-specific TFs displaying low differentiation activity in a given cell is a marker of stemness and cancer risk. Given a scRNA-seq dataset encompassing cells from different stages in cancer development, which must include normal, preneoplastic (e.g., hyperplasia, dysplasia) and cancer cells, we first estimate differentiation activity (TFA values, see above) for all the tissue-specific TFs using the SCIRA algorithm (35). We then identify those TFs that exhibit a significant decrease in differentiation activity between the normal and preneoplastic cells. For each of the preneoplastic cells, we also derive a binary profile over the TFs that are significantly inactivated by comparing their TFA value to the TFA values in the normal cells. Specifically, we estimate the mean and SD of the TFA values over the normal cells and then compute the z-score and associated P value for the TFA value of the given preneoplastic cell as compared with the Gaussian defined by the above mean and SD. TFA values with negative z-scores and a P value < 0.05 are declared to be “hits,” resulting in a binary TF inactivation matrix defined over TFs and preneoplastic cells. The transcription factor inactivation load (TFIL) is then defined for each preneoplastic cell by the number (or fraction) of hits. Of note, the number of TFs used in the TFIL computation is thus determined by the data. Importantly, we would not advise computing any TFIL if there is no statistical evidence that most tissue-specific TFs display lower TFA in preneoplastic and cancer cells compared with normal. Indeed, by definition, a significant number of TFs promoting differentiation should display lower TFA in preneoplastic cells representing a condition such as dysplasia, and a skew toward lower TFA can be assessed using a binomial test. If the number of TFs displaying lower TFA in preneoplastic cells is not significantly large, then any P value from the binomial test would be nonsignificant and the TFIL should not be computed.

Calculation of the cancer risk score
Given the TFA-matrix, we apply diffusion maps (42, 43) to this matrix to infer the diffusion components (DC) and the Markov Chain transition matrix (nearest neighbor graph). We note that since the


TFA-matrix is defined over a relatively small number of features (the tissue-specific TFs), no dimensional reduction is necessary prior to application of diffusion maps. The aim of this diffusion map analysis is to ascertain the existence of a bifurcation, with one branch defining invasion/cancer and the other representing a non-cancer fate (e.g., differentiation). To estimate pseudotime, we use the following procedure to obtain a root-cell, from which the two tip points (cancer vs. noncancer) are then identified. From the Markov transition matrix M defined over all cells, we define the submatrix $\tilde{M}$ by only selecting cells in the normal + inflammatory state. This submatrix defines a weighted subgraph, which is not necessarily connected. To identify the main modules within this subgraph we use the walk-trap community detection algorithm (44), to subsequently select the largest community. This defines the root-state and the root-cell is obtained as the cell that minimizes the median absolute deviation in diffusion component space.

To estimate the cancer risk score, we compute the Pearson correlation coefficient (PCC) matrix between each preneoplastic cell (hyperplasia, dysplasia) and each cancer cell, where the PCCs are calculated using the TFA matrix defined for the TFs that exhibit significant downregulation between the normal cells and the preneoplastic ones. Subsequently, the PCCs are averaged over the cancer cells, to arrive at the cancer risk score per cell. We note that the cancer risk score and the TFIL are independent measures, because the TFIL for each preneoplastic cell is estimated by comparison to the normal/inflammatory state, whereas the cancer risk score reflects the similarity to the cancer cells. Thus, a positive association between TFIL and the cancer risk score is nontrivial and would indicate that preneoplastic cells with a higher TFIL are more similar in regulatory activity phase space to cancer cells. An alternative method to estimate the cancer risk score is to compute the PCCs between the preneoplastic cells and the cancer and cancer-free tip points identified via the diffusion map analysis above. Both methods for estimating the cancer risk score yield similar results on the datasets considered here.

otherwise, and with $A_{ii} = 0$). CCAT is a much faster and more scalable proxy of differentiation potency than SCENT. The reason why CCAT measures potency is that a cell of higher stemness tends to overexpress network hubs, with many of these network hubs encoding ribosomal proteins (23), a result we have validated across over 2 million cells and 28 scRNA-seq studies (45). The association between ribosomal gene expression and differentiation potency has been observed across different species and is independent of cell proliferation (46, 47). It is important to observe that the three single-cell measures we compute within the CancerStemID framework, that is, the stemness index CCAT, the TFIL, and the cancer risk score, are all independent from each other, and that any associations between them are nontrivial.

Calculation of cell-cycle scores
To identify single cells in either the G1–S or G2–M phases of the cell-cycle we followed the procedure described in Tirosh and colleagues (48). Briefly, we used genes whose expression is reflective of G1–S or G2–M phase. A given normalized scRNA-seq data matrix is then z-score normalized for all genes present in these signatures. Finally, a cycling score for each phase and each cell is obtained as the average z-score over all genes present in each signature.

Relation between stemness, TFIL, and cell-cycle scores
As shown by us previously (23), the association between stemness (as measured with SCENT or CCAT) and cell proliferation is nonlinear: proliferating cells generally have high stemness scores, but noncycling cells can also attain high stemness values. Thus, proliferation is a confounder that needs to be adjusted for. In this work, we assess the associations between stemness, TFIL and cancer risk by including the two cell-cycle scores as covariates in the linear regressions. In addition, we identify noncycling cells as those with an average cell-cycle score < 0, and recompute linear regressions between the single-cell measures of interest using only such noncycling cells.
Analysis of bulk-tissue mRNA expression from normal and ESCC
Estimation of stemness samples
From the scRNA-seq data matrix and for each cell independently One dataset GSE23400 (paired ESCC and normal adjacent samples,
we estimate a stemness/differentiation potency score using the n ¼ 53) is derived from The Gene Expression Omnibus (GEO; https://
Correlation of Connectome and Transcriptome (CCAT) measure (45). www.ncbi.nlm.nih.gov/geo/query/acc.cgi; refs. 49, 50). The other data-
Briefly, CCAT is defined by the PCC between a single cell’s genome- set is an in-house database of gene expression consisting of 121 ESCC
wide RNA-seq profile ~ x (normalized and log-transformed) and the normal adjacent pairs and an independent set of 159 ESCC tumor
connectivity (i.e., degree or number of neighbors) profile, ~
k, of the samples, i.e., a total of 121 normal samples and 280 ESCC tumor
corresponding proteins as determined by a highly curated protein- samples (34, 51). In all cases, differential expression was performed
protein interaction (PPI) network from Pathway Commons: using Wilcox rank sum tests.
 
CCAT ¼ PCC ~ x; ~
k Whole-genome bisulfite sequencing of Cohort 2 ESCC patients
We performed whole-genome bisulfite sequencing (WGBS) for
CCAT is derived from our Diffusion/Signalling Entropy Rate (SR) ESCC and paired normal tissues derived from 26 patients in Cohort
measure, also called SCENT (23), which is given by the formula 2. Fresh frozen sample regions of ESCC and normal esophageal
1 X X
n epithelium were collected with laser capture microdissection using
SRð~
x; pÞ ¼  pi pij log pij ; Leica model LMD7000 Laser Microdissection Microscope (Leica
maxSR i¼1 j2N ðiÞ
Microsystems) after crystal violet (Sigma-Aldrich, catalog no. 3886)
staining and pathologic reviewing. WGBS libraries were prepared
where pij are the entries of a stochastic matrix, and p is the invariant following NEBNext Enzymatic Methyl-seq Kit direction (New
measure, satisfying pP ¼ p and the normalization constraint pT1 ¼ 1. England Biolabs, catalog no. E7120S/L). The average depth of
The stochastic matrix is given by the formula the sequencing libraries was approximately 26X. WGBS data
xj xj were mapped to hg38 genome and methylation calling was per-
pij ¼ P ¼ ;
k2N ðiÞ xk ðAxÞi formed using Bismark software (version 0.19.0; ref. 52). Dupli-
cation was removed by applying Picard tools (version 2.4.1; http://
where N(i) denotes the neighbors of protein i, and where A is the broadinstitute.github.io/picard/). DNA methylation levels of CpGs
adjacency matrix of the PPI network (Aij ¼ 1 if i and j are connected, 0 within 500-bp upstream of the transcription starting sites of the


esophageal-specific TFs were extracted for analysis. The overall comparison of promoter methylation was performed with a paired Student t test using the averaged DNA methylation (DNAm) levels across all promoter CpGs. For CpG-specific differential methylation analysis, we used the Wald test as implemented in the dss R-package (version 2.38.0; ref. 53). Differentially methylated CpGs between ESCC and paired normal tissue (n = 26 pairs) were defined by requiring a significant Wald test P < 0.01 and a difference in average DNAm (delta) of at least 0.1. To assess statistical significance over the whole promoter, we used a paired t test comparing the mean DNAm value over all DMLs in the promoter between the 26 ESCCs and 26 matched normals. Of note, the latter test requires directional DNAm changes to be more consistent to attain statistical significance and is therefore more stringent.

Analysis of genomic alterations in Cohort 2 ESCC patients
Somatic mutation and copy-number variation (CNV) profiles of the 43 TFs were obtained from Zhang and colleagues (34). Briefly, genomic DNA from blood, adjacent normal tissue and tumor samples was extracted using the QIAamp DNA mini Kit (Qiagen). The sequencing libraries for WGS were constructed using Tn5 transposase and sequenced on HiSeq XTen (Illumina) in 2 × 150 bp paired-end mode. WES libraries were constructed using NEBNext Ultra DNA Prep Kit for Illumina 760 (New England Biolabs), followed by exome enrichment using SureSelect Human All Exon V6 (Agilent Technologies). The WES libraries were sequenced on NovaSeq 6000 (Illumina) in 2 × 150 bp paired-end mode. The mean sequencing depth for WES samples was about 150X (for tumor tissues) while the depth was about 1X for WGS samples. After WES quality control (34), somatic mutations were called with the Mutect2 workflow of GATK and annotated with ANNOVAR software. CNV analysis was performed following the baseqCNV pipeline and significant CNVs at gene level were detected by the GISTIC 2.0 algorithm, as described in Zhang and colleagues (34).

Analysis of lung and colon scRNA-seq datasets
We obtained 4 scRNA-seq 10× Chromium datasets profiling sufficient numbers of normal epithelial and cancer epithelial cells, two from lung tissue (54, 55), and the other two from colon (56). One of the lung-tissue sets (LUAD1) was derived from lung adenocarcinoma (LUAD) patients, and processed annotated count data were downloaded from GEO (GSE131907; ref. 54). This set contained 3,703 normal lung epithelial and 32,764 lung cancer epithelial cells. We followed the same Seurat pipeline as for our esophageal sets, which resulted in 3,614 normal cells (521 alveolar type-1, 2,009 alveolar type-2, 650 ciliated, and 434 club cells), 6,255 lung tumor cells, and 2,896 metastatic cells from adjacent lymph nodes. The other lung tissue set (LUAD2) was derived from both LUAD and LSCC patients, and .Rds files containing the processed data were downloaded from ArrayExpress (E-MTAB-6149). After quality control, a total of 52,698 single cells remained, of which 1,709 were annotated as alveolar, 5,603 as B cells, 1,592 as endothelial cells, 1,465 as fibroblasts, 9,756 as myeloid cells, 24,911 as T-cells, and 7,450 as tumor epithelial cells. The two colon 10× sets derive from the same study (56), and processed annotated count data were downloaded from GEO (GSE132465, GSE144735). The first colon set (COAD1) contained 1,070 normal epithelial and 17,469 cancer cells, whereas the second one (COAD2) comprised 1,144 normal epithelial and 5,024 cancer cells. Count data were processed with the Seurat pipeline. In addition to these two 10× colon sets, we also analyzed a scRNA-seq Fluidigm C1 dataset from Li and colleagues (57), a study profiling malignant and nonmalignant colon epithelial cells from 11 patients. We processed these data as described previously (35). Briefly, we downloaded the normal mucosa and tumor epithelial cell FPKM files from GEO under accession number GSE81861. In total, there were 160 and 272 normal and tumor epithelial cells.

Data availability
The raw sequencing data of our human Cohort 1 scRNA-seq data are available from the Genome Sequence Archive of Beijing Institute of Genomics, Chinese Academy of Sciences (https://ngdc.cncb.ac.cn/gsa/) with accession number HRA000776 (GSA-Human subAccession number). The raw sequencing data of the human Cohort 2 scRNA-seq data are available from GSA (https://bigd.big.ac.cn/gsa) under accession number HRA000195. The gene-by-cell count matrices of Cohorts 1 and 2 are available from GEO under accession numbers GSE199654 and GSE160269. Gene expression matrix of ESCC and paired adjacent normal samples is available from Gene Expression Omnibus (https://www.ncbi.nlm.nih.gov/geo/) with accession number GSE160269. The raw sequencing data and processed gene expression matrix of the mouse model scRNA-seq data have been deposited in GSA under the accession number CRA002118. The GTEX bulk RNA-seq dataset (TPMs) was downloaded from https://commonfund.nih.gov/GTEx/data. The 10× scRNA-seq normal cancer datasets in lung and colon were obtained from either GEO or ArrayExpress (www.ebi.ac.uk) with the following accession numbers: GSE131907, E-MTAB-6149, GSE132465, GSE144735. All other data supporting the findings of this study are available within Supplementary Information files and from the corresponding author upon reasonable request.

Code availability
An R-package CancerStemID with a vignette illustrating the code functionality on the mouse ESCC 10× dataset, and an executable R-markdown file showcasing additional analyses on the human ESCC-cohort-1 10× scRNA-seq and human 10× Visium datasets are freely available from https://figshare.com/projects/CancerStemID_/112371. On the same figshare site, we also provide R-scripts for reproducing source data and results on the human and mouse ESCC datasets analyzed here. The SCIRA R-package for estimating TF differentiation activity is available from https://github.com/aet21/scira. The SCENT R-package for estimating stemness is available from https://github.com/aet21/SCENT.

Ethics approval and consent to participate
This study was approved by the Institutional Review Boards of Cancer Hospital, Chinese Academy of Medical Sciences (20/069-2265). Informed consent was obtained from each patient, and clinical information was collected from medical records.

Results
The CancerStemID framework: rationale
CancerStemID is based on the hypothesis that the differentiation state of a cell can be inferred by estimating the regulatory activity of the TFs that control differentiation within that cell lineage. This is a reasonable assumption since differentiation into a specific cell-lineage is characterized by overactivation of lineage-specific TFs, with these same TFs generally displaying low basal levels of differentiation activity in the corresponding stem and progenitor cells (Fig. 1A; ref. 58). It follows that preneoplastic cells in which lineage-specific TFs exhibit low differentiation activity may exhibit


[Figure 1 image. A, Normal development: adult stem cell (low differentiation activity), uni- or multipotent progenitor (moderate differentiation activity), and differentiated cell (high differentiation activity), each shown with the TFA of tissue-specific TFs. B, Cancer development: Hypothesis: preneoplastic cell population containing cells at low and at high cancer risk, selection, and the resulting cancer cell population. C, The CancerStemID framework: analyze scRNA-seq from different stages in carcinogenesis (normal epithelial cells, hyperplasia, dysplasia, carcinoma in situ, invasive cancer); estimate differentiation activity of tissue-specific TFs (TFA); compute the transcription factor inactivation load (TFIL) of each cell; TFA-derived diffusion map with noncancer and cancer fates.]

Figure 1.
Rationale and the CancerStemID algorithm. A, Focusing on normal development and differentiation, tissue-specific TFs exhibit increased differentiation activity
(TFA) as cells differentiate from adult stem cells to multi-or-unipotent progenitors and finally to fully differentiated cells, as shown. B, Given a population of
preneoplastic cells, these cells exhibit heterogeneity in terms of their TFA profiles. The underlying hypothesis is that those preneoplastic cells with a TFA profile more
similar to that of the adult or progenitor states of the tissue are more likely to be selected for during cancer progression, in line with the Cancer Stem Cell hypothesis.
C, CancerStemID is a computational framework applicable to scRNA-seq data generated from different stages in cancer progression, aimed at identifying the
preneoplastic cells that are under positive selection, i.e., at highest risk of cancer progression. The CancerStemID algorithm first estimates transcription factor
differentiation activity (TFA) for tissue-specific TFs across all single cells in order to identify the TFs that exhibit reduced differentiation activity during cancer
progression. For each cell, we also independently estimate (i) a differentiation potency (dedifferentiation) score using the CCAT/SCENT algorithm, (ii) a TFIL
representing the number of tissue-specific TFs that are inactivated in a given cell, and (iii) a cancer progression (or cancer risk) score. The cancer progression score is
derived by applying diffusion maps to the TFA matrix, so as to infer lineage trajectories that map to cancer and noncancer fates, estimating for each preneoplastic cell
a relative probability of diffusing to the cancer fate, thus defining a cancer progression score. The main hypothesis is that a preneoplastic cell with a higher TFIL is
associated with an increased stemness and cancer progression score.
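
As a rough illustration of two of the per-cell quantities named in this caption, the sketch below computes a CCAT-style potency score (the PCC between a cell's expression profile and PPI degree) and the correlation-based cancer risk score described in Materials and Methods (the mean PCC between a preneoplastic cell's TFA profile and those of the cancer cells). It is a minimal standard-library Python sketch, not the CancerStemID or SCENT code; the function names and all toy values are hypothetical.

```python
from math import sqrt

def pearson(u, v):
    """Pearson correlation coefficient between two equal-length sequences
    (assumes neither sequence is constant)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sqrt(sum((a - mu) ** 2 for a in u))
    sv = sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def ccat(expr, degree):
    """CCAT-style potency: PCC between a cell's log-expression profile and
    the PPI-network degree of the matching genes (same gene order)."""
    return pearson(expr, degree)

def cancer_risk(preneo_tfa, cancer_tfa):
    """Mean PCC between one preneoplastic cell's TFA profile and the TFA
    profile of every cancer cell (profiles restricted to the chosen TFs)."""
    return sum(pearson(preneo_tfa, c) for c in cancer_tfa) / len(cancer_tfa)

# Toy data (hypothetical values, for illustration only).
degree = [50, 30, 10, 5]            # PPI degree per gene
stem_like = [5.0, 3.2, 1.1, 0.4]    # high expression at network hubs
diff_like = [0.5, 1.0, 3.0, 4.5]    # high expression at low-degree genes
cancer_cells = [[-1.0, -0.8, 0.2], [-1.2, -0.5, 0.1]]
cell_a = [-0.9, -0.7, 0.3]          # TFA profile resembling the cancer cells
cell_b = [0.8, 0.6, -0.2]           # TFA profile unlike the cancer cells

print(ccat(stem_like, degree), ccat(diff_like, degree))
print(cancer_risk(cell_a, cancer_cells), cancer_risk(cell_b, cancer_cells))
```

In this toy setting the hub-expressing profile receives a positive CCAT and `cell_a` receives the higher risk score, which is the direction the hypothesis in the caption predicts.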

a higher stemness and cancer risk, reflecting the cell-of-origin that undergoes positive selection during cancer progression (Fig. 1B). The CancerStemID framework thus involves two steps: (i) identification of the key TFs and inference of their differentiation activity (TFA) in single-cells, and (ii) quantification of the overall level of dedifferentiation, which we posit identifies cellular states that progress to the invasive cancer stage (Fig. 1C). To identify the tissue-specific TFs and to estimate their TFA values we use the SCIRA algorithm (35), a machine-learning method that infers TFs and associated regulons from a large and appropriately powered multi-tissue gene expression dataset while adjusting for cell type heterogeneity. Differentiation activity of TFs in single cells is then derived using the regulon set of each TF. As shown by a number of studies (35, 59, 60), this regulon-based approach leads to improved inference of differentiation activity in the context of scRNA-seq data, mainly due to the high dropout rate of such data, which prevents reliable inference of TF regulatory activity from measured TF expression levels. In the second step, we quantify the overall degree of differentiation activity of a cell by direct comparison of the inferred TFA values relative to an appropriate normal state. In effect, the number of tissue-specific TFs displaying low differentiation activity relative to this normal state, a quantity we call TFIL, is a direct proxy of the dedifferentiation state of the cell (Fig. 1C).

Construction and validation of an esophageal-specific regulatory network
To test CancerStemID in ESCC, we first aimed to identify esophageal-specific TFs and their regulons. To this end, we applied SCIRA (35) to the large multi-tissue GTEX expression dataset (61), encompassing 8,555 samples and 29 tissue types, including 686 normal esophageal tissue specimens, while adjusting for the variation in immune-cell infiltration between samples and tissues (Materials and Methods). Our power calculation indicated more than 90% sensitivity to detect esophageal-epithelial specific TFs (Materials and Methods; Supplementary Fig. S1A). SCIRA inferred a regulatory network consisting of 43 esophageal-specific TFs and 1,136 target/regulon genes (Fig. 2A; Supplementary Data File S1; Materials and Methods), with an average of 42 regulon-genes per TF. Several of the identified TFs (e.g., TP63, KLF5, SOX2, FOXE1, PAX9, EHF) have established roles in squamous epithelial differentiation of the esophagus (62–64).
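
The TFIL introduced above, the number of tissue-specific TFs with low differentiation activity relative to a normal state, can be sketched as a simple count. The version below is a minimal illustration under stated assumptions: each TF's activity in a cell is z-scored against a reference set of normal cells and counted as inactivated below a cutoff. The z-score rule, the cutoff `z_cut`, the function name, and the toy values are all hypothetical choices for illustration and are not taken from the CancerStemID package.

```python
from statistics import mean, stdev

def tfil(cell_tfa, normal_tfa, z_cut=-2.0):
    """Count TFs whose activity in `cell_tfa` falls well below the normal
    reference. `normal_tfa[t]` lists the TFA of TF t across normal cells.
    `z_cut` is an illustrative cutoff (an assumption of this sketch)."""
    load = 0
    for value, ref in zip(cell_tfa, normal_tfa):
        z = (value - mean(ref)) / stdev(ref)  # deviation from normal state
        if z < z_cut:
            load += 1
    return load

# Toy reference: three TFs measured in five normal cells (hypothetical).
normal_tfa = [
    [2.0, 2.2, 1.8, 2.1, 1.9],  # TF 1
    [1.0, 1.1, 0.9, 1.0, 1.0],  # TF 2
    [3.0, 2.8, 3.2, 3.1, 2.9],  # TF 3
]
normal_like = [2.0, 1.0, 3.0]       # matches the normal state
dedifferentiated = [0.5, 0.2, 3.0]  # TF 1 and TF 2 strongly reduced

print(tfil(normal_like, normal_tfa), tfil(dedifferentiated, normal_tfa))
```

Because the count is taken against the normal reference only, it carries no information about similarity to cancer cells, which is why an association with a cancer risk score would be informative.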

[Figure 2 image. A, SCIRA on GTEx: 686 esophagus samples vs. 7,869 other-tissue samples, yielding the esophageal-specific regulatory network (43 TFs + 1,136 regulon genes). B, UMAP-1 vs. UMAP-2 plots of the HCA esophagus cells, colored by cell type and by TFA; cell-type legend: epithelial cell (basal, suprabasal, stratified, upper), endothelial cell (lymphatic, vascular arterial, vascular venous), gland duct, mucous gland, CD27- B cell, CD27+ B cell, CD4+ T cell, CD8+ T cell, NKT/CD8+ CTL, dendritic cell, monocyte/macrophage, mast cell, fibroblast, muscle. C, TFA of ELF3 and EHF in Epi (n = 80,489), IC (n = 4,254), Endo (n = 1,677), and Fibro (n = 543) cells; P < 10^-100 for both TFs. D, TFA and DE calls per TF (TFA-UP, TFA-DN, DE-UP, DE-DN; P < 10^-100, P < 10^-50, P < 0.001). TF labels visible in the figure (43): TRIM29, PAX9, YBX2, HR, HES2, BNC2, OVOL1, KLF8, TEAD1, TEAD3, MYCBP, FOXE1, AFDN, RCOR1, ELK3, SOX2, EHF, IRF6, ELF3, STON1, ZNF219, MYC, DES, FOXA1, KLF3, TFAP2A, GRHL2, BARX2, TRIOBP, FOXN1, TRIP10, FOXQ1, TRIM16, HDAC1, KLF5, BNC1, TP63, SOX15, ZNF185, DTX2, RARG, TFAP2C, RREB1.]

Figure 2.
Construction and validation of the esophageal-specific regulatory network. A, We applied the SCIRA algorithm to the large multi-tissue GTEX expression dataset,
encompassing 686 esophagus and more than 7,500 samples from other tissue-types, to infer an esophageal-specific regulatory network consisting of 43
esophageal-specific TFs (black squares) and their regulon genes (red circles). The regulon associated with each TF is depicted with a distinct background color, with
the regulon genes representing direct binding and indirect downstream targets. The regulons are then used to estimate regulatory activity of the TFs (TFA) in an
independent sample (bulk or single-cell RNA-seq profile). B, Validation of the esophageal-specific TF regulons in the 10× scRNA-seq esophageal tissue dataset from
the HCA. Left, UMAP depicts the clusters representing different cell types in the human esophagus. Right, UMAP colors the cells according to the average TFA over the
43 esophageal TFs. C, Violin plots for two of the esophageal TFs (ELF3, EHF) displaying their estimated TFA levels across all cells from the human esophagus stratified
according to whether the cell is epithelial, an immune cell, a fibroblast, or an endothelial cell. P value derived from a one-tailed Wilcoxon rank sum test comparing
epithelial with the other cell types. D, Diagram displaying for each of the 43 esophageal TFs if they are inactivated/downregulated (DN) or activated/overexpressed
(UP) according to differential TFA or differential expression (DE). In the case of differential expression, P values were derived from a Wilcoxon rank sum test. In the
case of TFA values, because these do not have dropouts, we used a t test to estimate P values.
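
The one-tailed Wilcoxon rank sum comparison mentioned in panel C can be sketched as follows. This is a minimal standard-library Python version using the normal approximation without tie or continuity correction, so its P values are only approximate for small samples; in practice a library routine such as SciPy's `mannwhitneyu` would normally be used. The function name and the toy TFA values are hypothetical.

```python
from math import erfc, sqrt

def ranksum_greater(x, y):
    """One-tailed Wilcoxon rank sum (Mann-Whitney U) test of whether values
    in x tend to exceed those in y. Returns (U statistic, approximate P)."""
    values = sorted(list(x) + list(y))
    # Average 1-based rank for each distinct value, so ties share a rank.
    rank = {}
    i = 0
    while i < len(values):
        j = i
        while j < len(values) and values[j] == values[i]:
            j += 1
        rank[values[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    n1, n2 = len(x), len(y)
    u = sum(rank[v] for v in x) - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    p = 0.5 * erfc(z / sqrt(2))  # upper-tail probability of a standard normal
    return u, p

# Toy TFA values for one TF (hypothetical): epithelial vs. other cell types.
epi = [2.1, 1.8, 2.5, 1.9, 2.3]
other = [0.2, -0.1, 0.5, 0.1, 0.3]

u, p = ranksum_greater(epi, other)
print(u, p)
```

Swapping the arguments tests the opposite direction, so a TF that is not epithelial-specific would give a large P value here.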


We validated the 43 esophageal-specific TFs and regulons in two independent multi-bulk tissue expression datasets (Supplementary Fig. S1B and S1C; refs. 36, 37), using ChIP-seq data from the ChIP-seq Atlas (Supplementary Fig. S1D; ref. 38), and in 10× scRNA-seq data of normal esophageal tissue (50,000 cells and 19 cell types) generated as part of the Human Cell Atlas (HCA; Fig. 2B–D; see Materials and Methods; ref. 39). By estimating differentiation activity of the 43 esophageal TFs in this normal esophagus HCA set, we verified that the average differentiation activity (TFA) was highest in the epithelial clusters, and that 81% (i.e., 35) of our TFs displayed a significantly higher activity in epithelial cells (Fig. 2B–D). Within the epithelial compartment, the average TFA correlated with differentiation state, being lowest and highest for cells in the basal and upper epithelium layers, respectively (Fig. 2B; Supplementary Fig. S2A). To benchmark this association with differentiation state, we separately estimated potency of each cell using CCAT (45), a model of single-cell potency rooted in the concept of diffusion network entropy that we have previously and very extensively validated across different cellular lineages and species (human and mouse), encompassing over 28 scRNA-seq studies and 2 million cells (23, 45). Applying CCAT to the esophageal HCA data also confirmed that basal cells exhibited higher potency values compared with the more differentiated cells of the stratified and upper epithelium, yet unlike TFA, the monotonic linear pattern was less evident and only appreciable when focusing on noncycling cells (Supplementary Fig. S2B), indicating that TFA is less confounded by cell-cycle state and thus a more reliable proxy of differentiation state than CCAT.

Esophageal-specific TFs display reduced differentiation activity in preneoplastic cells
Next, we performed scRNA-seq (10× Chromium) profiling in cancer and adjacent noncancer tissue specimens derived from 14 patients with ESCC (“Cohort 1”), representing four different stages in cancer development including normal/inflammatory (NOR), LGIN/HGIN, and ICA (see Materials and Methods; Supplementary Table S1; Fig. 3A). After stringent quality control, batch correction and processing with Seurat, we obtained over 110,000 cells, of which 3,178 were annotated as epithelial (Fig. 3A; see Materials and Methods). This included 1,176 nonmalignant epithelial cells, allowing us to explore the dynamics of differentiation activity change across preneoplastic stages. Dimensional reduction and graph-based clustering over the most variable genes revealed clusters that correlated with disease stage (Fig. 3B), but a much stronger association with stage was seen when performing PCA on the estimated differentiation activity matrix over the 43 esophageal-specific TFs, with PC-1 clearly discriminating normal/inflammatory and LGIN cells from HGIN and ICA (correlation test P < 10^-90; Fig. 3C). In line with this, we observed that 25 of our 43 esophageal-specific TFs exhibited a significant decrease of activity in HGIN and ICA cells (Fig. 3D), representing a significant skew towards lower differentiation activity (binomial test, P = 3 × 10^-5; Fig. 3E). A Monte-Carlo randomization analysis of the regulons further demonstrated that this number of less active TFs could not have arisen by random chance (see Materials and Methods; Supplementary Fig. S3A). By focusing on subsets of patients for which both normal/inflammatory and HGIN/ICA cells were profiled, we were also able to exclude batch effects (Supplementary Fig. S3B). Confirming that batch effects were not driving these patterns, results were validated in an independent 10× scRNA-seq dataset comprising 60 patients with ESCC (“Cohort 2,” see Materials and Methods; Fig. 3D and E). We observed good agreement between the cancer versus normal differential activity patterns derived from the two independent cohorts (Fisher one-tailed test P = 0.006; Fig. 3F). Of note, these skews toward lower differentiation activity were not observed at the level of TF expression, consistent with previous demonstrations that regulons improve the sensitivity to detect differentiation activity changes as compared with TF expression (Fig. 3E; ref. 35). In support of this, we note that tumor versus normal differential activity patterns derived from the scRNA-seq data were more consistent than differential expression, when compared with the differential expression patterns seen in corresponding bulk tissue RNA-seq datasets (Supplementary Fig. S3C and S3D; Supplementary Table S2).

Of note, some of the TFs displaying reduced differentiation activity (e.g., TRIM29, EHF, PAX9) have been implicated as tumor suppressors in squamous cell carcinoma including ESCC (65–67). Other TFs like TP63 and SOX2, which have been implicated as oncogenes in ESCC (68–72), displayed increased expression in cancer at both single-cell and bulk RNA-seq levels, whilst simultaneously displaying reduced differentiation activity (Supplementary Fig. S3C), suggesting that their cistromes undergo reprogramming in ESCC. To confirm this, we observed that a list of 152 TP63 and SOX2 targets derived from ESCC bulk RNA-seq and ChIP-seq data (see Materials and Methods; refs. 68–72) displayed consistent upregulation in our scRNA-seq data (Supplementary Fig. S4A and S4B). Moreover, none of these 152 targets overlapped with our TP63/SOX2 regulon target genes, a clear reflection that the latter solely measure TP63/SOX2’s role in esophageal differentiation. These data establish that esophageal-specific TFs display reduced differentiation activity not only in ESCC but also in a stage preceding cancer development, with TP63/SOX2’s cistromes reprogrammed to acquire oncogenic functions (68–72). Of note, one of the few TFs displaying consistent increased TFA in the two cohorts was MYC (Fig. 3D). To shed light on the potential significance of this, we observed that the 31 genes making up our MYC regulon are enriched for ribosome biogenesis (Supplementary Table S3), which has been proposed to be a marker of stemness (23, 46).

Validation in a mouse model of esophageal cancer development
To further validate our findings in ESCC, we next analyzed scRNA-seq data of 36,114 CD45 cells collected at six well-defined stages of ESCC development in mouse (26). In this model, ESCC is induced by 4-nitroquinoline 1-oxide (4NQO), a chemical carcinogen that mimics ESCC development in humans (26). To justify application of our esophageal TF regulons derived from human data to mouse scRNA-seq data, we first checked that the majority of the 43 TFs (n = 31) displayed a higher TFA in the normal epithelia compared to immune and stromal cells (Supplementary Fig. S5A–S5C). For each of these 31 TFs we estimated their TFA in each of 1,760 epithelial cells, encompassing cells from the normal/inflammatory state (NOR/INF, n = 392), hyperplasia (HYP, n = 383), dysplasia (DYS, n = 187), carcinoma in situ (CIS, n = 163), and ICA (n = 635; Supplementary Fig. S5D). Mapping the dynamic changes between subsequent disease stages revealed two waves of reduced differentiation activity: one between the inflammatory and hyperplasia stages, and another between CIS and invasive cancer (Supplementary Fig. S5E). Over the whole time course, we observed a clear skew, with 71% of the 31 TFs exhibiting significantly lower differentiation activity during tumor progression (binomial test, P = 0.005; Supplementary Fig. S5F). A similar but nonsignificant skew was also observed at the level of TF-expression (Supplementary Fig. S5G and S5H).
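
The skew toward lower differentiation activity is assessed above with binomial tests. The sketch below shows one way such a test could be set up, as an exact one-sided binomial test on the numbers of TFs with decreased versus increased activity. The choice of null (decreases and increases equally likely among changed TFs), the function name, and the example counts are assumptions of this sketch; the text does not specify the exact null used, so the sketch is not expected to reproduce the reported P values.

```python
from math import comb

def binomial_skew_p(n_down, n_up):
    """Exact one-sided binomial P value for seeing at least `n_down`
    decreases among n_down + n_up changed TFs, assuming decreases and
    increases are equally likely under the null (probability 0.5)."""
    n = n_down + n_up
    tail = sum(comb(n, k) for k in range(n_down, n + 1))
    return tail / 2 ** n

# Hypothetical counts, for illustration only.
print(binomial_skew_p(9, 1))   # strong skew toward decreases
print(binomial_skew_p(5, 5))   # no skew
```

With 9 decreases and 1 increase, the tail contains 11 of the 1,024 equally likely outcomes, so the first call returns 11/1024.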


[Figure 3 image. A, Multi-region tissue collection (Cohort 1; N = 14 patients with ESCC): ICA (n = 14), HGIN (n = 9), LGIN (n = 6), INF (n = 10), NOR (n = 8); FACS of live cells followed by Chromium scRNA-seq; single-cell transcriptomic analysis (N = 115,930) with cell type annotation (epithelial cell, endothelial cell, fibroblast, pericyte, T cell, myeloid cell, B cell, plasma B cell, mast cell). B, tSNE_1 vs. tSNE_2 plots (N = 3,178) colored by epithelial subcluster (EC1 to EC6) and by disease stage (NOR, INF, LGIN, HGIN, ICA stages I to IV). C, PCA on TFA matrix: PC1 vs. PC2 with density plot along PC1 (N/INF, LGIN, HGIN, ICA; N = 3,178; P < 10^-90). D, TFA heatmaps with t(TFA) color bars for Cohort 1 (14 ESCC; N/INF, LGIN, HGIN, ICA; n = 95, 28, 1,053, and 2,022) and Cohort 2 (60 ESCC; N, ICA; n = 37 and 20,433). E, Number of significant TFs (#SigTF; UP and DN) according to TFA and DE in Cohorts 1 and 2. F, t(TFA: ICA - N) in Cohort 1 vs. Cohort 2 (P = 0.006). Significance marks: * P < 0.05; ** P < 10^-5; *** P < 10^-10.]

Figure 3.
Reduced differentiation activity of esophageal-specific TFs precedes cancer development. A, scRNA-seq profiling on tumor and normal adjacent tissue from
14 patients with ESCC. UMAP diagram depicts 115,930 cells with clusters annotated to different cell types. B, The first tSNE-plot displays six different epithelial
subclusters. The second tSNE plot colors cells by disease stage. C, PCA scatterplot (PC1 vs. PC2), as derived by applying PCA to the transcription factor
regulatory activity (TFA) matrix of 43 esophageal TFs and a total of 3,178 epithelial cells. Cells are colored by disease stage. Density plot beneath
PC1 axis depicts the distribution of cells of each disease stage according to PC-1 weight. P value is from a Pearson correlation test between PC1 and disease
stage (1 = N/INF, 2 = LGIN, 3 = HGIN, 4 = ICA). D, Heatmaps of TFA for the 43 esophageal TFs across the four main disease stages in Cohorts 1 and 2 as shown.
For each disease stage, the TFA over all cells in that stage were averaged. Color bar labeled “t(TFA)” displays the t statistic of a linear regression between TFA
and disease stage (encoded as an ordinal variable, 1 = N/INF, 2 = LGIN, 3 = HGIN, 4 = ICA), and P values shown derive from this t test. In the case of Cohort 2,
there were only two disease stages (1 = N, 2 = ICA). E, Barplots displaying the number of significantly inactivated/downregulated (DN) and activated/
overexpressed (UP) TFs in Cohorts 1 and 2 according to differential TFA or differential expression. F, Scatterplot of the t statistics of differential TFA between
ICA and N for Cohort 1 versus Cohort 2. P value is from a linear regression. The number of TFs significantly inactivated in both Cohorts 1 and 2 is displayed in blue.
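
The t(TFA) statistic described in panel D, from a linear regression of TFA on disease stage encoded as an ordinal variable, can be sketched with ordinary least squares for a single predictor. This is a minimal standard-library Python illustration; the function name and the toy TFA values are hypothetical, and no covariates are included.

```python
from math import sqrt

def slope_t_statistic(stage, tfa):
    """t statistic of the slope in the simple linear regression
    tfa ~ intercept + slope * stage (stage coded 1 = N/INF ... 4 = ICA)."""
    n = len(stage)
    mx, my = sum(stage) / n, sum(tfa) / n
    sxx = sum((x - mx) ** 2 for x in stage)
    sxy = sum((x - mx) * (y - my) for x, y in zip(stage, tfa))
    slope = sxy / sxx
    intercept = my - slope * mx
    # Residual sum of squares and standard error of the slope (n - 2 df).
    sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(stage, tfa))
    se = sqrt(sse / (n - 2) / sxx)
    return slope / se

# Toy data (hypothetical): two cells per stage.
stage = [1, 1, 2, 2, 3, 3, 4, 4]
tfa_inactivated = [2.0, 1.8, 1.5, 1.6, 0.9, 1.1, 0.2, 0.4]  # falls with stage
tfa_stable = [1.0, 1.2, 1.1, 0.9, 1.0, 1.1, 0.9, 1.2]       # no trend

print(slope_t_statistic(stage, tfa_inactivated))
print(slope_t_statistic(stage, tfa_stable))
```

A strongly negative t statistic corresponds to a TF whose activity decreases with stage; with only two stage codes, as for Cohort 2, the same regression amounts to a two-group comparison.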


Figure 4.
Spatial transcriptomic analysis reveals reduced TFA relative to normal basal cells. A, Images showing histology (top) with annotated ST spots (bottom) mapped to corresponding epithelial tissue types derived from LZE22 patient. The number of spots in each category is indicated. Epithelial region (separated from stromal region with yellow solid lines) and basal region (area between yellow dashed and solid lines) were annotated after pathologic review. Average TFA of each ST spot is displayed in color scale in relative measures. B, A violin plot showing the distribution of TFA across NOR, HGIN, and ICA spots (n = 141, 313, and 613, respectively). P values were computed with an unpaired Student t test. C, Heatmap displaying the average TFA of the 43 esophageal-specific TFs, averaged over normal basal, HGIN, and ICA states. The number of spots in each stage is indicated. Statistics of differential TFA are indicated in the color bar below. P values were computed with an unpaired Student t test. Scale bar, 500 μm. D, Heatmap displays the signed statistical significance of association between differentiation activity (TFA) and cancer progression, for the 43 esophageal-specific TFs across six independent scRNA-seq studies with the 10× Visium data results displayed separately for each of the 3 patients. For the 10× Visium data we display the results for each patient separately because for the 10× Visium data we had enough normal epithelial spots for the comparison within each patient to be meaningful. The values in this heatmap represent the sign of the t-statistic multiplied by -log10(P), where P is the associated P value. Blue colors denote reduced TFA during ESCC progression. The color bar to the right labels the number of studies in which the TF displays reduced TFA. E, Plots compare the number of TFs observed to exhibit reduced differentiation activity in all 6 studies (left) and in at least 5 studies (right) with the corresponding binomial null distributions. Green vertical line indicates the observed numbers and the P value is from a one-tailed binomial test.
[Figure 4 graphic omitted. Recoverable panel annotations: A, patient LZE22, normal basal spots = 141, HGIN spots = 313, ICA spots = 613. B and C, significance legend: * P < 0.05; ** P < 10−5; *** P < 10−10. D, columns: ESCC1, ESCC2, Mouse ESCC, Visium (LZE7), Visium (LZE8), Visium (LZE22); color scale Sign(t)*(-Log10 P). E, TFs inactivated in all 6 studies, P = 2 × 10−8; in at least 5 studies, P = 5 × 10−9.]
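
The two quantities named in panels D and E can be sketched in a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, and the null probability used for the binomial test (each TF reduced in any one study with probability 0.5, independently across studies) is an assumption made here for illustration, so the resulting P value is not expected to match the one reported in the figure.

```python
import math

def signed_significance(t_stat, p_value):
    """Return sign(t) * -log10(P), the quantity described for the panel D heatmap."""
    sign = (t_stat > 0) - (t_stat < 0)
    return sign * -math.log10(p_value)

def binomial_upper_tail(k_observed, n_trials, p_success):
    """One-tailed P(X >= k_observed) for X ~ Binomial(n_trials, p_success)."""
    return sum(
        math.comb(n_trials, k) * p_success**k * (1 - p_success) ** (n_trials - k)
        for k in range(k_observed, n_trials + 1)
    )

# Assumed null (hypothetical): a TF is "reduced" in all 6 studies with probability 0.5**6.
p_null = 0.5 ** 6
# 8 of 43 TFs were observed to be reduced in all six datasets.
p_all_six = binomial_upper_tail(8, 43, p_null)
```

Negative values of `signed_significance` correspond to reduced TFA during progression (blue in the heatmap).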

Reduced differentiation activity is observed relative to the basal epithelium

Given that normal esophageal basal cells displayed much lower TFA compared with normal cells from the differentiated upper epithelium (Fig. 2B; Supplementary Fig. S2A), we reasoned that the lower differentiation activity displayed by esophageal-specific TFs during cancer progression could reflect the increased enrichment of the basal cell-of-origin population. To explore this, we reran the differential TFA-analysis using only a subset of normal cells that we could confidently classify as basal. This was done in our human ESCC Cohort 2 for which using less stringent quality control thresholds, a sufficient number of normal epithelial cells (n = 183) were obtained. On the basis of four well-known esophageal basal markers (TP63, KRT5, KRT14, KRT15), we identified 36 basal cells, which reassuringly displayed a significantly higher potency than the 147 nonbasal ones, thus validating our assignments (Supplementary Fig. S6A and S6B). Despite the relatively small number of basal cells, esophageal-TFs still displayed a clear trend toward reduced differentiation activity in preneoplastic and cancer cells (Supplementary Fig. S6C; binomial test, P < 0.0001). To


[Figure 5 graphic omitted. Recoverable panel annotations: A, differential TFA analysis between N/INF (n = 95) and LG/HGIN (n = 1,081) single cells, with TFIL shown above the inactivation heatmap. B, Cohort 1, stemness (CCAT) in N (n = 95) versus ICA (n = 2,002), P = 3 × 10−19. C, stemness (CCAT) by TFIL of 0, 1, 2, ≥3 (n = 590, 247, 127, 117), P = 8 × 10−88. D, stemness (CCAT) versus cycle score, P = 2 × 10−190; noncycling cells by TFIL of 0, 1, 2, ≥3 (n = 412, 144, 34, 16), P = 4 × 10−24. E, diffusion components DC1, DC2, and DC3, with cancer fate and noncancer fate marked. F, CancerProgScore by TFIL, P = 1 × 10−27. G, CancerProgScore versus cycle score, P = 3 × 10−45; noncycling cells, P = 2 × 10−7.]

Figure 5.
Transcription factor inactivation load correlates with stemness and cancer risk. A, A differential TFA analysis was performed between epithelial cells from the normal/
inflammatory stage and cells from the LGIN/HGIN (Cohort 1). Heatmap displays a binary matrix [black, inactivation event; gray, not significant (n.s.)] depicting the
inactivation events for each cell and TF. For a given LGIN/HGIN cell, inactivation of a TF is defined by a significantly lower activity in that cell compared with all N/INF
cells using a Bonferroni-adjusted P < 0.05 threshold, and where the P value is computed from a cells linear model. TFA is ranked in increasing order of TFIL, where TFIL
is defined as the number of TFs displaying an inactivation event in that cell. TFs labeled in blue are those exhibiting a significantly lower activity in LGIN/HGIN
compared with normal/inflammatory stage. B, Violin plots display the estimated stemness scores using the CCAT measure for epithelial cells in the normal (N) and
ICA for Cohort 1. P values derived from a one-tailed Wilcoxon rank sum test. C, Violin plots displaying the estimated stemness score (CCAT) against the TFIL in the
LGIN/HGIN cells from Cohort 1. P values derived from a linear regression between CCAT and TFIL. D, Smoothed scatterplot of CCAT versus the computed cell-cycle
score for the LGIN/HGIN cells. P values are from a linear regression between CCAT and cell-cycle score. Violin plot to the right is as in C but now only using noncycling
cells, that is, cells with a negative cell-cycle score. E, Three-dimensional diffusion map inferred by applying the diffusion maps algorithm to the TFA-matrix defined
over the 43 esophageal TFs and 3,178 epithelial cells (Cohort 1). Cells are colored according to disease stage, as shown. Black box contains the root state, that is, a cell
from the normal stage that has highest centrality. Red boxes denote the two inferred tipping points, labeling cancer-free and cancer endpoints. Below the diffusion
map, we display a two-dimensional density plot encompassing all LGIN/HGIN cells, and a cancer risk score was obtained for each of these cells by their proximity to
the cancer fate. F, Violin plot displays the estimated cancer progression score for epithelial cells in LGIN/HGIN stage as a function of TFIL. P values derived from a
linear regression. G, Smoothed scatterplot displays the relation between the cancer progression and cell-cycle scores. P values derived from a linear regression. Right
panel is like F, but now using only noncycling cells, defined as cells with a cell-cycle score less than 0.
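
The TFIL definition in panel A (count the TFs whose activity in a cell is significantly lower than in all N/INF cells after Bonferroni adjustment) can be illustrated with a short Python sketch. Note the assumptions: the caption states that P values come from a linear model, whereas this sketch substitutes a simple one-sided normal z-test against the N/INF reference. The function names, the choice of `n_tests`, and the toy values are hypothetical.

```python
import math
from statistics import mean, stdev

def lower_tail_p(value, reference):
    """One-sided P that `value` is lower than the reference cells (normal z-test, assumed)."""
    z = (value - mean(reference)) / stdev(reference)
    return 0.5 * math.erfc(-z / math.sqrt(2))  # standard normal CDF evaluated at z

def tf_inactivation_load(cell_tfa, normal_tfa, n_tests, alpha=0.05):
    """Count TFs whose Bonferroni-adjusted lower-tail P value is below alpha."""
    load = 0
    for tf, value in cell_tfa.items():
        p_adjusted = min(1.0, lower_tail_p(value, normal_tfa[tf]) * n_tests)
        if p_adjusted < alpha:
            load += 1
    return load

# Toy reference TFA values for N/INF cells (hypothetical numbers)
normal_tfa = {
    "PAX9": [1.0, 1.1, 0.9, 1.05, 0.95],
    "SOX2": [0.5, 0.6, 0.4, 0.55, 0.45],
}
# One LGIN/HGIN cell with strongly reduced PAX9 activity and unchanged SOX2 activity
cell = {"PAX9": -1.0, "SOX2": 0.5}
tfil = tf_inactivation_load(cell, normal_tfa, n_tests=2)
```

In this toy case only PAX9 passes the adjusted threshold, so the cell has a TFIL of 1.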


[Figure 6 graphic omitted. Recoverable panel annotations: A, frequency of promoter methylation changes (hypermethylation and hypomethylation; TFs with ≥1 versus 0 hypermethylated loci; P = 7.2 × 10−16) and frequency of genomic alterations (mutation, deletion, amplification) for the 43 TFs. B, ΔPromoter CpG methylation across the 26 patients (color scale −1 to 1). C, TFA of PAX9 (N, UC, MC: n = 37; 4,388; 6,823), EHF (n = 37; 4,142; 7,069), and ELF3 (n = 37; 6,554; 4,657). D, t(TFA)[MC-UC] (color scale −50 to 20). E, stemness (CCAT) by TFIL of 0 to 6 (n = 28,841; 10,218; 3,420; 1,301; 512; 188; 67). F, stemness (CCAT) and fraction of cells by TFIL (0, 1, 2, ≥3) for NOTCH1 (WT, n = 37,497; MT, n = 7,050) and TP53 (WT, n = 23,946; MT, n = 20,601). Significance legend: * P < 0.05; ** P < 10−5; *** P < 10−10; **** P < 10−50; n.s., not significant.]

Figure 6.
Differential TFA of esophageal-specific TFs is associated with differential DNAm. A, Top, the y-axis of the barplot represents for each TF, the fraction of
promoter CpGs that display significant differential methylation between the 26 ESCCs and their 26 matched normals (Cohort 2) as assessed using a Wald test
(P < 0.01). TFs with at least one significantly hypermethylated promoter CpG site are displayed to the left of the dashed line (n = 19). Significant differences at
the level of each TF promoter were assessed using a paired Student t test (P < 0.01) comparing the mean DNAm values over the promoter. Significant TFs are
shown by annotating the TF names in purple or yellow depending on whether it is hyper- or hypomethylated, respectively. Overall significance of differences in
DNAm levels for all 43 TFs was calculated with a paired Student t test (P = 7.2 × 10−16). Bottom, heatmap displays the frequency of nonsynonymous somatic
mutations and gene copy number variations across the ESCC patients. B, Heatmap displays the methylation profiles of CpGs mapping to promoters with
significant hypermethylation (red) or hypomethylation (blue) in at least five patients. C, Violin plots display the TFA levels of PAX9, EHF, and ELF3 for
epithelial cells derived from normal esophageal tissue (N), tumor cells from patients with no significant promoter hypermethylation (UC), and tumor cells from
patients with significant promoter hypermethylation (MC). The number of single cells in each category is indicated. P values were computed with an unpaired
Student t test. (Continued on the following page.)


validate and strengthen these findings with increased cell numbers, we performed STs with the 10× Visium platform on normal, squamous dysplasia and invasive cancer samples from three patients with ESCC of Cohort 1 (see Materials and Methods). Across all three patients, this revealed a total of 4,208 epithelial spots (“Epi spots”), distributed as 477 normal, 945 inflammatory, 243 LGIN, 527 HGIN, and 2,016 ICA (Fig. 4A; Supplementary Fig. S6D and S6E). From the normal/inflammatory stages, we confidently identified by histology (three separate pathologists working independently with a 20× microscope) a total of 621 basal spots located in the vicinity of the basal membrane or papillae (Supplementary Fig. S7A and S7B), which we subsequently confirmed by ST expression of basal-specific markers (Supplementary Fig. S8). Unsupervised clustering of annotated epithelial, stromal, and immune-cell spots validated our assignments, revealing clear separability, thus confirming high purity of our epithelial spots (Supplementary Fig. S9A–S9E). Estimating TFA values in the normal basal, dysplasia and cancer tissue blocks of each patient, revealed a highly significant and consistent pattern of overall reduced differentiation activity in cancer cells (Fig. 4A and B; Supplementary Fig. S6D and S6E), with TFA patterns of the individual TFs confirming an overall decrease in differentiation activity relative to the normal basal state (Fig. 4C; Supplementary Fig. S6D and S6E). Thus, these data indicate that the reduced differentiation activity of esophageal-TFs during cancer progression is not only driven by the corresponding enrichment of the basal cell-of-origin. Combined across the two human ESCC cohorts, the mouse ESCC dataset and the Visium assays from three patients, we observed a total of 8 TFs displaying consistent reduced differentiation activity during cancer progression in all six of these datasets, with 19 TFs doing so in at least five datasets (Fig. 4D), results that cannot be explained by random chance (Fig. 4E).

Reduced differentiation activity correlates with stemness and cancer risk

Next, we aimed to determine whether cells exhibiting the lowest differentiation activity in a preneoplastic cell population define transcriptomic states that progress to cancer. Although assessing this would require prospective lineage-tracing, one can obtain supportive evidence for this computationally. First, we devised a method to call “TF inactivation” events in each of the noncancerous LGIN/HGIN cells from our human scRNA-seq data (Cohort 1), by comparing the estimated TFA in the cell to those of the normal/inflammatory state (see Materials and Methods). For each noncancerous cell, we thus obtained a “TFIL,” representing the number of esophageal-specific TFs displaying low differentiation activity in that cell (Fig. 5A). Independently from this, we also estimated the stemness of each cell using CCAT, and consistent with the cancer stem-cell hypothesis, ESCC cells from Cohort 1 exhibited higher stemness (i.e., lower commitment and differentiation) than normal cells (Fig. 5B). Importantly, we observed a strong nontrivial correlation between CCAT and TFIL (Fig. 5C), thus establishing a direct connection between potency and differentiation activity. Of note, the CCAT potency measure also exhibited a strong association with cell proliferation, yet critically, the association is nonlinear, indicating that noncycling cells can also exhibit moderate to high potency (Fig. 5D). We verified that the association between stemness and TFIL is independent of cell proliferation (Supplementary Table S4), and in line with this, noncycling cells with a high TFIL exhibited a higher stemness than noncycling low TFIL ones (Fig. 5D).

To test whether the TFIL and stemness are associated with cancer progression, we independently estimated, for each of the noncancerous cells, a cancer progression score, reflecting the closeness of the cell’s position to the cancer state in the differentiation activity (TFA) phase space, which we inferred by applying diffusion maps (see Materials and Methods; Fig. 5E; refs. 42, 43). We note that the diffusion map naturally predicted a bifurcation with cancer cells clustering almost exclusively at one end of diffusion component-1 (DC1) and with noncancer cells distributed more evenly (Fig. 5E). As with the stemness measure itself, the cancer progression score increased with the TFIL per cell (Fig. 5F), correlating nonlinearly with cell proliferation but also independently of it (Fig. 5G; Supplementary Table S4). All these findings were replicated with high statistical significance in our ESCC mouse model (Supplementary Fig. S10A–S10C; Supplementary Table S4).

Reduced differentiation activity correlates with DNA methylation changes

To explore whether changes in differentiation activity are associated with DNA alterations, we performed whole-genome sequencing (WGS) and laser capture microdissection–based whole-genome bisulfite-sequencing (LCM-WGBS) on 26 ESCC bulk samples from Cohort 2, and on corresponding matched normal adjacent tissue from all 26 patients (see Materials and Methods). Focusing on promoter DNA methylation (DNAm) within 500-bp upstream of the transcription starting site, we observed that DNAm levels were higher in ESCC compared with paired normal adjacent tissue (Fig. 6A; Supplementary Table S5). Among the 1,478 CpGs located in the promoter regions of the 43 TFs, there were 290 regions encompassing 19 TFs that displayed a significant increase in methylation in ESCC compared with matched normal tissue (Wald test, P < 0.01). Hypermethylation was recognized at 87% of sites with significant DNAm changes, with the most frequent changes occurring at PAX9 (48.5%), ELF3 (37.2%), DES (34.4%), EHF (30.8%), and STON1 (28.0%; Fig. 6A and B). In contrast, genomic alterations were not as significant, with nonsynonymous mutations and copy-number deletions distributed sporadically at relatively low frequencies (<10%; Fig. 6A). Comparing the previously estimated TFA values between normal and cancer cells from patients with and without TF promoter hypermethylation revealed that for 11 of the 19 hypermethylated TFs, differentiation activity was significantly lower in the cancer samples with TF promoter hypermethylation (Fig. 6C and D). By applying our SEPIRA algorithm (12) to each WGBS sample, with DNAm values in each profile summarized at the level of gene promoters, we estimated TFA values for a total of 11 TFs that

(Continued.) D, Heatmap displaying the significance of differential TFA between tumor cells from patients with and without significant promoter
hypermethylation and for the 19 TFs displaying significant hypermethylation in ESCC compared with normal adjacent tissue (i.e., the significant TFs in
barplot of A). E, Boxplots displaying correlation between the CCAT potency/stemness index (y-axis) and TFIL (x-axis) in the cancer cells from ESCC Cohort 2.
P value is from a linear regression. F, Violin plots compare the CCAT potency/stemness values with NOTCH1 and TP53 mutation status as assessed in ESCC
Cohort 2. Note that somatic mutations were only assessed at the bulk tissue level within each ESCC patient, hence for patients carrying mutations, we assigned
all corresponding single cells as “MT,” with patients not carrying mutations assigned the status of wild-type (WT). P values are from a one-tailed Wilcoxon rank
sum test. Barplots compare the relative proportions of cells with varying TFILs between NOTCH1 mutant and wild-type patients, and similarly for TP53. P value
derives from a χ² test. In the case of TP53, relative proportions don’t change in a consistent manner, so P value is not shown.
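
The comparison described for panel F has two steps: patient-level mutation status is propagated to every single cell from that patient, and the distribution of TFIL categories is then compared between mutant and wild-type groups. A minimal Python sketch under stated assumptions follows. The function names, patient identifiers, and cell counts are hypothetical, and only the Pearson chi-square statistic and its degrees of freedom are computed (no P value).

```python
from collections import Counter

def tfil_category(tfil):
    """Bin TFIL into the categories 0, 1, 2, and >=3."""
    return ">=3" if tfil >= 3 else str(tfil)

def contingency_by_status(cells, patient_status):
    """cells: iterable of (patient_id, tfil). Each cell inherits its patient's WT/MT status."""
    table = {"WT": Counter(), "MT": Counter()}
    for patient_id, tfil in cells:
        table[patient_status[patient_id]][tfil_category(tfil)] += 1
    return table

def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a {row: Counter} table."""
    rows = list(table)
    cols = sorted({c for counts in table.values() for c in counts})
    row_total = {r: sum(table[r].values()) for r in rows}
    col_total = {c: sum(table[r][c] for r in rows) for c in cols}
    total = sum(row_total.values())
    stat = 0.0
    for r in rows:
        for c in cols:
            expected = row_total[r] * col_total[c] / total
            stat += (table[r][c] - expected) ** 2 / expected
    return stat, (len(rows) - 1) * (len(cols) - 1)

# Hypothetical toy data: one mutant and one wild-type patient
patient_status = {"P1": "MT", "P2": "WT"}
cells = [("P1", 0)] * 10 + [("P1", 3)] * 20 + [("P2", 0)] * 20 + [("P2", 4)] * 10
table = contingency_by_status(cells, patient_status)
stat, dof = chi_square_statistic(table)
```

Because status is assigned per patient rather than per cell, cells from the same patient are not independent observations, which is worth keeping in mind when interpreting any P value derived from such a table.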


displayed sufficient read coverage at regulon genes and that according to our previous SCIRA-based analysis were inactivated in ESCC. Of these 11, a total of 4 (SOX2, RCOR1, ELK3, TEAD1) displayed significantly lower TFA in ESCC, while the remaining 7 did not display differential TFA (Supplementary Fig. S11). Thus, for a small fraction of TFs, their lower TFA in ESCC is associated with hypermethylation of TF-target promoters. Next, we decided to explore whether the correlation of TFA with dedifferentiation is independent of underlying NOTCH1 and TP53 mutations, two key mutations in ESCC development. Whilst our CCAT stemness/dedifferentiation index displayed a very strong and highly significant association with the TFIL derived from the TFA profiles (as assessed over the single cells from Cohort 2; Fig. 6E), we only observed a much milder and no association with NOTCH1 and TP53 mutation status, respectively (Fig. 6F). Thus, these data support the view that changes in differentiation activity of the esophageal-specific TFs are mirrored at the level of the DNA methylome and that these changes provide a closer proxy to the dedifferentiation/stemness index of cancer cells compared with NOTCH1 and TP53 mutations.

Reduced differentiation activity of tissue-specific TFs is a cancer hallmark

Finally, we asked whether the low differentiation activity displayed by tissue-specific TFs in esophageal cancer is a broad phenomenon that applies across cancer types. We first explored this in the context of LUAD, for which a recent 10× scRNA-seq study (“LUAD1”; ref. 54) had profiled sufficient numbers of normal and tumor epithelial cells, including alveolar type-1 and type-2 (AT1/2) cells, which are the most abundant cell types in the distal airway epithelium and which are thought to give rise to LUAD (73). Using a lung-specific regulatory network consisting of 38 lung-specific TFs and associated regulons (Supplementary Data S2; ref. 35), we estimated differentiation activity of the 38 TFs across all normal and tumor epithelial cells. This confirmed that the TFs were specific to alveolar cells, and predominantly for the more differentiated AT1 subtype (Supplementary Fig. S12A). In line with this, our CCAT stemness index predicted a higher level of potency for AT2 compared with AT1 (Supplementary Fig. S12B). We also observed that the TFA values for tumor cells were significantly lower compared with the combined alveolar cells, with an even more pronounced decrease for metastatic cells collected from adjacent lymph nodes (Supplementary Fig. S12A). Correspondingly, the CCAT stemness index was increased in tumor and metastatic cells compared with normal alveoli (Supplementary Fig. S12B). Comparing normal alveoli to tumor cells only, we observed a significant skew toward lower differentiation activity with 26 TFs exhibiting lower TFA levels in tumor cells (binomial test, P = 0.002; Supplementary Fig. S12C), a number that could not have arisen by random chance (Supplementary Fig. S13A). A similar strong skew towards lower differentiation activity in tumor cells compared with normal alveoli was observed in an independent 10× scRNA-seq LUAD dataset (“LUAD2”; binomial test, P = 2 × 10−9; Supplementary Fig. S12C; refs. 35, 55). Of note, this pattern of lower differentiation activity was not observed at the level of differential expression, but is more consistent with the widespread underexpression as seen in the bulk tissue LUAD (and LSCC) TCGA studies (Supplementary Fig. S12C; refs. 74, 75). In addition, PCA on the estimated TFA matrix revealed better separability of tumor and normal epithelial cells compared with a corresponding PCA on TF expression levels (Supplementary Fig. S13B). We observed very similar skews toward lower differentiation activity in cancer when estimating TFA of 56 colon-specific TFs (Supplementary Data S3; ref. 35) in two independent 10× scRNA-seq studies of colorectal adenocarcinoma (see Materials and Methods; Supplementary Fig. S12D; Supplementary Fig. S13C and S13D; ref. 56). Thus, these data establish that tissue-specific TFs display lower differentiation activity in corresponding single cancer cells, and across different cancer types.

Discussion

Here we have devised a computational method to dissect the heterogeneity of a preneoplastic epithelial cell population, identifying a subpopulation of cells with a high TFIL that is independently associated with high stemness and that is found enriched at the invasive cancer stage. Underlying this result is the important observation that the number of tissue-specific TFs displaying reduced differentiation activity increases during cancer progression, consistent with the progressive selection of a dedifferentiated stem-like state. Given that differentiation within the esophageal epithelium proceeds via a unipotent lineage driven by the stem and progenitor cells located in the basal layer, our observations are entirely consistent with a gradual enrichment of a basal stem/progenitor cell with cancer progression. However, it would appear that the reduced differentiation activity in preneoplastic and cancer cells is not just driven by an enrichment of the basal cell-of-origin within cancer lesions, because the reduced differentiation activity is also seen relative to the normal basal cells. That is, cancer cells display low differentiation activity of esophageal-specific TFs even when compared with their presumed cell of origin, pointing toward an aberrant epithelial reprogramming of the stem-like state. Supporting this, several of the identified TFs have tumor suppressor roles in esophageal cancer (e.g., PAX9; ref. 67) or in other squamous cell carcinomas (e.g., TRIM29; ref. 65). This epithelial reprogramming may even constitute a cancer hallmark, because we observed strong associations between TFIL, stemness, and cancer in other cancer types (colon and lung adenocarcinomas).

Although the role of the tumor stroma in promoting or preventing invasive cancer is now well established (76, 77), we propose that an epithelial reprogramming of tissue-specific TFs may drive the early dedifferentiation process that precedes cancer development. This reprogramming is characterized by a gradual and irreversible inactivation of tissue-specific TFs, which promotes cells to acquire a more plastic state. What DNA alterations may drive this reprogramming is still unclear. While whole genome and exome sequencing of ESCC and precancerous bulk tissue have identified numerous genomic aberrations affecting key pathways such as TP53, NOTCH1, and PI3K-AKT (78, 79), with the exception of NOTCH1, these alterations do not target dedifferentiation pathways and are seen to accumulate with age in the normal esophageal epithelium (17, 80, 81), indicating that other molecular alterations are causally implicated in the dedifferentiation process (82). In this regard, it is worth stressing again that most tissue-specific TFs do not in general represent hotspots for somatic mutations or genomic deletions, either in normal cells (13, 14, 83), preneoplastic lesions (82) or cancer itself (11, 14), a result we have confirmed here with WGS. Importantly, we have shown that our TFIL measure provides a much better correlate of the dedifferentiation state of cancer cells compared with a traditional marker such as NOTCH1 mutation, supporting the view that the dedifferentiation process is largely independent of NOTCH1 mutations. Using a novel approach that integrates LCM-WGBS data with scRNA-seq profiles from the same ESCC samples, we have shown that the reduced differentiation activity of tissue-specific TFs is instead frequently associated with promoter hypermethylation. This association is also unlikely to be driven by the increased enrichment of the basal cell-of-origin in cancer lesions, because the low differentiation activity of tissue-specific TFs in adult stem cells is mainly controlled by


repressive histone marks, and not by promoter hypermethylation (84). In line with this, promoter hypermethylation of tissue-specific TFs is observed in normal cells exposed to cancer risk factors, including age (85), and has been proposed to be a cancer hallmark (3, 19). However, we cannot exclude the possibility that other epigenetic mechanisms, for instance somatic mutations affecting epigenetic enzymes, could drive DNAm changes affecting tissue-specific TFs. It will be important for future work to generate scRNA-seq data jointly with scATAC-seq (86), histone modifications (87) or DNAm data (88), in the same cells, as this could establish direct relationships between TFIL and changes to chromatin accessibility.

Overall, we acknowledge that our study and the conclusions drawn from it are subject to several limitations. First, our in silico predictions would require experimental validation. To establish experimentally if the preneoplastic cells we have identified represent those at highest cancer risk would require advanced in vivo lineage tracing techniques (89) that have not yet been developed. Second, how to epigenetically perturb a number of tissue-specific TFs in a way that mimics the epigenetic changes seen in cancer development is also a formidable challenge, yet necessary to explore the functional consequences for cellular properties such as stemness and plasticity. Third, the number of normal basal cells analyzed at single-cell resolution was relatively low, which only reflects the inherent difficulty of acquiring large numbers of such cells from patients with ESCC. Although we did address this by analyzing spatial transcriptomic data encompassing over 1,000 basal epithelial spots from three patients with ESCC, limitations remain in that the purity of these epithelial spots is likely to be only around 70%. In this regard we note that even if these normal epithelial spots were to contain 30% stromal cells, this would only act to artificially lower the TFA values for the epithelial-specific TFs, reducing power to observe differences with the cancer cells, yet here we were able to observe a reduction in cancer cells, suggesting that this was not a major limitation. Moreover, it is worth highlighting that our results were strongly consistent across six independent datasets (2 scRNA-seq human ESCC cohorts, 1 scRNA-seq mouse study of ESCC development, and 3 spatial transcriptomic datasets from three independent ESCC patients), a clear indication that our results are not explained by small cell numbers or random chance.

In summary, we have here shown that the number of tissue-specific TFs displaying low differentiation activity in preneoplastic epithelial cells identifies dedifferentiated stem-like cells that appear to be selected for during cancer progression. These novel insights and the computational CancerStemID framework presented herein could facilitate the development of the much-needed early detection and cancer risk prediction markers for deadly cancers such as ESCC, or alternatively, help assess the efficacy of cancer prevention trials (90).

Authors’ Disclosures

No disclosures were reported.

Authors’ Contributions

T. Liu: Data curation, formal analysis, visualization, writing–review and editing. X. Zhao: Data curation, formal analysis, investigation. Y. Lin: Formal analysis, visualization. Q. Luo: Formal analysis, visualization. S. Zhang: Formal analysis, investigation, visualization, writing–review and editing. Y. Xi: Data curation. Y. Chen: Data curation, formal analysis. L. Lin: Data curation. W. Fan: Data curation. J. Yang: Data curation. Y. Ma: Data curation. A.K. Maity: Formal analysis. Y. Huang: Validation, methodology. J. Wang: Validation, methodology. J. Chang: Conceptualization, supervision, funding acquisition, writing–review and editing. D. Lin: Conceptualization, supervision, funding acquisition, writing–review and editing. A.E. Teschendorff: Conceptualization, formal analysis, supervision, funding acquisition, visualization, methodology, writing–original draft, writing–review and editing. C. Wu: Conceptualization, data curation, supervision, funding acquisition, writing–review and editing.

Acknowledgments

This project was funded by the National Natural Science Foundation of China (81988101 to D. Lin and C. Wu; 31771464 and 32170652 to A.E. Teschendorff; 81872696 and 82073654 to J. Chang), National Natural Science Fund for Distinguished Young Scholars (81725015 to C. Wu), Beijing Outstanding Young Scientist Program (BJJWZYJH01201910023027 to C. Wu), Medical and Health Technology Innovation Project of Chinese Academy of Medical Sciences (2016-I2M-3–019 to D. Lin; 2016-I2M-4–002 to C. Wu; 2021-I2M-1-013, 2019-I2M-2–001 to D. Lin and C. Wu), Natural Science Fund for Distinguished Young Scholars of Hubei Province (2020CFA067 to J. Chang). The authors thank all the patients and physicians participating in the research at Linzhou Cancer Hospital and Linzhou Esophageal Cancer Hospital.

The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

Received February 24, 2022; revised April 5, 2022; accepted May 6, 2022; published first May 10, 2022.

References
1. Tirosh I, Venteicher AS, Hebert C, Escalante LE, Patel AP, Yizhak K, et al. Single-cell RNA-seq supports a developmental hierarchy in human oligodendroglioma. Nature 2016;539:309–13.
2. Feinberg AP, Ohlsson R, Henikoff S. The epigenetic progenitor origin of human cancer. Nat Rev Genet 2006;7:21–33.
3. Baylin SB, Ohm JE. Epigenetic gene silencing in cancer - a mechanism for early oncogenic pathway addiction? Nat Rev Cancer 2006;6:107–16.
4. Schedl A, Hastie N. Multiple roles for the Wilms’ tumour suppressor gene, WT1 in genitourinary development. Mol Cell Endocrinol 1998;140:65–9.
5. Tao Y, Kang B, Petkovich DA, Bhandari YR, In J, Stein-O’Brien G, et al. Aging-like spontaneous epigenetic silencing facilitates Wnt activation, stemness, and Braf(V600E)-induced tumorigenesis. Cancer Cell 2019;35:315–28.
6. Xie W, Kagiampakis I, Pan L, Zhang YW, Murphy L, Tao Y, et al. DNA methylation patterns separate senescence from transformation potential and indicate cancer risk. Cancer Cell 2018;33:309–21.
7. Maegawa S, Gough SM, Watanabe-Okochi N, Lu Y, Zhang N, Castoro RJ, et al. Age-related epigenetic drift in the pathogenesis of MDS and AML. Genome Res 2014;24:580–91.
8. Issa JP. Epigenetic variation and cellular Darwinism. Nat Genet 2011;43:724–6.
9. Winslow MM, Dayton TL, Verhaak RG, Kim-Kiselak C, Snyder EL, Feldser DM, et al. Suppression of lung adenocarcinoma progression by Nkx2–1. Nature 2011;473:101–4.
10. Zhao W, Hisamuddin IM, Nandan MO, Babbin BA, Lamb NE, Yang VW. Identification of Kruppel-like factor 4 as a potential tumor suppressor gene in colorectal cancer. Oncogene 2004;23:395–402.
11. Teschendorff AE, Zheng SC, Feber A, Yang Z, Beck S, Widschwendter M. The multi-omic landscape of transcription factor inactivation in cancer. Genome Med 2016;8:89.
12. Chen Y, Widschwendter M, Teschendorff AE. Systems-epigenomics inference of transcription factor activity implicates aryl-hydrocarbon-receptor inactivation as a key event in lung cancer development. Genome Biol 2017;18:236.
13. Moore L, Leongamornlert D, Coorens THH, Sanders MA, Ellis P, Dentro SC, et al. The mutational landscape of normal human endometrial epithelium. Nature 2020;580:640–6.
14. Alexandrov LB, Kim J, Haradhvala NJ, Huang MN, Tian Ng AW, Wu Y, et al. The repertoire of mutational signatures in human cancer. Nature 2020;578:94–101.


15. Lee-Six H, Olafsson S, Ellis P, Osborne RJ, Sanders MA, Moore L, et al. The landscape of somatic mutation in normal colorectal epithelial cells. Nature 2019;574:532–7.
16. Brunner SF, Roberts ND, Wylie LA, Moore L, Aitken SJ, Davies SE, et al. Somatic mutations and clonal dynamics in healthy and cirrhotic human liver. Nature 2019;574:538–42.
17. Martincorena I, Fowler JC, Wabik A, Lawson ARJ, Abascal F, Hall MWJ, et al. Somatic mutant clones colonize the human esophagus with age. Science 2018;362:911–7.
18. Li R, Di L, Li J, Fan W, Liu Y, Guo W, et al. A body map of somatic mutagenesis in morphologically normal human tissues. Nature 2021;597:398–403.
19. Ohm JE, McGarvey KM, Yu X, Cheng L, Schuebel KE, Cope L, et al. A stem cell-like chromatin pattern may predispose tumor suppressor genes to DNA hypermethylation and heritable silencing. Nat Genet 2007;39:237–42.
20. Schlesinger Y, Straussman R, Keshet I, Farkash S, Hecht M, Zimmerman J, et al. Polycomb-mediated methylation on Lys27 of histone H3 pre-marks genes for de novo methylation in cancer. Nat Genet 2007;39:232–6.
21. Tang F, Lao K, Surani MA. Development and applications of single-cell transcriptome analysis. Nat Methods 2011;8:S6–11.
22. Teschendorff AE, Feinberg AP. Statistical mechanics meets single-cell biology. Nat Rev Genet 2021;22:459–76.
23. Teschendorff AE, Enver T. Single-cell entropy for accurate estimation of differentiation potency from a cell’s transcriptome. Nat Commun 2017;8:15599.
24. Banerji CR, Miranda-Saavedra D, Severini S, Widschwendter M, Enver T, Zhou JX, et al. Cellular network entropy as the energy potential in Waddington’s differentiation landscape. Sci Rep 2013;3:3039.
25. Grun D, Muraro MJ, Boisset JC, Wiebrands K, Lyubimova A, Dharmadhikari G, et al. De novo prediction of stem cell identity using single-cell transcriptome data. Cell Stem Cell 2016;19:266–77.
26. Yao J, Cui Q, Fan W, Ma Y, Chen Y, Liu T, et al. Single-cell transcriptomic analysis in a mouse model deciphers cell transition states in the multistep development of esophageal cancer. Nat Commun 2020;11:3715.
27. Lin DC, Wang MR, Koeffler HP. Genomic and epigenomic aberrations in esophageal squamous cell carcinoma and implications for patients. Gastroenterology 2018;154:374–89.
28. Nagtegaal ID, Odze RD, Klimstra D, Paradis V, Rugge M, Schirmacher P, et al. The 2019 WHO classification of tumours of the digestive system. Histopathology 2020;76:182–8.
29. Zhang P, Yang M, Zhang Y, Xiao S, Lai X, Tan A, et al. Dissecting the single-cell transcriptome network underlying gastric premalignant lesions and early gastric cancer. Cell Rep 2019;27:1934–47.
30. Butler A, Hoffman P, Smibert P, Papalexi E, Satija R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat Biotechnol 2018;36:411–20.
31. Korsunsky I, Millard N, Fan J, Slowikowski K, Zhang F, Wei K, et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat Methods 2019;16:1289–96.
32. Becht E, McInnes L, Healy J, Dutertre CA, Kwok IWH, Ng LG, et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat Biotechnol 2018.
33. Lun AT, McCarthy DJ, Marioni JC. A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Res 2016;5:2122.
34. Zhang X, Peng L, Luo Y, Zhang S, Pu Y, Chen Y, et al. Dissecting esophageal squamous-cell carcinoma ecosystem by single-cell transcriptomic analysis. 2021;12:5291.
35. Teschendorff AE, Wang N. Improved detection of tumor suppressor events in single-cell RNA-Seq data. NPJ Genom Med 2020;5:43.
36. Thul PJ, Akesson L, Wiking M, Mahdessian D, Geladaki A, Ait Blal H, et al. A subcellular map of the human proteome. Science 2017;356:eaal3321.
37. Roth RB, Hevezi P, Lee J, Willhite D, Lechner SM, Foster AC, et al. Gene expression analyses reveal molecular relationships among 20 regions of the human CNS. Neurogenetics 2006;7:67–80.
38. Oki S, Ohta T, Shioi G, Hatanaka H, Ogasawara O, Okuda Y, et al. ChIP-Atlas: a data-mining suite powered by full integration of public ChIP-seq data. EMBO Rep 2018;19:e46255.
39. Madissoon E, Wilbrey-Clark A, Miragaia RJ, Saeb-Parsy K, Mahbubani KT, Georgakopoulos N, et al. scRNA-seq assessment of the human lung, spleen, and esophagus tissue stability after cold preservation. Genome Biol 2019;21:1.
40. Shehata M, Teschendorff A, Sharp G, Novcic N, Russell A, Avril S, et al. Phenotypic and functional characterization of the luminal cell hierarchy of the mammary gland. Breast Cancer Res 2012;14:R134.
41. Smyth GK. Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 2004;3:Article3.
42. Angerer P, Haghverdi L, Buttner M, Theis FJ, Marr C, Buettner F. destiny: diffusion maps for large-scale single-cell data in R. Bioinformatics 2016;32:1241–3.
43. Haghverdi L, Buttner M, Wolf FA, Buettner F, Theis FJ. Diffusion pseudotime robustly reconstructs lineage branching. Nat Methods 2016;13:845–8.
44. Pons P, Latapy M. Computing communities in large networks using random walks. Berlin, Heidelberg: Springer; 2005.
45. Teschendorff AE, Maity AK, Hu X, Weiyan C, Lechner M. Ultra-fast scalable estimation of single-cell differentiation potency from scRNA-Seq data. Bioinformatics 2021;37:1528–34.
46. Athanasiadis EI, Botthof JG, Andres H, Ferreira L, Lio P, Cvejic A. Single-cell RNA-sequencing uncovers transcriptional states and fate decisions in haematopoiesis. Nat Commun 2017;8:2045.
47. Shi J, Teschendorff AE, Chen W, Chen L, Li T. Quantifying Waddington’s epigenetic landscape: a comparison of single-cell potency measures. Briefings Bioinf 2018.
48. Tirosh I, Izar B, Prakadan SM, Wadsworth MH II, Treacy D, Trombetta JJ, et al. Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science 2016;352:189–96.
49. Su H, Hu N, Yang HH, Wang C, Takikita M, Wang QH, et al. Global gene expression profiling and validation in esophageal squamous cell carcinoma and its association with clinical phenotypes. Clin Cancer Res 2011;17:2955–66.
50. Zhao Y, Wei L, Shao M, Huang X, Chang J, Zheng J, et al. BRCA1-associated protein increases invasiveness of esophageal squamous cell carcinoma. Gastroenterology 2017;153:1304–19.
51. Chang J, Tan W, Ling Z, Xi R, Shao M, Chen M, et al. Genomic analysis of oesophageal squamous-cell carcinoma identifies alcohol drinking-related mutation signature and genomic alterations. Nat Commun 2017;8:15290.
52. Krueger F, Andrews SR. Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications. Bioinformatics 2011;27:1571–2.
53. Feng H, Conneely KN, Wu H. A Bayesian hierarchical model to detect differentially methylated loci from single nucleotide resolution sequencing data. Nucleic Acids Res 2014;42:e69.
54. Kim N, Kim HK, Lee K, Hong Y, Cho JH, Choi JW, et al. Single-cell RNA sequencing demonstrates the molecular and cellular reprogramming of metastatic lung adenocarcinoma. Nat Commun 2020;11:2285.
55. Lambrechts D, Wauters E, Boeckx B, Aibar S, Nittner D, Burton O, et al. Phenotype molding of stromal cells in the lung tumor microenvironment. Nat Med 2018;24:1277–89.
56. Lee HO, Hong Y, Etlioglu HE, Cho YB, Pomella V, Van den Bosch B, et al. Lineage-dependent gene expression programs influence the immune landscape of colorectal cancer. Nat Genet 2020;52:594–603.
57. Li H, Courtois ET, Sengupta D, Tan Y, Chen KH, Goh JJL, et al. Reference component analysis of single-cell transcriptomes elucidates cellular heterogeneity in human colorectal tumors. Nat Genet 2017;49:708–18.
58. Heinaniemi M, Nykter M, Kramer R, Wienecke-Baldacchino A, Sinkkonen L, Zhou JX, et al. Gene-pair expression signatures reveal lineage control. Nat Methods 2013;10:577–83.
59. Holland CH, Tanevski J, Perales-Paton J, Gleixner J, Kumar MP, Mereu E, et al. Robustness and applicability of transcription factor and pathway analysis tools on single-cell RNA-seq data. Genome Biol 2020;21:36.
60. Chen S, Mar JC. Evaluating methods of inferring gene regulatory networks highlights their lack of performance for single cell gene expression data. BMC Bioinf 2018;19:232.
61. Consortium GT. The Genotype-Tissue Expression (GTEx) project. Nat Genet 2013;45:580–5.
62. Zhang Y, Yang Y, Jiang M, Huang SX, Zhang W, Al Alam D, et al. 3D modeling of esophageal development using human PSC-derived basal progenitors reveals a critical role for notch signaling. Cell Stem Cell 2018;23:516–29.
63. Trisno SL, Philo KED, McCracken KW, Cata EM, Ruiz-Torres S, Rankin SA, et al. Esophageal organoids from human pluripotent stem cells delineate Sox2 functions during esophageal specification. Cell Stem Cell 2018;23:501–15.
64. Jeong Y, Rhee H, Martin S, Klass D, Lin Y, Nguyen le XT, et al. Identification and genetic manipulation of human and mouse oesophageal stem cells. Gut 2016;65:1077–86.


65. Yanagi T, Watanabe M, Hata H, Kitamura S, Imafuku K, Yanagi H, et al. Loss of TRIM29 alters keratin distribution to promote cell invasion in squamous cell carcinoma. Cancer Res 2018;78:6795–806.
66. Smirnov A, Lena AM, Cappello A, Panatta E, Anemona L, Bischetti S, et al. ZNF185 is a p63 target gene critical for epidermal differentiation and squamous cell carcinoma development. Oncogene 2019;38:1625–38.
67. Xiong Z, Ren S, Chen H, Liu Y, Huang C, Zhang YL, et al. PAX9 regulates squamous cell differentiation and carcinogenesis in the oro-oesophageal epithelium. J Pathol 2018;244:164–75.
68. Watanabe H, Ma Q, Peng S, Adelmant G, Swain D, Song W, et al. SOX2 and p63 colocalize at genetic loci in squamous cell carcinomas. J Clin Invest 2014;124:1636–45.
69. Wu Z, Zhou J, Zhang X, Zhang Z, Xie Y, Liu JB, et al. Reprogramming of the esophageal squamous carcinoma epigenome by SOX2 promotes ADAR1 dependence. Nat Genet 2021;53:881–94.
70. Jiang Y, Jiang YY, Xie JJ, Mayakonda A, Hazawa M, Chen L, et al. Co-activation of super-enhancer-driven CCAT1 by TP63 and SOX2 promotes squamous cancer progression. Nat Commun 2018;9:3619.
71. Jiang YY, Jiang Y, Li CQ, Zhang Y, Dakle P, Kaur H, et al. TP63, SOX2, and KLF5 establish a core regulatory circuitry that controls epigenetic and transcription patterns in esophageal squamous cell carcinoma cell lines. Gastroenterology 2020;159:1311–27.
72. Li LY, Yang Q, Jiang YY, Yang W, Jiang Y, Li X, et al. Interplay and cooperation between SREBF1 and master transcription factors regulate lipid metabolism and tumor-promoting pathways in squamous cancer. Nat Commun 2021;12:4362.
73. Rowbotham SP, Kim CF. Diverse cells at the origin of lung adenocarcinoma. Proc Nat Acad Sci USA 2014;111:4745–6.
74. Cancer Genome Atlas Research Network. Comprehensive molecular profiling of lung adenocarcinoma. Nature 2014;511:543–50.
75. Cancer Genome Atlas Research Network. Comprehensive genomic characterization of squamous cell lung cancers. Nature 2012;489:519–25.
76. Mascaux C, Angelova M, Vasaturo A, Beane J, Hijazi K, Anthoine G, et al. Immune evasion before tumour invasion in early lung squamous carcinogenesis. Nature 2019;571:570–5.
77. Sharma A, Seow JJW, Dutertre CA, Pai R, Bleriot C, Mishra A, et al. Onco-fetal reprogramming of endothelial cells drives immunosuppressive macrophages in hepatocellular carcinoma. Cell 2020;183:377–94.
78. The Cancer Genome Atlas Research Network. Integrated genomic characterization of oesophageal carcinoma. Nature 2017;541:169–75.
79. Gao YB, Chen ZL, Li JG, Hu XD, Shi XJ, Sun ZM, et al. Genetic landscape of esophageal squamous cell carcinoma. Nat Genet 2014;46:1097–102.
80. Yokoyama A, Kakiuchi N, Yoshizato T, Nannya Y, Suzuki H, Takeuchi Y, et al. Age-related remodelling of oesophageal epithelia by mutated cancer drivers. Nature 2019;565:312–7.
81. Tomasetti C, Poling J, Roberts NJ, London NR Jr, Pittman ME, Haffner MC, et al. Cell division rates decrease with age, providing a potential explanation for the age-dependent deceleration in cancer incidence. Proc Nat Acad Sci USA 2019;116:20482–8.
82. Yamashita S, Kishino T, Takahashi T, Shimazu T, Charvat H, Kakugawa Y, et al. Genetic and epigenetic alterations in normal tissues have differential impacts on cancer risk among tissues. Proc Nat Acad Sci USA 2018;115:1328–33.
83. Yoshida K, Gowers KHC, Lee-Six H, Chandrasekharan DP, Coorens T, Maughan EF, et al. Tobacco smoking and somatic mutations in human bronchial epithelium. Nature 2020;578:266–72.
84. Meissner A, Mikkelsen TS, Gu H, Wernig M, Hanna J, Sivachenko A, et al. Genome-scale DNA methylation maps of pluripotent and differentiated cells. Nature 2008;454:766–70.
85. Teschendorff AE, Menon U, Gentry-Maharaj A, Ramus SJ, Weisenberger DJ, Shen H, et al. Age-dependent DNA methylation of genes that are suppressed in stem cells is a hallmark of cancer. Genome Res 2010;20:440–6.
86. Ma S, Zhang B, LaFave LM, Earl AS, Chiang Z, Hu Y, et al. Chromatin potential identified by shared single-cell profiling of RNA and chromatin. Cell 2020;183:1103–16.
87. Adey AC. Single-cell multiomics to probe relationships between histone modifications and transcription. Nat Methods 2021;18:602–3.
88. Argelaguet R, Clark SJ, Mohammed H, Stapel LC, Krueger C, Kapourani CA, et al. Multi-omics profiling of mouse gastrulation at single-cell resolution. Nature 2019;576:487–91.
89. Wagner DE, Klein AM. Lineage tracing meets single-cell omics: opportunities and challenges. Nat Rev Genet 2020;21:410–27.
90. Spira A, Yurgelun MB, Alexandrov L, Rao A, Bejar R, Polyak K, et al. Precancer Atlas to drive precision prevention trials. Cancer Res 2017;77:1510–41.
