Methods in

Molecular Biology 1935

Guo-Cheng Yuan Editor

Computational
Methods for
Single-Cell Data
Analysis
METHODS IN MOLECULAR BIOLOGY

Series Editor
John M. Walker
School of Life and Medical Sciences
University of Hertfordshire
Hatfield, Hertfordshire, AL10 9AB, UK

For further volumes:

http://www.springer.com/series/7651
Computational Methods
for Single-Cell Data Analysis

Edited by

Guo-Cheng Yuan
Dana–Farber Cancer Institute and Harvard Chan School of Public Health, Boston, MA, USA

ISSN 1064-3745 ISSN 1940-6029 (electronic)

Methods in Molecular Biology
ISBN 978-1-4939-9056-6 ISBN 978-1-4939-9057-3 (eBook)
https://doi.org/10.1007/978-1-4939-9057-3
Library of Congress Control Number: 2018967307

© Springer Science+Business Media, LLC, part of Springer Nature 2019

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction
on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation,
computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply,
even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations
and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to
be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty,
express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.
The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Humana Press imprint is published by the registered company Springer Science+Business Media, LLC, part of
Springer Nature.
The registered company address is: 233 Spring Street, New York, NY 10013, U.S.A.
Preface

The cell is the fundamental unit of life. The biological functions of an organ or tissue are
results of coordinated action of a large number of cells, each having its own properties and
dynamic behavior. While it is traditional to classify cells with similar function and morphol-
ogy as cell types, it is also well-recognized that, even within each cell type, there remain
significant differences, or states, among individual cells. Current knowledge of the reper-
toire of cell types and cell states, as well as their dynamic changes, remains highly incomplete.
Systematic, comprehensive characterization of spatial and temporal organization of cellular
heterogeneity, along with the mechanisms underlying cell-type/state transition and main-
tenance, has important implications in development and diseases.
It is not until recently that it has become feasible to systematically investigate cellular
heterogeneity at the single-cell resolution, thanks to the rapid development of a number of
advanced technologies including sequencing, imaging, and microfluidic devices. Collec-
tively, single-cell technologies have created exciting opportunities to systematically charac-
terize the molecular behavior of individual cells at the omics scale. At the same time, the
analysis and integration of single-cell omic data are difficult due to a number of challenges
such as sparsity, technical variability, and spatial-temporal complexity.
During the past few years, numerous computational methods and software packages
have been developed to overcome these challenges. The aim of this book is to introduce to
the community the state of the art of computational approaches in single-cell data analysis.
Each chapter presents a computational toolbox that is aimed to overcome a specific chal-
lenge in single-cell analysis, such as data normalization, rare cell-type identification, and
spatial transcriptomics analysis. Rather than explaining the mathematical details, here the
focus is on hands-on implementation of computational methods for analyzing experimental
data. Taken together, these chapters cover a wide range of tasks and may serve as a handbook
for single-cell data analysis.
Finally, I would like to thank Prof. John M. Walker for his kind invitation and sustained
support throughout the preparation of this book. I would also like to express my sincere
gratitude to all the contributors for sharing their protocols.

Boston, MA, USA Guo-Cheng Yuan

Contents

Preface
Contributors
1 Quality Control of Single-Cell RNA-seq
Peng Jiang
2 Normalization for Single-Cell RNA-Seq Data Analysis
Rhonda Bacher
3 Analysis of Technical and Biological Variability in Single-Cell RNA Sequencing
Beomseok Kim, Eunmin Lee, and Jong Kyoung Kim
4 Identification of Cell Types from Single-Cell Transcriptomic Data
Karthik Shekhar and Vilas Menon
5 Rare Cell Type Detection
Lan Jiang
6 scMCA: A Tool to Define Mouse Cell Types Based on Single-Cell Digital Expression
Huiyu Sun, Yincong Zhou, Lijiang Fei, Haide Chen, and Guoji Guo
7 Differential Pathway Analysis
Jean Fan
8 Pseudotime Reconstruction Using TSCAN
Zhicheng Ji and Hongkai Ji
9 Estimating Differentiation Potency of Single Cells Using Single-Cell Entropy (SCENT)
Weiyan Chen and Andrew E. Teschendorff
10 Inference of Gene Co-expression Networks from Single-Cell RNA-Sequencing Data
Alicia T. Lamere and Jun Li
11 Single-Cell Allele-Specific Gene Expression Analysis
Meichen Dong and Yuchao Jiang
12 Using BRIE to Detect and Analyze Splicing Isoforms in scRNA-Seq Data
Yuanhua Huang and Guido Sanguinetti
13 Preprocessing and Computational Analysis of Single-Cell Epigenomic Datasets
Caleb Lareau, Divy Kangeyan, and Martin J. Aryee
14 Experimental and Computational Approaches for Single-Cell Enhancer Perturbation Assay
Shiqi Xie and Gary C. Hon
15 Antigen Receptor Sequence Reconstruction and Clonality Inference from scRNA-Seq Data
Ida Lindeman and Michael J. T. Stubbington
16 A Hidden Markov Random Field Model for Detecting Domain Organizations from Spatial Transcriptomic Data
Qian Zhu
Index
Contributors

MARTIN J. ARYEE Department of Biostatistics, Harvard T.H. Chan School of Public Health,
Boston, MA, USA; Department of Pathology, Massachusetts General Hospital, Boston, MA,
USA; Broad Institute of MIT and Harvard, Cambridge, MA, USA
RHONDA BACHER Department of Biostatistics, University of Florida, Gainesville, FL, USA
HAIDE CHEN Center for Stem Cell and Regenerative Medicine, Zhejiang University School
of Medicine, Hangzhou, China; Stem Cell Institute, Zhejiang University, Hangzhou,
China
WEIYAN CHEN CAS Key Lab of Computational Biology, CAS-MPG Partner Institute for
Computational Biology, Shanghai Institute of Nutrition and Health, Shanghai Institute of
Biological Sciences, University of Chinese Academy of Sciences, Chinese Academy of Sciences,
Shanghai, China
MEICHEN DONG Department of Biostatistics, Gillings School of Global Public Health,
University of North Carolina, Chapel Hill, NC, USA
JEAN FAN Department of Chemistry and Chemical Biology, Harvard University, Boston,
MA, USA
LIJIANG FEI Center for Stem Cell and Regenerative Medicine, Zhejiang University School of
Medicine, Hangzhou, China; Stem Cell Institute, Zhejiang University, Hangzhou, China
GUOJI GUO Center for Stem Cell and Regenerative Medicine, Zhejiang University School of
Medicine, Hangzhou, China; Stem Cell Institute, Zhejiang University, Hangzhou, China
GARY C. HON Department of Obstetrics and Gynecology, Cecil H. and Ida Green Center for
Reproductive Biology Sciences, University of Texas Southwestern Medical Center, Dallas,
TX, USA
YUANHUA HUANG EMBL-European Bioinformatics Institute, Cambridgeshire, UK
HONGKAI JI Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health,
Baltimore, MD, USA
ZHICHENG JI Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health,
Baltimore, MD, USA
LAN JIANG Howard Hughes Medical Institute, Boston Children’s Hospital, Boston, MA,
USA; Program in Cellular and Molecular Medicine, Boston Children’s Hospital, Boston,
MA, USA; Division of Hematology/Oncology, Department of Pediatrics, Boston Children’s
Hospital, Boston, MA, USA
PENG JIANG Regenerative Biology Laboratory, Morgridge Institute for Research, Madison,
WI, USA
YUCHAO JIANG Department of Biostatistics, Gillings School of Global Public Health,
University of North Carolina, Chapel Hill, NC, USA; Department of Genetics, School of
Medicine, University of North Carolina, Chapel Hill, NC, USA; Lineberger
Comprehensive Cancer Center, University of North Carolina, Chapel Hill, NC, USA
DIVY KANGEYAN Department of Biostatistics, Harvard T.H. Chan School of Public Health,
Boston, MA, USA; Department of Pathology, Massachusetts General Hospital, Boston, MA,
USA
BEOMSEOK KIM Department of New Biology, DGIST, Daegu, Republic of Korea
JONG KYOUNG KIM Department of New Biology, DGIST, Daegu, Republic of Korea
ALICIA T. LAMERE Mathematics Department, Bryant University, Smithfield, RI, USA


CALEB LAREAU Department of Biostatistics, Harvard T.H. Chan School of Public Health,
Boston, MA, USA; Department of Pathology, Massachusetts General Hospital, Boston, MA,
USA
EUNMIN LEE Department of New Biology, DGIST, Daegu, Republic of Korea
JUN LI Applied and Computational Mathematics and Statistics Department, University of
Notre Dame, Notre Dame, IN, USA
IDA LINDEMAN Wellcome Sanger Institute, Hinxton, Cambridge, UK; KG Jebsen Coeliac
Disease Research Centre and Department of Immunology, University of Oslo, Oslo, Norway
VILAS MENON Janelia Research Campus, Howard Hughes Medical Institute, Ashburn, VA,
USA; Columbia University Medical Center, New York, NY, USA
GUIDO SANGUINETTI School of Informatics, University of Edinburgh, Edinburgh, UK
KARTHIK SHEKHAR Klarman Cell Observatory, Broad Institute of MIT and Harvard,
Cambridge, MA, USA
MICHAEL J. T. STUBBINGTON Wellcome Sanger Institute, Hinxton, Cambridge, UK
HUIYU SUN Center for Stem Cell and Regenerative Medicine, Zhejiang University School of
Medicine, Hangzhou, China; Stem Cell Institute, Zhejiang University, Hangzhou, China
ANDREW E. TESCHENDORFF CAS Key Lab of Computational Biology, CAS-MPG Partner
Institute for Computational Biology, Shanghai Institute of Nutrition and Health,
Shanghai Institute of Biological Sciences, University of Chinese Academy of Sciences,
Chinese Academy of Sciences, Shanghai, China; UCL Cancer Institute, University College
London, London, UK
SHIQI XIE Department of Obstetrics and Gynecology, Cecil H. and Ida Green Center for
Reproductive Biology Sciences, University of Texas Southwestern Medical Center, Dallas,
TX, USA
YINCONG ZHOU Stem Cell Institute, Zhejiang University, Hangzhou, China; College of Life
Sciences, Zhejiang University, Hangzhou, China
QIAN ZHU Dana-Farber Cancer Institute, Boston, MA, USA
Chapter 1

Quality Control of Single-Cell RNA-seq

Peng Jiang

Abstract
Single-cell RNA-seq (scRNA-seq) is emerging as a promising technology to characterize and dissect
cell-to-cell variability. However, the mixture of technical noise and intrinsic biological variability makes
separating technical artifacts from cells with real biological variation particularly challenging. Proper detection
and filtering of technical artifacts before downstream analysis are critical. Here, we present a protocol that
integrates both gene expression patterns and data quality to detect technical artifacts in scRNA-seq samples.

Key words scRNA-seq, Quality control, Integrate, Gene expression patterns, Data quality

1 Introduction

Single-cell RNA-seq (scRNA-seq) provides a relatively unbiased
approach to investigate the heterogeneity of cells in complex mixtures
[1]. It has revolutionized our capacity to understand the
transcriptomic diversity of cellular states [2, 3], lineages [4], and
diseases [5]. However, one of the major challenges of this technol-
ogy is the noise behind the data [6, 7]. For example, profiling low
amounts of mRNA will likely lead to missing transcripts (“dropout”
events) during the reverse-transcription step and also substantially
distort original transcript abundance [6, 8]. Comparison of the top
differentially expressed genes between cell populations shows poor
consistency, suggesting that high variance could be caused by high-
magnitude outliers [8]. On the other hand, gene expression among
cells is inherently stochastic and the cell-to-cell variations can also
be a result of transcriptional bursts or fluctuations [9]. Controlling
quality of scRNA-seq and discarding technical artifacts are very
important for downstream analysis.
To detect potential technical artifacts (bad samples) in scRNA-
seq, previous studies have used various strategies that can be gener-
ally grouped into three categories. The first category involves using
housekeeping genes to perform quality control (QC). For example,
cells are filtered out if certain housekeeping genes (e.g., Actb,

Guo-Cheng Yuan (ed.), Computational Methods for Single-Cell Data Analysis, Methods in Molecular Biology, vol. 1935,
https://doi.org/10.1007/978-1-4939-9057-3_1, © Springer Science+Business Media, LLC, part of Springer Nature 2019


Gapdh) are not expressed or abnormally expressed [10, 11]. The

assumption of this approach is that housekeeping genes are highly
and consistently expressed. This is true for bulk RNAs but it is not
necessarily true for single cells (see Note 1). For instance, a study
using single-cell qPCR has shown that the expression of
housekeeping genes varies widely among individual cells,
and that different cell types have distinct housekeeping gene
expression patterns [12]. Thus, relying on housekeeping genes
to perform QC does not work for scRNA-seq samples. The second
category for QC involves using overall gene expression patterns to
define technical artifacts. For example, cells whose gene expression
patterns are distinct from those of the majority of the
cells are excluded from downstream analysis [13] (see Notes 2–3).
The major problem with methods in this category is that they
may remove cells with real biological variation. The third category
involves using the number of genes detected and/or the read
mapping rate to define technical artifacts [14]. However, the number
of genes detected varies among experiments depending on the
quality of a particular library, cell type, or RNA-protocol. The
mapping rate cutoff is also hard to choose, and thus the cutoff
settings are typically arbitrary. Thus, although single-cell
approaches hold great promise for investigating cell heterogeneity,
QC remains one of the major challenges [7]. Nonetheless,
previous studies and our own work demonstrate that integrating both
gene expression patterns and sequencing data quality can be a
reasonable strategy for performing QC [15]. The basic assumption
of this approach is that if gene expression outliers are also associated
with poor sequencing library quality, they are more likely to be
technical artifacts than cells with real biological variation. We also
assume that gene expression outliers contain both cells with real
biological variation and technical artifacts, but the rest of the cells
(main population cells) in general are more likely to contain good
quality cells. Thus, we can use cells of the main population as
controls to estimate data quality cutoffs and a corresponding false
positive rate (FPR) (Fig. 1).
Here, we describe in detail our procedure for detecting techni-
cal artifacts in scRNA-seq using three batches of our published
human embryonic stem cells (ES cells) scRNA-seq data [16].

2 Materials

2.1 Lab Equipment

1. C1 Single-Cell Auto Prep IFC (Fluidigm).
2. EVOS FL Auto Cell Imaging system (Life Technologies).
3. Illumina HiSeq 2500 system.

Fig. 1 Illustration of the quality control (QC) framework for scRNA-seq. Cells are separated, based on gene
expression patterns, into gene expression outliers and cells of the main population. The data quality cutoffs are
determined by allowing a certain percentage (e.g., <5%) of main population cells to fail them; this percentage
is the estimated false positive rate. Technical artifacts are defined as gene expression outliers that fail to pass
the data quality cutoffs. Subpopulation cells are defined as gene expression outliers that pass the data quality
cutoffs

2.2 Kits

1. SMARTer PCR cDNA Synthesis kit (Clontech).
2. Advantage 2 PCR kit (Clontech).
3. Nextera XT DNA Sample Preparation Index Kit (Illumina).

2.3 ScRNA-seq Data

1. The raw scRNA-seq dataset (H1) can be accessed from the Gene
Expression Omnibus (GEO) under accession number GSE64016.
2. The files downloaded from GEO are in SRA format.
3. The SRA toolkit
(http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software)
can be used to convert files from SRA format to FASTQ format
via the “fastq-dump” utility.

3 Methods

3.1 H1 Human Embryonic Stem Cells (hESCs)

1. Undifferentiated H1 human embryonic stem cells (hESCs)
were cultured in E8 medium [17] on Matrigel-coated tissue
culture plates with daily media feeding at 37 °C with 5% (vol/vol)
CO2.
2. Cells were split every 3–4 days with 0.5 mM EDTA in 1× PBS
for standard maintenance.
3. Immediately before preparing single-cell suspensions for each
experiment, hESCs were individualized by Accutase (Life
Technologies), washed once with E8 medium, and resuspended
at densities of 5.0–8.0 × 10^5 cells/mL in E8 medium
for cell capture.
4. The H1 hESC line is registered in the NIH Human Embryonic
Stem Cell Registry with the Approval Number NIHhESC-10-0043.
5. Details of the H1 cells can be found online
(http://grants.nih.gov/stem_cells/registry/current.htm?id=29).

3.2 Single-Cell Capture and cDNA Library Preparation

1. 5000–8000 cells were loaded onto a medium size (10–17 μm)
C1 Single-Cell Auto Prep IFC (Fluidigm).
2. The capture efficiency was inspected using EVOS FL Auto Cell
Imaging system (Life Technologies) to perform an automated
area scanning of the 96 capture sites on the IFC.
3. Empty capture sites or sites having more than one cell captured
were first noted and those samples were later excluded from
further library processing for RNA-seq.
4. Immediately after capture and imaging, reverse transcription
and cDNA amplification were performed in the C1 system
using the SMARTer PCR cDNA Synthesis kit (Clontech) and
the Advantage 2 PCR kit (Clontech).
5. Full-length, single-cell cDNA libraries were harvested the next
day from the C1 chip and diluted to a range of 0.1–0.3 ng/μL.
6. Diluted single-cell cDNA libraries were fragmented and amplified
using the Nextera XT DNA Sample Preparation Kit and
the Nextera XT DNA Sample Preparation Index Kit (Illumina).
7. Libraries were multiplexed at 24 libraries per lane, and single-end
reads of 67-bp were sequenced on an Illumina HiSeq 2500
system.

3.3 Reads Mapping

1. Use Bowtie [18] to map raw reads against the reference
genes (e.g., human hg19 RefSeq reference), allowing up to
two mismatches and a maximum of 20 multiple hits.
2. The expected read counts and TPMs can then be estimated
by RSEM [19].
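
RSEM reports expected counts and TPMs directly, so nothing below is required to run the protocol. For readers unfamiliar with the unit, the following minimal sketch shows how TPM relates to read counts and transcript lengths. The function name and toy numbers are illustrative assumptions; RSEM itself works with effective lengths and a statistical model for multi-mapping reads, which this sketch does not reproduce.

```python
import numpy as np


def counts_to_tpm(counts, lengths):
    """Convert per-gene read counts to TPM (transcripts per million).

    counts  : reads assigned to each gene (hypothetical input)
    lengths : transcript length of each gene in bases (hypothetical input)
    """
    counts = np.asarray(counts, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    rate = counts / lengths              # reads per base of transcript
    total = rate.sum()
    if total == 0:
        return np.zeros_like(rate)       # no mapped reads at all
    return rate / total * 1e6            # rescale so the values sum to one million


# Toy example with made-up numbers.
toy_counts = [100, 200, 0, 50]
toy_lengths = [1000, 4000, 1500, 500]
tpm = counts_to_tpm(toy_counts, toy_lengths)
```

Because TPM is normalized within each cell, values are comparable across cells sequenced at different depths, which is why the notes below use TPM thresholds such as TPM > 1.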

3.4 Classification of Cells into Gene Expression Outliers and Cells of the Main Population

1. Given a cell, calculate a list of Spearman rank correlations
comparing that cell to each of the other cells in the dataset
(“one-to-others”).
2. Remove that cell and calculate a list of pairwise Spearman
rank correlations among the remaining cells
(“pairwise”).
3. Use a one-sided Wilcoxon signed-rank test to assess whether
the “one-to-others” correlations are significantly lower than the
set of “pairwise” correlations.
4. Repeat the procedure using Pearson product-moment
correlations.
5. Classify cells as either gene expression outliers or cells of the
main population based on the p-values of both tests.
6. In this study, we define gene expression outliers as cells with
p-values less than 0.001 in both the Spearman and Pearson tests.
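
The classification steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the SinQC implementation: the function names and the simulated data are hypothetical, and because the “one-to-others” and “pairwise” lists have different lengths and cannot be paired, the sketch substitutes the unpaired one-sided Wilcoxon rank-sum (Mann-Whitney U) test from SciPy for the signed-rank test named in the text.

```python
import numpy as np
from scipy.stats import mannwhitneyu, rankdata


def cell_correlations(expr, method="spearman"):
    """Cell-by-cell correlation matrix (expr: genes in rows, cells in columns)."""
    expr = np.asarray(expr, dtype=float)
    if method == "spearman":
        # Spearman correlation equals Pearson correlation of per-cell ranks.
        expr = np.apply_along_axis(rankdata, 0, expr)
    return np.corrcoef(expr, rowvar=False)


def outlier_pvalues(expr, method="spearman"):
    """One-sided p-value per cell: are its correlations lower than the rest?"""
    corr = cell_correlations(expr, method)
    n_cells = corr.shape[0]
    upper = np.triu_indices(n_cells - 1, k=1)   # each remaining pair once
    pvals = np.empty(n_cells)
    for c in range(n_cells):
        others = np.delete(np.arange(n_cells), c)
        one_to_others = corr[c, others]
        pairwise = corr[np.ix_(others, others)][upper]
        # Assumption: unpaired rank-sum test, since the two lists differ in length.
        pvals[c] = mannwhitneyu(one_to_others, pairwise,
                                alternative="less").pvalue
    return pvals


def classify_outliers(expr, alpha=0.001):
    """True for cells with p < alpha in both the Spearman and Pearson tests."""
    p_spearman = outlier_pvalues(expr, "spearman")
    p_pearson = outlier_pvalues(expr, "pearson")
    return (p_spearman < alpha) & (p_pearson < alpha)


# Simulated data: 30 cells share one expression profile; cell 0 is unrelated.
rng = np.random.default_rng(0)
profile = rng.gamma(2.0, 50.0, size=200)
toy_expr = profile[:, None] * rng.lognormal(0.0, 0.3, size=(200, 30))
toy_expr[:, 0] = rng.gamma(2.0, 50.0, size=200)
is_outlier = classify_outliers(toy_expr)
```

The pairwise correlations are not independent of one another, so the p-values are best read as a ranking heuristic rather than as calibrated probabilities.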

3.5 Metrics to Evaluate the scRNA-seq Library Quality

1. Total number of mapped reads: the sum of mapped reads for all
genes. An extremely low number of mapped reads may
affect the ability to characterize the transcriptome and could
be due to either a low mapping rate or other technical issues
introduced during sample prep or sequencing.
2. Mapping rate: the total number of mapped reads divided by the
read depth. Mapping rate can be affected by RNA degradation,
contamination with genomic DNA, or other technical issues
introduced during sample prep or sequencing.
3. Reads complexity: the ratio of unique reads (the count of reads
after removing duplicates) to the total number of reads.
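
As a small illustration of the three metrics defined above, the sketch below computes them for one cell from three read totals. The function and argument names are hypothetical, and the sketch assumes that “the total number of all reads” in the complexity ratio means the read depth (all sequenced reads for the cell).

```python
def library_quality_metrics(read_depth, mapped_reads, unique_reads):
    """Return the three library quality metrics for one cell.

    read_depth   : total number of sequenced reads for the cell
    mapped_reads : sum of reads mapped to all genes
    unique_reads : reads remaining after duplicate removal
    """
    if read_depth <= 0:
        raise ValueError("read_depth must be positive")
    return {
        "mapped_reads": mapped_reads,
        "mapping_rate": mapped_reads / read_depth,
        # Assumption: complexity is measured relative to all sequenced reads.
        "reads_complexity": unique_reads / read_depth,
    }


# Toy example with made-up numbers.
toy_metrics = library_quality_metrics(read_depth=1_000_000,
                                      mapped_reads=800_000,
                                      unique_reads=600_000)
```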

3.6 Combining Library Quality Metrics into Combined Scores

1. For each cell, calculate a quantile score (QS) for each quality
metric. Given a metric, the QS of a cell is defined as the number
of other cells in the dataset with equal or lower values divided
by the total number of cells. For example, if a cell has the 20th
highest mapping rate among a set of 80 cells, then (assuming no
ties) 60 other cells have lower values and the mapping rate QS
for this cell is 60/80 = 0.75. A higher QS indicates
better data quality.
2. Minimal Quantile Score (MQS): the minimal QS of the three
quality metrics.

\[
\mathrm{MQS} = \min_{i \in \{\text{mapped reads},\ \text{mapping rate},\ \text{reads complexity}\}} \{ \mathrm{QS}_i \}
\]

MQS assumes that each of the three quality metrics is critical
and that a deficiency in any of the three is a potential indicator of
technical issues. Thus the “final quality” of a cell depends on its
lowest quality metric score.
3. Weighted Combined Quality Score (WCQS): WCQS assumes
that the importance of each quality metric may depend on
specific experimental batches, protocols, and/or conditions,
and that the importance of each quality metric for
detecting technical artifacts is proportional to its ability to
discriminate between gene expression outliers and cells of the
main population. For example, if the mapping rate of a given
batch of cells can perfectly discriminate
between gene expression outliers and cells of the main population,
then it is more likely that the mapping rate is a dominant
player in detecting technical artifacts. In contrast, if a metric
does not indicate differences between gene expression outliers
and cells of the main population, then it should be removed
from the prediction of potential technical artifacts. WCQS calculates
a weighted aggregate quality score for each sample,
defined as:

\[
\mathrm{WCQS} = \frac{\sum_{i=1}^{k} w_i Z_i}{\sqrt{\sum_{i=1}^{k} w_i^{2}}}
\]

where $k$ is the number of quality metrics and $Z_i$ is the transformed
Z-score of QS for data quality metric $i$,
according to $Z_i = \Phi^{-1}(1 - P_i)$, where $\Phi$ is the standard normal
cumulative distribution function and $P_i$ is the probability that
quality metric $i$ is lower in a given cell than in the rest of the cells.
We estimate $P_i$ as $P_i = 1 - \mathrm{QS}_i$. To avoid numerical error, we can
set the maximal and minimal $Z_i$ to $+8.5$ and $-8.5$, which correspond to
$P_i < 10^{-16}$ and $P_i > 1 - 10^{-16}$, respectively. The weighting
factor $w_i$ for data quality metric $i$ is estimated according to the
individual quality metric’s ability to discriminate between cells of
the main population and gene expression outliers as

\[
w_i =
\begin{cases}
\dfrac{\mathrm{AUC}_i - 0.5}{0.5} & (\mathrm{AUC}_i > 0.5) \\
0 & (\mathrm{AUC}_i \le 0.5)
\end{cases}
\]

where $\mathrm{AUC}_i$ is the area under the receiver
operating characteristic (ROC) curve for quality metric $i$. If a
quality metric $i$ (e.g., mapping rate) can perfectly discriminate
between cells of the main population and gene expression outliers,
then $\mathrm{AUC}_i = 1$ and thus $w_i = 1$. If the values of a quality metric $i$ are
randomly distributed in cells of the main population and gene
expression outliers, then the expected $\mathrm{AUC}_i = 0.5$ and thus $w_i = 0$.
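
The scores defined in this subsection can be sketched as follows. This is a minimal illustration rather than the SinQC code: the function names and toy values are hypothetical, and the AUC is computed through its rank-based (Mann-Whitney) form under the assumption that cells of the main population are the "positive" class, so that higher metric values in the main population give AUC > 0.5.

```python
import numpy as np
from scipy.stats import norm, rankdata


def quantile_scores(values):
    """QS per cell: number of other cells with equal or lower values / total cells."""
    values = np.asarray(values, dtype=float)
    # method="max" counts cells with value <= this one, including the cell itself.
    return (rankdata(values, method="max") - 1) / values.size


def auc_main_vs_outliers(values, is_outlier):
    """AUC for separating main-population cells (higher values) from outliers."""
    values = np.asarray(values, dtype=float)
    is_outlier = np.asarray(is_outlier, dtype=bool)
    ranks = rankdata(values)                       # average ranks handle ties
    n_main, n_out = (~is_outlier).sum(), is_outlier.sum()
    u = ranks[~is_outlier].sum() - n_main * (n_main + 1) / 2
    return u / (n_main * n_out)


def auc_weight(auc):
    """w_i = (AUC_i - 0.5) / 0.5 if AUC_i > 0.5, otherwise 0."""
    return (auc - 0.5) / 0.5 if auc > 0.5 else 0.0


def mqs_wcqs(metrics, is_outlier, z_cap=8.5):
    """metrics: dict mapping metric name -> per-cell values. Returns MQS, WCQS, weights."""
    names = list(metrics)
    qs = np.column_stack([quantile_scores(metrics[m]) for m in names])
    mqs = qs.min(axis=1)
    # Z_i = Phi^{-1}(1 - P_i) with P_i = 1 - QS_i, i.e. Phi^{-1}(QS_i), capped at +/- z_cap.
    z = np.clip(norm.ppf(qs), -z_cap, z_cap)
    w = np.array([auc_weight(auc_main_vs_outliers(metrics[m], is_outlier))
                  for m in names])
    denom = np.sqrt((w ** 2).sum())
    wcqs = z @ w / denom if denom > 0 else np.zeros(len(mqs))
    return mqs, wcqs, w


# Toy example: three main-population cells followed by two expression outliers.
toy = {
    "mapped_reads": [9e5, 8e5, 7e5, 1e5, 2e5],
    "mapping_rate": [0.90, 0.80, 0.85, 0.20, 0.30],
    "reads_complexity": [0.70, 0.60, 0.65, 0.66, 0.50],
}
toy_outlier = [False, False, False, True, True]
mqs, wcqs, weights = mqs_wcqs(toy, toy_outlier)
```

In the toy data, mapped reads and mapping rate separate the two groups perfectly and receive weight 1, whereas reads complexity separates them only partially and is down-weighted.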

3.7 Identification of Technical Artifacts

1. We assume that good quality cells should pass particular MQS
and WCQS cutoffs. We use cells of the main population as
controls to determine these cutoffs (see Note 4). Enumerate
all possible pairs of MQS and
WCQS cutoffs in a given dataset, calculate the fraction of cells
of the main population that pass both cutoffs of a pair, and then
use the remaining cells of the main population (those that fail)
to estimate the corresponding false positive rate (FPR) for that
pair (Fig. 1).
2. If more than one pair of MQS and WCQS cutoffs results in the
same FPR, choose the cutoff pair that maximizes the
percentage of gene expression outliers failing to pass.
3. Apply these cutoffs to the gene expression outliers to identify
technical artifacts. Technical artifacts are defined as gene
expression outliers with poor data quality measurements.
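
The cutoff search can be sketched as below. The sketch makes several assumptions that the text does not specify: candidate cutoffs are the observed score values, a cell passes when its score is greater than or equal to the cutoff, the user supplies a maximum acceptable FPR (5% here, following Fig. 1), and among acceptable pairs the one failing the most outliers is chosen, with ties broken by the lower FPR. All names and toy values are hypothetical.

```python
import numpy as np


def choose_cutoffs(mqs, wcqs, is_outlier, max_fpr=0.05):
    """Pick an (MQS, WCQS) cutoff pair using main-population cells as controls."""
    mqs = np.asarray(mqs, dtype=float)
    wcqs = np.asarray(wcqs, dtype=float)
    is_outlier = np.asarray(is_outlier, dtype=bool)
    main = ~is_outlier
    best = None
    for m_cut in np.unique(mqs):
        for w_cut in np.unique(wcqs):
            failed = (mqs < m_cut) | (wcqs < w_cut)
            fpr = failed[main].mean()              # main-population cells that fail
            if fpr > max_fpr + 1e-12:
                continue
            out_fail = failed[is_outlier].mean()   # outliers that fail
            key = (out_fail, -fpr)
            if best is None or key > best["key"]:
                best = {"key": key, "mqs_cutoff": m_cut, "wcqs_cutoff": w_cut,
                        "fpr": fpr, "outlier_fail_fraction": out_fail,
                        "technical_artifact": failed & is_outlier}
    return best


# Toy scores: 20 main-population cells followed by 4 gene expression outliers,
# two of which have clearly poor quality scores.
toy_mqs = np.concatenate([np.linspace(0.30, 0.95, 20), [0.01, 0.05, 0.60, 0.70]])
toy_wcqs = np.concatenate([np.linspace(-0.5, 2.0, 20), [-6.0, -4.0, 0.5, 1.0]])
toy_is_outlier = np.array([False] * 20 + [True] * 4)
result = choose_cutoffs(toy_mqs, toy_wcqs, toy_is_outlier)
artifact_indices = np.flatnonzero(result["technical_artifact"])
```

In this toy case the two low-quality outliers are flagged as technical artifacts, while the two outliers with good quality scores are retained as candidate subpopulation cells.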

3.8 SinQC Software

1. SinQC [15] is designed to implement the procedures in Subheadings
3.3–3.6 (see Note 5).
2. The SinQC software and a detailed user manual are available at
http://www.morgridge.net/SinQC.html

4 Notes

1. Several studies have used housekeeping genes to perform quality
control for scRNA-seq datasets [10, 11]. To further investigate
the feasibility of this approach, we calculated the gene expression
levels (TPMs) of two housekeeping genes (Actb and Gapdh) in a
mouse scRNA-seq dataset [20]. Gapdh is expressed at significantly
higher levels in ES cells than in MEF cells (P = 5.6e–06, one-sided
Wilcoxon rank sum test), whereas Actb is expressed at significantly
lower levels in ES cells than in MEF cells (P < 2.2e–16, one-sided
Wilcoxon rank sum test) [15]. This suggests that it is infeasible to use
housekeeping genes to perform QC for scRNA-seq datasets.
2. Using median gene expression values or the number of genes
detected (TPM > 1) to perform quality control (QC): Low
data quality (e.g., low mapping rate) can result in fewer
genes detected or low median gene expression values. However,
the number of genes detected (TPM > 1) can also reflect
biology: it varies
depending on the quality of a particular library and on cell type
[8]. We calculated the number of genes detected in a highly
heterogeneous scRNA-seq dataset containing 301 cells (a mixture
of 11 different cell types) [4]. The number of genes
detected is highly cell type dependent, suggesting that using the
number of genes detected to identify technical artifacts will
result in substantial bias ([15], Fig. S8). For highly heterogeneous
scRNA-seq datasets, the technical artifacts detected by
this method are more likely to have fewer genes detected than
cells that pass QC. But this does not mean that
cells with fewer genes detected are technical artifacts.
3. Using “genes detected and/or mapping rate” to perform quality
control (QC): The basic idea of using “genes detected
and/or mapping rate” [14] to perform QC is that a low
number of genes detected could be due to either technical issues
or biological heterogeneity. But if a cell with fewer genes
detected also has a low mapping rate (mapping
rate is a technical property), the cell is more likely to be a
technical artifact. This approach is the most conceptually similar
to our method. However, our method has two advantages.
First, since the mapping rate and the number of genes
detected are not directly correlated, choosing cutoffs for the
mapping rate and for the number of genes detected is
difficult and arbitrary. Our method maximizes the probability
that technical artifacts are correctly detected while also
minimizing false positives by using cells of the main population
as data quality controls. Second, in addition to mapping
rate, our method also takes other library quality metrics into
consideration (e.g., library complexity).
4. Our method assumes that gene expression outliers contain
both technical artifacts and biological variant cells, but cells of
the main population, in general, are more likely to contain
good quality cells. Thus, our method uses cells of the main
population as controls to estimate data quality score cutoffs and
a corresponding false positive rate (FPR). However, for a given
FPR, it is a challenge to estimate the corresponding false negative
rate (technical artifacts that are missed), because
scRNA-seq has no “ground truth” for “bad samples.” Sensitivity
(also called the true positive rate) is the proportion of
positives (“technical artifacts”) that are correctly identified.
Specificity (also called the true negative rate) measures the
proportion of negatives (“good quality single cells”) that are
correctly identified. Since scRNA-seq has no “ground truth”
for “good samples” and “bad samples,” it is a challenge to
estimate these two measurements directly. To further compare
the sensitivity and specificity of our method in high-heterogeneity
and low-heterogeneity datasets, we applied our
method to datasets with mixtures of different proportions of cell
types and compared the overlap of the technical artifacts detected
among them. For example, using a mouse scRNA-seq dataset
(48 ES cells and 44 MEF cells) [20], we mixed the cells into
three categories: high heterogeneity (48 ES cells + 44
MEF cells), medium heterogeneity (“ES cells (all) + 1/5
(MEF) cells” and “MEF cells (all) + 1/5 (ES) cells”), and
low heterogeneity (48 ES cells and 44 MEF cells,
separately). Our method detects two technical artifacts (ESC_46
and ESC_32) in the high-heterogeneity dataset (48 ES cells
+ 44 MEF cells). These two technical artifacts can also be
robustly detected in either the medium-heterogeneity or the
low-heterogeneity datasets. However, if we apply our method to
the ES (48 cells) or MEF (44 cells) dataset separately,
we detect more artifacts than when applying our
method to the pooled mixture dataset (48 ES cells + 44 MEF
cells). Therefore, we conclude that our method increases specificity
at the cost of decreasing sensitivity when the extent of
heterogeneity in a dataset is high. In highly heterogeneous cell
populations, detecting technical artifacts carries a higher risk of
dropping cells with real biological variation. The increased specificity
and decreased sensitivity of our method for highly heterogeneous
cell populations is therefore a useful feature that minimizes
false positives.
5. Running SinQC for scRNA-seq QC is not restricted to RSEM
output files (“*.genes.results”). Users who do not use RSEM can
create customized files in the RSEM format (“*.genes.results”)
to run SinQC. A detailed manual can be found on the SinQC
website (http://www.morgridge.net/SinQC.html).
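
The sensitivity, specificity, and FPR definitions in Note 4 can be made concrete with a minimal sketch in R. Because scRNA-seq has no ground truth, the labels below are hypothetical (for instance, they could come from a simulation); `truth`, `called`, and `qc_rates` are illustrative names and are not part of SinQC.

```r
# Hypothetical labels: TRUE = technical artifact, FALSE = good-quality cell.
truth  <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)
called <- c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)

qc_rates <- function(truth, called) {
  tp <- sum(truth & called)    # artifacts correctly flagged
  fn <- sum(truth & !called)   # artifacts missed
  tn <- sum(!truth & !called)  # good cells correctly kept
  fp <- sum(!truth & called)   # good cells wrongly flagged
  c(sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    fpr         = fp / (fp + tn))
}

rates <- qc_rates(truth, called)
print(round(rates, 3))
```

By construction, specificity and FPR sum to one, which is why fixing an FPR on the control population also fixes specificity, while sensitivity stays unknown without labeled artifacts.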

References
1. Eberwine J, Sul J-Y, Bartfai T, Kim J (2014) The promise of single-cell sequencing. Nat Methods 11(1):25–27
2. Trapnell C, Cacchiarelli D, Grimsby J, Pokharel P, Li S, Morse M et al (2014) The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat Biotechnol 32(4):381–386. Epub 2014/03/25. https://doi.org/10.1038/nbt.2859
3. Xue Z, Huang K, Cai C, Cai L, Jiang CY, Feng Y et al (2013) Genetic programs in human and mouse early embryos revealed by single-cell RNA sequencing. Nature 500(7464):593–597. Epub 2013/07/31. https://doi.org/10.1038/nature12364
4. Pollen AA, Nowakowski TJ, Shuga J, Wang X, Leyrat AA, Lui JH et al (2014) Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex. Nat Biotechnol 32(10):1053–1058. Epub 2014/08/05. https://doi.org/10.1038/nbt.2967
5. Patel AP, Tirosh I, Trombetta JJ, Shalek AK, Gillespie SM, Wakimoto H et al (2014) Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science 344(6190):1396–1401. Epub 2014/06/14. https://doi.org/10.1126/science.1254257
6. Sandberg R (2014) Entering the era of single-cell transcriptomics in biology and medicine. Nat Methods 11(1):22–24
7. Stegle O, Teichmann SA, Marioni JC (2015) Computational and analytical challenges in single-cell transcriptomics. Nat Rev Genet 16(3):133–145
8. Kharchenko PV, Silberstein L, Scadden DT (2014) Bayesian approach to single-cell differential expression analysis. Nat Methods 11(7):740–742
9. Munsky B, Neuert G, van Oudenaarden A (2012) Using gene expression noise to understand gene regulation. Science 336(6078):183–187
10. Ting DT, Wittner BS, Ligorio M, Vincent Jordan N, Shah AM, Miyamoto DT et al (2014) Single-cell RNA sequencing identifies extracellular matrix gene expression by pancreatic circulating tumor cells. Cell Rep 8(6):1905–1918. Epub 2014/09/23. https://doi.org/10.1016/j.celrep.2014.08.029
11. Treutlein B, Brownfield DG, Wu AR, Neff NF, Mantalas GL, Espinoza FH et al (2014) Reconstructing lineage hierarchies of the distal lung epithelium using single-cell RNA-seq. Nature 509(7500):371–375. Epub 2014/04/18. https://doi.org/10.1038/nature13173
12. Oyolu C, Zakharia F, Baker J (2012) Distinguishing human cell types based on housekeeping gene signatures. Stem Cells 30(3):580–584
13. Zeisel A, Muñoz-Manchado AB, Codeluppi S, Lönnerberg P, La Manno G, Juréus A et al (2015) Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 347(6226):1138–1142
14. Kumar RM, Cahan P, Shalek AK, Satija R, DaleyKeyser AJ, Li H et al (2014) Deconstructing transcriptional heterogeneity in pluripotent stem cells. Nature 516(7529):56–61. Epub 2014/12/05. https://doi.org/10.1038/nature13920
15. Jiang P, Thomson JA, Stewart R (2016) Quality control of single-cell RNA-seq by SinQC. Bioinformatics 32(16):2514–2516. https://doi.org/10.1093/bioinformatics/btw176
16. Leng N, Chu LF, Barry C, Li Y, Choi J, Li X et al (2015) Oscope identifies oscillatory genes in unsynchronized single-cell RNA-seq experiments. Nat Methods 12(10):947–950. https://doi.org/10.1038/nmeth.3549. PubMed PMID: 26301841; PubMed Central PMCID: PMC4589503
17. Chen G, Gulbranson DR, Hou Z, Bolin JM, Ruotti V, Probasco MD et al (2011) Chemically defined conditions for human iPSC derivation and culture. Nat Methods 8(5):424–429. https://doi.org/10.1038/nmeth.1593
18. Langmead B, Trapnell C, Pop M, Salzberg SL (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10(3):R25. https://doi.org/10.1186/gb-2009-10-3-r25
19. Li B, Dewey CN (2011) RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics 12:323. https://doi.org/10.1186/1471-2105-12-323
20. Islam S, Kjällquist U, Moliner A, Zajac P, Fan J-B, Lönnerberg P et al (2011) Characterization of the single-cell transcriptional landscape by highly multiplex RNA-seq. Genome Res 21(7):1160–1167
Chapter 2

Normalization for Single-Cell RNA-Seq Data Analysis

Rhonda Bacher

Abstract
In this chapter, we describe a robust normalization method for single-cell RNA sequencing data. The
procedure, SCnorm, is implemented in R and is part of Bioconductor. Also included in the package are
diagnostic functions to visualize normalization performance. This chapter provides an overview of the
methodology and provides example work-flows.

Key words Single-cell RNA-seq, Normalization, Gene expression, Read count, High-throughput
sequencing

1 Introduction

The purpose of normalization is to remove the effects of systematic
technical variability on the observed gene expression measurements.
Popular RNA-seq analyses, such as sample clustering or
classification and differential gene expression, require
between-sample normalization as a first step to ensure that
measurements are comparable across samples. Variation in sequencing
depth (the total number of reads obtained for each sample) is one
particular artifact that arises during the RNA-seq experiment. As
samples are sequenced more deeply, all gene counts should, on
average, increase proportionally [1]. We call this dependence of the
read counts on sequencing depth the count-depth relationship [2].
Most existing normalization methods developed for bulk
RNA-seq experiments calculate global scale factors to adjust each
sample for sequencing depth (one scale factor per sample is applied
to all genes in the sample) [1, 3–5]. These methods work well when
the count-depth relationship is common across genes, as it is in
bulk RNA-seq data (Fig. 1a). When global scale-factor normaliza-
tion [3] is effective, the resulting normalized count-depth relation-
ships are near zero for most genes (Fig. 1b). Conversely, in single-
cell RNA-seq data (scRNA-seq) genes exhibit variability in the
count-depth relationship (Fig. 1c). In this case, when the count-

Guo-Cheng Yuan (ed.), Computational Methods for Single-Cell Data Analysis, Methods in Molecular Biology, vol. 1935,
https://doi.org/10.1007/978-1-4939-9057-3_2, © Springer Science+Business Media, LLC, part of Springer Nature 2019


[Figure 1 graphic. Panels A (bulk, un-normalized), B (bulk, normalized), C (single-cell, un-normalized), and D (single-cell, normalized). Each panel plots log expression against log sequencing depth (left) and the density of per-gene slopes by expression group, from low to high (right).]

Fig. 1 For each gene, median quantile regression was used to estimate the count–depth relationship for a bulk
and single-cell RNA-seq dataset before and after normalization. (a) The left plot shows log un-normalized
expression versus log depth with estimated regression fits for three genes in a bulk RNA-seq dataset
containing no zero measurements and having low, moderate, and high expression defined as the median
expression among nonzero un-normalized measurements in the 10th to 20th quantile (blue), 40th to 50th
quantile (black), and 80th to 90th quantile (red), respectively. On the right, the estimated regression fits for all
genes within ten equally sized gene groups where genes were grouped by their median expression among
nonzero un-normalized measurements. (b) Similar to a for normalized bulk RNA-seq data. (c and d) Similar to
a and b for single-cell RNA-seq data

depth relationship is not common across genes, the application of
traditional global scale factor normalization methods leads to
biased counts (Fig. 1d).
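
To make the idea of a global scale factor concrete, the following is a minimal sketch in R on a toy matrix in which every gene scales exactly with sequencing depth (the bulk-like situation). It uses the simplest total-count factor and an ordinary least-squares slope, which are simplifying assumptions for illustration and not the specific estimators of [1, 3–5]; all object names are hypothetical.

```r
# Toy count matrix: 3 genes (rows) x 4 cells (columns); values are illustrative.
counts <- matrix(c(10, 20, 40, 80,
                    5, 10, 20, 40,
                    1,  2,  4,  8),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:3), paste0("cell", 1:4)))

depth <- colSums(counts)              # sequencing depth per cell
scale_factor <- depth / mean(depth)   # one global factor per cell

# Divide every gene in a cell by that cell's factor.
normalized <- sweep(counts, 2, scale_factor, "/")

# Least-squares slope of log expression on log depth for one gene.
slope <- function(y, x) unname(coef(lm(log(y) ~ log(x)))[2])
before <- apply(counts, 1, slope, x = depth)
after  <- apply(normalized, 1, slope, x = depth)
print(rbind(before, after))
```

Here every gene has a slope of one before normalization, so a single factor per cell removes the dependence for all genes at once. When slopes differ between genes, as in Fig. 1c, no single factor per cell can bring all of them to zero.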
Our method, SCnorm [2], addresses the variability in the
count-depth relationship in scRNA-seq data. SCnorm uses quantile
regression to estimate the count-depth relationship for each gene.
Genes with similar relationships are grouped together, and a second
quantile regression estimates scaling factors within each group. In
this chapter, we will further describe the SCnorm method and its
software implementation.

2 Materials

Our method for normalizing single-cell RNA-seq data is implemented
in the R package SCnorm and is available on Bioconductor:
https://bioconductor.org/packages/devel/bioc/html/SCnorm.html
The package is compatible with R versions greater than 3.4.0.
The package may be installed directly from Bioconductor:

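source("https://bioconductor.org/biocLite.R")  # Bioconductor 3.7 and earlier
biocLite("SCnorm")
# Bioconductor 3.8 and later replaced biocLite() with BiocManager:
# install.packages("BiocManager"); BiocManager::install("SCnorm")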

3 Methods

We describe our approach step-by-step as follows:

1. Filter genes. Filtering of genes with very low average expression
is a common preprocessing step in RNA-seq analyses. As
detailed in Note 1, we require genes to have at least ten
nonzero expression counts. Filtered genes will not be included
in the normalization.
2. Estimate count-depth relationship per gene. Let $Y_{g,j}$ denote the
log nonzero expression for gene $g$ in cell $j$, for $g = 1, \ldots, m$ and
$j = 1, \ldots, n$. Let $X_j$ denote the log sequencing depth, calculated
as $\log\left(\sum_{g=1}^{m} e^{Y_{g,j}}\right)$. A median quantile regression is fit to
estimate each gene’s relationship between log un-normalized
expression and log sequencing depth as

$$Q^{0.5}\left(Y_{g,j} \mid X_j\right) = \beta_{g,0} + \beta_{g,1} X_j$$

The estimated slope, $\hat{\beta}_{g,1}$, represents the count-depth
relationship for gene $g$.
3. Group genes. Genes are clustered into $K$ groups based on their
$\hat{\beta}_{g,1}$ using the K-medoids algorithm [6] via the clara function
in the R package cluster [7]. If a cluster contains fewer than
100 genes, it is joined to the nearest cluster based on the cluster
centers (medoids). Starting at $K = 1$, SCnorm increases
$K$ to $K + 1$ until a convergence criterion is satisfied (described
below under Evaluating K).
4. Group fit. Within a homogeneous gene group, a representative
gene could be selected and the scale factors calculated as the
ratio of its counts in each sample to the gene’s overall mean.
However, due to numerous zeros in the data and high levels of
variability [8], a representative gene for the group must be
estimated. We estimate a representative gene as the predicted
values from a quantile regression between the log nonzero
un-normalized expression counts from all genes in the group
and log sequencing depth. For computational reasons, we
restrict the number of genes considered (see Note 1) to the
25% whose $\hat{\beta}_{g,1}$ is closest to the group’s overall count-depth
relationship, estimated as $\mathrm{mode}_g\, \hat{\beta}_{g,1}$.
When fitting the quantile regression, the median may not
always best represent subtle effects of the count-depth relationship
for genes in a particular group; therefore, we consider
multiple quantiles $\tau$ and degrees $d$ and fit:

$$Q^{\tau_k, d_k}\left(Y_j \mid X_j\right) = \beta_0^{\tau_k} + \beta_1^{\tau_k} X_j + \ldots + \beta_{d_k}^{\tau_k} X_j^{d_k}$$

We fit over a grid of combinations of $\tau$ and $d$ from
$\tau \in (0.05, 0.10, \ldots, 0.85, 0.90)$ and $d \in (1, 2, \ldots, 6)$. For each model
fit, the predicted values, $\hat{Y}_j^{\tau_k, d_k}$, can be viewed as a representative
gene. To evaluate which specific values of $\tau_k$ and $d_k$, denoted $\tau_k^*$ and
$d_k^*$, are optimal, we choose the model in which the predicted gene’s
count-depth relationship is closest to the group’s mode. Specifically, the
predicted gene’s count-depth relationship, $\hat{\eta}_1^{\tau_k, d_k}$, is estimated using
median quantile regression:

$$Q^{0.5}\left(\hat{Y}_j^{\tau_k, d_k} \mid X_j\right) = \eta_0^{\tau_k, d_k} + \eta_1^{\tau_k, d_k} X_j$$

The selected model is then the $\tau_k$ and $d_k$ which minimize

$$\left| \hat{\eta}_1^{\tau_k, d_k} - \mathrm{mode}_g\, \hat{\beta}_{g,1} \right|$$

Once the best model is selected, the scale factors are estimated as

$$SF_{j,k} = \frac{e^{\hat{Y}_j^{\tau_k^*, d_k^*}}}{e^{Y^{\tau_k^*}}}$$

where $Y^{\tau_k^*}$ is the $\tau_k^*$th quantile of expression counts in the $k$th
group. The normalized counts $Y'_{g,j}$ are given by

$$Y'_{g,j} = \frac{e^{Y_{g,j}}}{SF_{j,k}}$$
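
Steps 2 and 4 can be sketched in base R on simulated data. This is a simplified illustration under several assumptions, not the SCnorm implementation: there are no zeros, a single gene group is used (K = 1), only τ = 0.5 and d = 1 are tried, and the median regression is obtained by numerically minimizing the absolute-error loss with `optim` rather than with a dedicated quantile-regression routine. The names `median_fit`, `cell_effect`, and `gene_level` are hypothetical.

```r
set.seed(1)
n_cells <- 60
n_genes <- 40
cell_effect <- rnorm(n_cells, sd = 0.5)   # simulated depth effect (log scale)
gene_level  <- runif(n_genes, 2, 6)       # simulated baseline log expression
Y <- outer(gene_level, cell_effect, "+") +
  matrix(rnorm(n_genes * n_cells, sd = 0.2), n_genes, n_cells)
X <- log(colSums(exp(Y)))                 # log sequencing depth per cell

# Median (tau = 0.5) regression by minimizing the absolute-error loss.
median_fit <- function(y, x) {
  start <- coef(lm(y ~ x))                # least-squares starting values
  loss  <- function(b) sum(abs(y - b[1] - b[2] * x))
  unname(optim(start, loss)$par)          # c(intercept, slope)
}

# Step 2: one count-depth slope per gene.
slopes <- apply(Y, 1, function(y) median_fit(y, X)[2])

# Step 4, simplified: pool all genes into one group and fit one line.
fit    <- median_fit(as.vector(Y), rep(X, each = n_genes))
Y_hat  <- fit[1] + fit[2] * X             # predicted "representative gene"
SF     <- exp(Y_hat) / exp(median(Y))     # scale factor per cell
Y_norm <- sweep(exp(Y), 2, SF, "/")       # normalized counts

slopes_after <- apply(log(Y_norm), 1, function(y) median_fit(y, X)[2])
print(c(before = median(slopes), after = median(slopes_after)))
```

In this simulation all genes share a slope near one, so one group suffices and the slopes after normalization center near zero. With real scRNA-seq data the slopes differ across genes, which is what the grouping in step 3 and the search over τ and d are designed to handle.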
Evaluating K. To determine if a given $K$ is sufficient, we
evaluate the count-depth relationship of the normalized data. For
every gene, its log normalized counts are regressed against the
original log sequencing depth in a median quantile regression:

$$Q^{0.5}\left(Y'_{g,j} \mid X_j\right) = \beta'_{g,0} + \beta'_{g,1} X_j$$

We check that the distribution of $\beta'_{g,1}$ is centered around zero
for genes across all expression levels. To do so, we group genes into
ten equally sized groups based on their median nonzero expression.
For each group, the mode of the slopes is estimated. Any mode
outside of (−0.1, 0.1) is evidence that the current $K$ is not
sufficient, in which case $K$ is increased by one and the group fit and
evaluation steps are repeated (see Note 2 for an example).
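
The evaluation step can be sketched in base R as follows. The inputs are simulated, and two details are assumptions made for illustration: the mode of each group is taken as the peak of a kernel density estimate, and ties in expression are broken by order so that the ten groups have equal size. The names `slopes_norm`, `median_expr`, and `density_mode` are hypothetical.

```r
# Illustrative inputs: post-normalization slopes and median nonzero
# expression for each gene (both simulated here).
set.seed(2)
n_genes <- 1000
median_expr <- rexp(n_genes)
slopes_norm <- rnorm(n_genes, mean = 0, sd = 0.3)

# Ten equally sized expression groups, assigned by rank.
group <- ceiling(10 * rank(median_expr, ties.method = "first") / n_genes)

# Mode of a continuous sample, taken as the peak of a kernel density.
density_mode <- function(x) {
  d <- density(x)
  d$x[which.max(d$y)]
}

modes  <- tapply(slopes_norm, group, density_mode)
thresh <- 0.1                               # same role as the Thresh argument
k_is_sufficient <- all(abs(modes) < thresh) # FALSE means: try K + 1
print(round(modes, 3))
print(k_is_sufficient)
```

Checking every expression group separately, rather than the pooled distribution, is what catches a normalization that works for highly expressed genes but leaves a residual slope among lowly expressed ones.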
Multiple Conditions. When multiple conditions or batches are
present, SCnorm utilizes the procedure above for each condition
and then rescales across all cells. During rescaling, all genes are split
into quartiles based on their nonzero median un-normalized
expression measurements. Within each quartile and condition,
each gene is adjusted by a common scale factor defined as the
median of the gene specific fold-changes, where fold-changes are
calculated between each gene’s condition-specific mean and its
mean across conditions. Means are calculated over nonzero counts
(see Note 3 for additional details). If spike-ins are available and are
representative of the biological genes, they may be used in this step
to estimate the across condition scaling factors (additional details
are in Note 4).
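
The across-condition rescaling described above can be sketched in base R. This is a simplified illustration on simulated counts, assuming two conditions that differ only by a twofold scale; it follows the verbal description (quartiles by nonzero median, fold-change of the condition-specific nonzero mean to the overall nonzero mean, median fold-change as the common factor) and is not the SCnorm source code. All object names are hypothetical.

```r
set.seed(3)
n_genes <- 200
n_cells <- 20                                  # cells per condition
cond <- rep(c("A", "B"), each = n_cells)
base <- rexp(n_genes, rate = 1 / 50) + 1       # simulated gene-level means
# Condition B is simulated at twice the scale of condition A.
counts <- cbind(matrix(rpois(n_genes * n_cells, base), n_genes),
                matrix(rpois(n_genes * n_cells, 2 * base), n_genes))

nonzero_mean <- function(x) mean(x[x > 0])
gene_median  <- apply(counts, 1, function(x) median(x[x > 0]))
quartile     <- ceiling(4 * rank(gene_median, ties.method = "first") / n_genes)
overall_mean <- apply(counts, 1, nonzero_mean)

rescaled <- counts * 1.0                       # numeric copy to fill in
for (cn in unique(cond)) {
  cells <- which(cond == cn)
  # Gene-specific fold-change: condition mean over mean across conditions.
  fold <- apply(counts[, cells], 1, nonzero_mean) / overall_mean
  for (q in 1:4) {
    genes <- which(quartile == q)
    sf <- median(fold[genes], na.rm = TRUE)    # common factor per quartile and condition
    rescaled[genes, cells] <- counts[genes, cells] / sf
  }
}

ratio_before <- mean(counts[, cond == "B"]) / mean(counts[, cond == "A"])
ratio_after  <- mean(rescaled[, cond == "B"]) / mean(rescaled[, cond == "A"])
print(c(before = ratio_before, after = ratio_after))
```

Using the median fold-change within a quartile, instead of each gene's own fold-change, removes the shared scale difference between conditions while leaving gene-specific differences in place for downstream testing.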

4 Notes

Note 1 reviews all arguments for running the main functions in the
SCnorm package. Note 2 demonstrates a standard workflow for
normalizing scRNA-seq data and evaluating the normalization.
Note 3 reviews important considerations for normalization when
multiple conditions are present. Note 4 details how to use spike-ins
for across condition scaling.
1. There are two main functions accessible by the user in the
SCnorm package:
SCnorm—implements the normalization procedure.
plotCountDepth—estimates and graphically displays the
count-depth relationships.
The SCnorm function requires only two arguments: the
un-normalized expression matrix (Data) and a vector denoting
the condition or batch each cell belongs to (Conditions). The
Data argument should contain un-normalized expression measure-
ments with genes (or other features) on the rows and cells on the
columns. The expected format of Data is a data matrix in R or of
class SummarizedExperiment of the SummarizedExperiment
R package [9]. The Conditions argument should be a vector with
length equal to the number of cells and match the exact order of the
columns of the Data argument.
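
As a minimal sketch of the expected input layout, the toy example below builds a count matrix and a matching Conditions vector and checks them before normalization. The helper `check_inputs` is hypothetical and is not part of the SCnorm package; the values are illustrative.

```r
# Toy un-normalized counts: genes on rows, cells on columns.
Data <- matrix(c(0, 3, 12, 7,
                 5, 0,  9, 2,
                 1, 1,  0, 4),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("geneA", "geneB", "geneC"),
                               c("cell1", "cell2", "cell3", "cell4")))

# One label per column of Data, in the same order as the columns.
Conditions <- c("batch1", "batch1", "batch2", "batch2")

# Hypothetical helper: stops with an error if the layout is inconsistent.
check_inputs <- function(Data, Conditions) {
  stopifnot(is.matrix(Data), is.numeric(Data),
            !is.null(rownames(Data)), !is.null(colnames(Data)),
            length(Conditions) == ncol(Data),
            all(Data >= 0))
  invisible(TRUE)
}

ok <- check_inputs(Data, Conditions)
```

A check of this kind is cheap insurance against the most common mistake, a Conditions vector whose order no longer matches the columns after cells have been filtered or reordered.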
The SCnorm function will implement the entire normalization
procedure described above, automatically iterating until an optimal
K is reached. A dataset with 100 cells and 20,000 genes will take
approximately 15 min to run with three computing cores. The
computation time will increase as the number of cells and genes
increases, though increasing the number of cores can be used to
offset the increased time. In the example given, increasing to seven
cores reduces the time to around 8 min.
The output of SCnorm is a SummarizedExperiment object
containing at minimum the normalized data (NormalizedData)
and a list of the genes not included in the normalization (Genes-
FilteredOut). Additional outputs may be generated using the
non-default options and are described in more detail below.
The full SCnorm function with default arguments is:

SCnorm(Data = NULL, Conditions = NULL, PrintProgressPlots = FALSE,
       reportSF = FALSE, FilterCellNum = 10, FilterExpression = 0,
       Thresh = 0.1, K = NULL, NCores = NULL, ditherCounts = FALSE,
       PropToUse = 0.25, Tau = 0.5, withinSample = NULL,
       useZerosToScale = FALSE, useSpikes = FALSE)

Additional options the user may specify are:

PrintProgressPlots: If set to TRUE, SCnorm will automatically
produce count-depth plots evaluating each value of
K attempted. It is highly recommended to set this option to
TRUE and wrap the SCnorm function call in a pdf (or other
graphics) device. Although SCnorm determines the optimal
K internally, viewing these plots is one way to evaluate the results
of the normalization.
reportSF: If set to TRUE, SCnorm will output the scaling
factors calculated for each group. This may be useful if users plan
to conduct downstream differential expression (DE) tests, where
the matrix of scale factors may be provided to functions in the
EBSeq [10] and DESeq2 [11] packages which require the
un-normalized data as input. For scRNA-seq DE tools, such as
MAST [12] and scDD [13], the normalized counts may be used
directly.
FilterExpression: A single numeric value denoting the
cutoff used to exclude genes from the normalization based on
average expression. Occasionally (but rarely) very lowly expressed
genes with many tied counts may hinder the convergence of
SCnorm and need to be filtered from the normalization procedure.
FilterCellNum: The minimum number of nonzero counts
required for a gene to be included in the normalization. This value
must be larger than or equal to ten. Since SCnorm fits a median
quantile regression for each gene, we require a minimum of ten
nonzero expression values for each gene by default. This is usually
sufficient in practice, although the user may wish to filter further
and keep only genes that are expressed in at least some proportion
of cells. In that case, the proportion should be converted to an
integer number of cells for the specific experiment to be normalized.
Tau: This option likely does not need to be adjusted by the
user. It specifies the quantile used in the quantile regression.
We recommend using the median (Tau = 0.5).
PropToUse: During the group fitting, only a subset of genes is
used in order to speed up computation time. We recommend using
25% of genes in the group. However, if the data are summarized by
transcripts rather than genes, additional reduction in computation
time may be obtained by setting this to 10%.
Thresh: During the evaluation procedure, SCnorm will calcu-
late the distance of the normalized slope modes from zero. The
default distance is 0.1 and in practice this offers the best trade-off
between bias and computation time. A smaller value will take
longer to converge, while a larger value may not result in the
optimal normalization.
K: The number of groups to split the genes into for normaliza-
tion. If any numeric value is supplied the iterative procedure of
SCnorm will not be performed. Users are not advised to set this
option. However, it may be helpful in quickly obtaining normalized
data in the instance that the user previously ran the normalization
but did not save the data or wishes to change arguments that only
affect the across-condition scaling step.
NCores: The number of cores to use. The more cores available,
the faster SCnorm will perform. By default, SCnorm will use one
less than the number of cores available on the machine.
ditherCounts: When this option is set to TRUE, counts will
be randomly jittered by 0.01 prior to fitting. With unique molecu-
lar identifier (UMI) scRNA-seq experiments, the data typically have
many tied count values, which occasionally cause the quantile
regression fit to fail. We find that dithering the counts by a small
value avoids this issue and does not otherwise affect the normaliza-
tion procedure or resulting normalized counts.
withinSample: As demonstrated in previous papers, gene-
specific features may vary across samples. We have implemented
the method from Risso et al. [14] if the user wishes to first normal-
ize the counts based on a gene-specific feature such as GC content
or gene length. This argument expects a vector of equal length to
the number of rows of Data (and in matching order) with values
representing the gene-specific feature to normalize. Note that
within sample normalization should be used with caution as it is
often specific to the experiment and exploratory analyses are highly
recommended.
useZerosToScale: If set to TRUE, the zeros will be used
when scaling across conditions. Use of this argument depends on
which downstream differential expression tool will be used. If using
methods which test zeros separately from continuous counts, such
as MAST [12] or scDD [13], this option should remain FALSE.
However, for methods such as DESeq2 [11] which test all counts
together, this flag should be set to TRUE. A detailed example is
given in Note 3.
useSpikes: We do not implement the use of spike-ins for
within group normalization at this time because there are currently
too few to estimate scale factors robustly in all groups. However,
when multiple conditions or batches are being normalized, if this
argument is TRUE then spike-ins will be used to perform the across
condition scaling. The spike-ins are expected to be named following
the convention of “ERCC-”. Additional details regarding the
use of spike-ins are given in Note 4.
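
For the FilterCellNum option described above, a required proportion of expressing cells has to be converted to a cell count. A minimal sketch of that conversion follows; `filter_cell_num` is a hypothetical helper, not an SCnorm function, and it enforces the minimum of ten stated above.

```r
# Hypothetical helper (not part of SCnorm): minimum number of nonzero cells
# implied by a required proportion, never below the required floor of ten.
filter_cell_num <- function(n_cells, proportion, floor_value = 10) {
  max(floor_value, ceiling(proportion * n_cells))
}

n25 <- filter_cell_num(92, 0.25)   # 25% of 92 cells is 23
n05 <- filter_cell_num(92, 0.05)   # 5% of 92 cells is 4.6, so the floor of 10 applies
print(c(n25, n05))
```

The resulting integer is what would be supplied as FilterCellNum for an experiment with 92 cells.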
The plotCountDepth function is used to visualize the count-depth
relationships. It includes a wrapper for internal functions that
estimate the count-depth relationships and then outputs a plot.
During the normalization, if PrintProgressPlots = TRUE, multiple
calls will be made to the plotCountDepth function; otherwise,
the function may be used stand-alone. The required
arguments are Data and Conditions, as for the SCnorm
function. All genes will be split into ten equally sized groups
based on their nonzero un-normalized median expression. A
density curve of the gene-specific count-depth relationships will be
plotted with a different color for each expression group. In addition to
outputting a plot, the function returns a list object with each
gene’s name, the expression group it belongs to, and its estimated
count-depth relationship.
The default arguments for the plotCountDepth function are:

plotCountDepth(Data, NormalizedData = NULL, Conditions = NULL,
               Tau = 0.5, FilterCellProportion = 0.10,
               FilterExpression = 0, NumExpressionGroups = 10,
               NCores = NULL, ditherCounts = FALSE)

Most of the additional arguments are the same as for the
SCnorm function, including ditherCounts, NCores,
FilterExpression, and Tau. Here we use the option
FilterCellProportion, which allows the user to directly filter genes
from consideration if they are not expressed in a certain proportion
of cells. Arguments unique to this function are:
NumExpressionGroups: The number of equally sized expres-
sion groups used to visualize the count-depth relationships. We
typically use ten groups.
NormalizedData: If evaluating normalized data, then this
argument specifies the normalized data matrix (the format is
expected to be the same as the Data argument). Supplying this
argument is critical to ensure the evaluation of the normalized
measurements is done in terms of the original sequencing depths.
2. Here we apply SCnorm to a scRNA-seq dataset of 92 H1
human embryonic stem cells. Prior to sequencing, and following
library preparation, the cDNA for each cell was split into
two pools. One pool was sequenced with 96 cells per lane,
while the other pool was sequenced with 24 cells per lane.
Since lanes average similar numbers of total reads, this setup
results in one pool of cells having an average sequencing depth
of one million (H1-1M) and the other an average of four
million (H1-4M). Prior to normalization, the H1-4M group
will appear four times higher on average than the H1-1M
group. Since the cells are exactly the same, these data provide
a benchmark to evaluate normalization procedures. The data may be
downloaded directly from GEO (file: GSE85917_Bacher.RSEM.xlsx):
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE85917

We will use the first two sheets in the Excel file, which can be
loaded into R by:

> library(readxl)
> h1cells.4M <- data.frame(read_excel("GSE85917_Bacher.RSEM.xlsx", sheet = 1),
                           stringsAsFactors = FALSE)
> h1cells.1M <- data.frame(read_excel("GSE85917_Bacher.RSEM.xlsx", sheet = 2),
                           stringsAsFactors = FALSE)
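
Because SCnorm expects a numeric matrix with gene names as row names, a loaded sheet may need one conversion step first. The sketch below assumes that the first column of the sheet holds gene names and that every other column holds counts for one cell; that layout is an assumption and should be checked against the downloaded file. The helper `to_count_matrix` and the toy data frame are hypothetical.

```r
# Hypothetical helper: assumes column 1 holds gene names and the remaining
# columns hold per-cell counts. Verify this against your own file.
to_count_matrix <- function(df) {
  mat <- as.matrix(df[, -1, drop = FALSE])
  storage.mode(mat) <- "double"
  rownames(mat) <- as.character(df[[1]])
  mat
}

# Toy stand-in for one sheet of the Excel file.
toy <- data.frame(Gene = c("ERCC-00002", "GAPDH", "POU5F1"),
                  H1_cell1 = c(120, 35, 0),
                  H1_cell2 = c(98, 41, 7),
                  stringsAsFactors = FALSE)
toy_mat <- to_count_matrix(toy)
print(toy_mat)
```

With row names in place, later steps that look up genes by name, such as selecting spike-ins with the “ERCC-” prefix, work directly on the matrix.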

Next, we visualize the variability in the count-depth relationship
across each of the datasets.

> library(SCnorm)
> cdr.1M <- plotCountDepth(Data = h1cells.1M,
                           Conditions = rep("1M", ncol(h1cells.1M)))
> cdr.4M <- plotCountDepth(Data = h1cells.4M,
                           Conditions = rep("4M", ncol(h1cells.4M)))

The results are shown in Fig. 2a, b. The count-depth relationship
varies across genes, indicating that SCnorm should be used to
normalize the data.
Using SCnorm we can normalize each batch independently and
then scale across the batches by specifying the Conditions
argument:

> Conditions <- c(rep("1M", ncol(h1cells.1M)), rep("4M", ncol(h1cells.4M)))
> allData <- cbind(h1cells.1M, h1cells.4M)
> myNormData <- SCnorm(Data = allData, Conditions = Conditions,
                       PrintProgressPlots = TRUE, reportSF = TRUE)

Since PrintProgressPlots = TRUE, each attempted value of
K will be plotted until the procedure reaches convergence (Fig. 2c, d).
SCnorm finished in 11 min using three cores.
The output is obtained by using the results function:

> normData <- results(myNormData, type = "NormalizedData")
> genesOUT <- results(myNormData, type = "GenesFilteredOut")
> scaleFactors <- results(myNormData, type = "ScaleFactors")

3. Single-cell specific differential expression methods such as
MAST [12] and scDD [13] test the continuous (or nonzero)
counts separately from zero counts. Thus, the default option
[Figure 2 graphic. Panels A (H1-1M) and B (H1-4M) show the density of per-gene slopes for ten expression groups, labeled by group median expression (1.67 to 202.96 for H1-1M; 2 to 852.85 for H1-4M). Panels C and D show the density of slopes for the normalized counts.]

Fig. 2 Count-depth relationships before and during normalization for the H1-1M and H1-4M data. (a) For the
H1-1M dataset, the estimated count-depth relationships for all genes within ten equally sized gene groups
where genes were grouped by their median expression among nonzero un-normalized measurements. (b)
Similar to a for the H1-4M dataset. (c) The count-depth relationship is shown for the normalized counts for
each value of K tried by SCnorm for the H1-1M dataset. The genes remain in their initial expression groups as
shown in a. (d) Similar to c but for the H1-4M dataset

for SCnorm when scaling across conditions is
useZerosToScale = FALSE in order to ensure the nonzero counts are
scaled appropriately. The reason for this is demonstrated in
Fig. 3 for a single-gene example in the H1-1M and H1-4M
datasets.
In this example, since the two conditions contain the exact
same cells, the normalized counts should have the same mean
across the two groups. When useZerosToScale = FALSE, SCnorm
estimates across-condition scale factors based on the nonzero
expression counts and the resulting nonzero means are equal (Fig. 3a).
This will lead to an appropriate call of not differentially expressed by
methods such as MAST [12] and scDD [13]. Figure 3b demonstrates
that if the DE method includes zeros in the differential
expression test, then the overall means of the two groups will not
be equal. This effect is more pronounced in this example since the
two conditions have very different proportions of zeros. To avoid
an incorrect call of differential expression, the argument

[Figure 3 graphic. Four panels (A to D) plot log(expression + 1) against log sequencing depth for one gene, with the H1-1M and H1-4M condition means marked. Columns: mean of nonzero expression (left) and mean of all expression (right). Rows: zeros excluded from (top) or included in (bottom) the across-condition scaling.]

Fig. 3 For a single gene, the log of the normalized counts for both the H1-1M (blue) and H1-4M (red) datasets
versus log sequencing depth are shown. A constant of one was added to the counts before taking the log to
highlight the zero counts. The top and bottom rows contain the normalized counts when
useZerosToScale = FALSE or useZerosToScale = TRUE, respectively. The left column shows the condition-specific means
calculated on the nonzero counts only. The right column shows the condition-specific means calculated over
all counts

useZerosToScale = TRUE should be used instead, and SCnorm will scale
across conditions including the zeros in the scale factor estimation.
Under this option, the means of the nonzero counts will not be
equal if the proportion of zeros differs across conditions: the
nonzero mean of the group with more zeros will appear higher
(Fig. 3c). However, the gene means including zeros
will be equal (Fig. 3d).
The results of the across condition scaling under the two
options are most critical when the conditions have very different
proportions of zeros. The useZerosToScale = FALSE option is best
suited to DE methods which treat zeros separately, while
useZerosToScale = TRUE is best suited to DE methods that consider
all counts together.
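
The effect described in Note 3 can be reproduced with simple arithmetic on a toy gene. In the sketch below, the two conditions have identical nonzero values and differ only in the proportion of zeros; the values and the helper `nonzero_mean` are illustrative, and the final rescaling is a simplified stand-in for scaling with zeros included.

```r
# Toy normalized counts for one gene in two conditions (illustrative values).
# The nonzero values are the same; only the number of zeros differs.
cond_1M <- c(0, 0, 0, 0, 0, 0, 8, 10, 12, 10)    # 60% zeros
cond_4M <- c(0, 0, 8, 10, 12, 10, 8, 10, 12, 10) # 20% zeros

nonzero_mean <- function(x) mean(x[x > 0])

nonzero_means <- c(H1_1M = nonzero_mean(cond_1M), H1_4M = nonzero_mean(cond_4M))
overall_means <- c(H1_1M = mean(cond_1M), H1_4M = mean(cond_4M))
print(nonzero_means)   # equal: 10 and 10
print(overall_means)   # unequal: 4 and 8

# Forcing the overall means to agree inflates the nonzero mean of the
# condition with more zeros.
rescaled_1M <- cond_1M * mean(cond_4M) / mean(cond_1M)
print(c(overall = mean(rescaled_1M), nonzero = nonzero_mean(rescaled_1M)))
```

Which of the two views is appropriate depends only on how the downstream DE method treats zeros, which is why the choice is exposed as an argument rather than fixed.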
4. If spike-ins are present in the data, they may be used to perform
the across condition scaling. First, we want to check that the
spike-ins do not take up an extreme proportion of total counts.
Following with the data above, we can check this directly:
> spikes <- grep("ERCC-", rownames(allData), value = TRUE)
> spikeRatio <- colSums(allData[spikes, ]) / colSums(allData)
> head(sort(spikeRatio, decreasing = TRUE))
P96_H1b5s_026 P24_H1b5s_026 P24_H1b5s_061 P96_H1b5s_061 P24_H1b5s_049 P96_H1b5s_049
    0.1975064     0.1972390     0.1703899     0.1677037     0.1458059     0.1433341

Cells in which spike-ins account for more than 20% of the counts are
typically removed prior to analysis during quality control. In addition,
it should be checked that the proportion of spike-ins is not drastically
different across cells or groups. Here we check the average proportion
across the two groups:

> mean(spikeRatio[colnames(h1cells.1M)])
0.04553682
> mean(spikeRatio[colnames(h1cells.4M)])
0.0460152
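
The 20% rule mentioned above can be applied with a few lines of base R. The sketch uses a toy matrix with illustrative values and hypothetical object names; only the “ERCC-” naming convention and the 20% threshold come from the text.

```r
# Toy counts (illustrative): two spike-ins and three genes across four cells.
counts <- matrix(c( 50,  5, 10, 300,
                    30,  5, 10, 200,
                   200, 90, 80, 100,
                   100, 60, 70,  50,
                   120, 40, 30,  50),
                 nrow = 5, byrow = TRUE,
                 dimnames = list(c("ERCC-00002", "ERCC-00003",
                                   "GAPDH", "ACTB", "POU5F1"),
                                 paste0("cell", 1:4)))

spikes <- grep("^ERCC-", rownames(counts), value = TRUE)
spikeRatio <- colSums(counts[spikes, , drop = FALSE]) / colSums(counts)

keep <- spikeRatio <= 0.20            # threshold taken from the text above
filtered <- counts[, keep, drop = FALSE]
print(round(spikeRatio, 3))
print(colnames(filtered))
```

In this toy example only the fourth cell exceeds the threshold, so it is the only one dropped.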

Specifically, for SCnorm, we want to make sure the spike-ins
span the range of expression and are representative of biological
genes. We check this by calculating how many spike-ins are in each
expression group:

> spikeGroups <- subset(cdr.1M[[1]], Gene %in% spikes)
> table(spikeGroups$Group)
 1  2  3  4  5  6  7  8  9 10
 5  3  6  4  7  4  4  3  3 16
> spikeGroups <- subset(cdr.4M[[1]], Gene %in% spikes)
> table(spikeGroups$Group)
 1  2  3  4  5  6  7  8  9 10
 7  3  3  5  6  6  4  5  3 16

Not surprisingly, the largest number of spike-ins falls in the tenth
expression group, which contains the most highly expressed genes.
In this experiment, the spike-ins appear to span the range of
expression well and may be considered for across-condition scaling.
> myNormData.spikesUsed <- SCnorm(Data = allData, Conditions =

Conditions, PrintProgressPlots = TRUE, useSpikes=TRUE)

The use of spike-ins for normalization should be carefully
considered in practice; they must be of high quality across cells
and representative of biological genes [8, 15].

References

1. Robinson MD, Oshlack A (2010) A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol 11(3):R25
2. Bacher R et al (2017) SCnorm: robust normalization of single-cell RNA-seq data. Nat Methods 14(6):584–586
3. Anders S, Huber W (2010) Differential expression analysis for sequence count data. Genome Biol 11(10):R106
4. Li B, Dewey CN (2011) RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics 12:323
5. Risso D, Ngai J, Speed TP, Dudoit S (2014) Normalization of RNA-seq data using factor analysis of control genes or samples. Nat Biotechnol 32(9):896–902
6. Kaufman L, Rousseeuw P (1987) Clustering by means of medoids. In Statistical Data Analysis Based on the L1 Norm and Related Methods (pp. 405–416). North-Holland, Amsterdam
7. Mächler M, Rousseeuw P, Struyf A, Hubert M, Hornik K (2012) Cluster: cluster analysis basics and extensions. R package version, 1(2), 56
8. Bacher R, Kendziorski C (2016) Design and computational analysis of single-cell RNA-sequencing experiments. Genome Biol 17(1):63
9. Morgan M, Obenchain V, Hester J, Pagès H (2017) SummarizedExperiment: SummarizedExperiment container. https://bioconductor.org/packages/release/bioc/html/SummarizedExperiment.html
10. Leng N et al (2013) EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments. Bioinformatics 29(8):1035–1043
11. Love MI, Huber W, Anders S (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 15(12):550
12. Finak G et al (2015) MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biol 16(1):278
13. Korthauer KD et al (2016) A statistical approach for identifying differential distributions in single-cell RNA-seq experiments. Genome Biol 17(1):222
14. Risso D, Schwartz K, Sherlock G, Dudoit S (2011) GC-content normalization for RNA-seq data. BMC Bioinformatics 12(1):480
15. Stegle O, Teichmann SA, Marioni JC (2015) Computational and analytical challenges in single-cell transcriptomics. Nat Rev Genet 16(3):133–145
Chapter 3

Analysis of Technical and Biological Variability

in Single-Cell RNA Sequencing
Beomseok Kim, Eunmin Lee, and Jong Kyoung Kim

Abstract
Profiling the transcriptomes of individual cells with single-cell RNA sequencing (scRNA-seq) has been
widely applied to provide a detailed molecular characterization of cellular heterogeneity within a population
of cells. Despite recent technological advances of scRNA-seq, technical variability of gene expression in
scRNA-seq is still much higher than that in bulk RNA-seq. Accounting for technical variability is therefore a
prerequisite for correctly analyzing single-cell data. This chapter describes a computational pipeline for
detecting highly variable genes exhibiting higher cell-to-cell variability than expected by technical noise.
The basic pipeline using the scater and scran R/Bioconductor packages includes deconvolution-based
normalization, fitting the mean-variance trend, testing for nonzero biological variability, and visualization
with highly variable genes. An outline of the underlying theory of detecting highly variable genes is also
presented. We illustrate how the pipeline works by using two case studies, one from mouse embryonic stem
cells with external RNA spike-ins, and the other from mouse dentate gyrus cells without spike-ins.

Key words Single-cell RNA-seq, Technical variability, Biological variability, Cell-to-cell variability,
Gene expression noise, Highly variable genes

1 Introduction

Since the first paper showing the feasibility of characterizing the

transcriptomes of individual cells was published by Tang and col-
leagues in 2009 [1], single-cell RNA sequencing (scRNA-seq) has
been widely applied to dissect cellular heterogeneity within a popu-
lation of cells [2]. The early experimental approaches for scRNA-
seq were limited in terms of both technical noise and throughput of
cells [3]. Two technological breakthroughs have resolved these two
issues. First, amplifying a minute amount of mRNA in a single cell
by either PCR or in vitro transcription leads to a high level of
technical cell-to-cell variability in gene expression [4]. A molecular
barcoding approach originally developed for bulk RNA-seq [5],

Beomseok Kim and Eunmin Lee contributed equally to this work.

Guo-Cheng Yuan (ed.), Computational Methods for Single-Cell Data Analysis, Methods in Molecular Biology, vol. 1935,
https://doi.org/10.1007/978-1-4939-9057-3_3, © Springer Science+Business Media, LLC, part of Springer Nature 2019


which barcodes each mRNA molecule with randomly synthesized
oligonucleotides (also known as unique molecular identifiers,
UMIs), has been adapted to protocols for scRNA-seq to reduce
technical noise arising from amplification bias [6]. Second, single-
cell isolation methods combining microfluidics technologies with
combinatorial cellular indexing have enabled the transcriptomes of
tens of thousands of single cells to be profiled [7, 8].
Despite recent technological advances of scRNA-seq, several
technical challenges remain to be overcome. Both inefficient
reverse transcription and insufficient sequencing depth per cell
result in dropout events and sparsity in a gene expression matrix,
which are the major cause of much higher technical variability of
gene expression in scRNA-seq than in bulk RNA-seq [4]. Stochastic
gene expression arising from random biochemical reactions and
transcriptional bursting is also responsible for a higher level of
biological variability in scRNA-seq [9]. Accounting for both tech-
nical and biological variability is therefore a prerequisite for cor-
rectly analyzing single-cell data.
To overcome this problem, a general workflow for analyzing
scRNA-seq data has been widely adopted [10]. Given a gene
expression matrix, we first remove poor-quality cells based on
various quality control metrics [11]. After properly normalizing
raw read or UMI counts by correcting for the effects of cell-level
technical covariates, we identify highly variable genes showing
higher cell-to-cell variability than expected by technical variability
[4]. These genes are then used to visualize high-dimensional single-
cell data in a two- or three-dimensional space, and to cluster cells
into groups sharing similar expression patterns. Finally, based on
the identified cell clusters, we identify marker genes for each cluster
and reconstruct developmental trajectories by ordering cells in
pseudo-time [12].
In this chapter, we describe a pipeline for identifying highly
variable genes from scRNA-seq data with or without external RNA
spike-ins. We first provide an outline of the underlying theory for
accounting for technical variability in terms of variance decomposi-
tion. We then describe how highly variable genes can be detected
with the help of external RNA spike-ins. Since external spike-ins are
not available in high-throughput scRNA-seq data, alternative
methods inferring technical noise from the overall mean-variance
trend of endogenous genes are also discussed.

2 Materials

Our protocol for analyzing technical and biological variability from

scRNA-seq data requires the R programming language for statisti-
cal computing and optionally an integrated development environ-
ment for R (e.g., RStudio).
2.1 R/Bioconductor Packages

The following R/Bioconductor packages are required:

1. scater [13]: Single-cell analysis toolkit for gene expression data
in R (https://bioconductor.org/packages/release/bioc/
html/scater.html),
2. scran [14]: Methods for single-cell RNA-seq data analysis
(https://bioconductor.org/packages/release/bioc/html/
scran.html).
The Bioconductor packages can be installed by using biocLite():

source("https://bioconductor.org/biocLite.R")
biocLite("scater")
biocLite("scran")

2.2 scRNA-seq Datasets

To demonstrate how the pipeline works with or without the help of
spike-ins, we use two different scRNA-seq datasets: (1) mouse
embryonic stem cells (mESCs) with external RNA spike-ins [15],
and (2) micro-dissected cells from mouse dentate gyrus without
external RNA spike-ins [16].

2.2.1 scRNA-seq Dataset with Spike-Ins

This dataset consists of 704 mESCs cultured in three different
conditions: serum + LIF (three replicates), 2i + LIF (four replicates),
and alternative 2i + LIF (two replicates). All the mESCs passed cell-
level quality control criteria. For each replicate, 96 single cells were
captured with the Fluidigm C1 system and 92 ERCC RNA spike-
ins were added to the cell lysate. The cDNA and Illumina libraries were
prepared with the SMARTer Kit and the Nextera XT Kit, respec-
tively. Of the four replicates of 2i-cultured mESCs, we will use two:
2i2 and 2i3. The first replicate (2i2) has poor-quality
spike-ins while the other replicate (2i3) has good-quality spike-ins.
The raw read count table is publicly available at
https://www.ebi.ac.uk/teichmann-srv/espresso.

2.2.2 scRNA-seq Dataset without Spike-Ins

This dataset contains 5454 cells from the developing mouse dentate
gyrus, which were sampled at four postnatal time points (P12, P16,
P24, and P35). Cells were dissociated and captured with the 10x
Genomics Chromium platform on two experimental days: day1 (P12
and P35), and day2 (P16 and P24). Low-quality cells and doublets
were filtered out. The UMI count table and the corresponding
annotation data are publicly available from the Gene Expression
Omnibus (GEO) under accession number GSE95315.

3 Methods

Accounting for technical variability in scRNA-seq data has become

an essential component of statistical methods for identifying highly
variable genes and quantifying biological variability. Most such

methods decompose the total variance into technical and biological
components based on the law of total variance [4, 14, 17]. We
provide a brief introduction to the common statistical framework
for these methods.

3.1 A Statistical Framework to Account for Technical Variability

Suppose that xij is a random variable denoting the unknown num-
ber of transcripts (or concentration) of gene i in cell j. The number
of transcripts (or concentration) of gene i in cell j available for
sequencing after the cell lysis, reverse transcription, and cDNA
amplification steps is denoted by zij. We denote by kij the observed
read or UMI count of gene i in cell j. By the general theorem of
variance decomposition [18], the variance of kij can be decomposed
into:

$$\mathrm{Var}(k_{ij}) = \mathrm{E}\left[\mathrm{Var}(k_{ij} \mid z_{ij}, x_{ij})\right] + \mathrm{E}\left[\mathrm{Var}\left(\mathrm{E}[k_{ij} \mid z_{ij}, x_{ij}] \mid x_{ij}\right)\right] + \mathrm{Var}\left(\mathrm{E}[k_{ij} \mid x_{ij}]\right)$$

The first term explains the technical variability arising from
sequencing noise, which is usually modeled using a Poisson process
[19]. The second term quantifies the technical variability generated
by stochastic mRNA loss during the single-cell library preparation
steps, which is a major source of technical variability. The last term
quantifies the biological variability. The basic idea of identifying
highly variable genes is to find genes whose observed variance is not
dominated by the first two technical variability terms. In other
words, highly variable genes can be defined as genes showing
significant nonzero biological variance.
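
The three-term decomposition can be checked numerically with a small simulation. The sketch below is only an illustration under assumed distributions that are not part of this protocol: a negative binomial for the true transcript number x, binomial capture with efficiency p for z given x, and Poisson sequencing with depth factor s for k given z. All parameter values are arbitrary.

```r
set.seed(1)
n    <- 200000  # number of simulated cells for one gene (arbitrary)
mu   <- 50      # assumed mean transcript number per cell
size <- 5       # assumed negative binomial size (dispersion) parameter
p    <- 0.1     # assumed capture efficiency
s    <- 2       # assumed sequencing depth factor

x <- rnbinom(n, mu = mu, size = size)  # true transcript numbers (biological)
z <- rbinom(n, size = x, prob = p)     # molecules surviving library preparation
k <- rpois(n, lambda = s * z)          # observed counts

var_x    <- mu + mu^2 / size           # variance of the negative binomial
term_seq <- s * p * mu                 # E[Var(k | z, x)]         = E[s z]
term_cap <- s^2 * p * (1 - p) * mu     # E[Var(E[k | z, x] | x)]  = E[s^2 x p (1 - p)]
term_bio <- s^2 * p^2 * var_x          # Var(E[k | x])            = Var(s p x)

expected_total <- term_seq + term_cap + term_bio
observed_total <- var(k)
c(expected = expected_total, observed = observed_total)
```

Under these assumptions, the empirical variance of k should agree with the sum of the three analytical terms up to sampling error, and only the last term depends on the cell-to-cell variance of x.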
In principle, the technical variability terms should be estimated
from external RNA spike-ins since we can eliminate the biological
cell-to-cell variability of xij from the decomposition formula. Then,
the variance of kij for spike-ins can be simplified by the law of total
variance:

$$\mathrm{Var}(k_{ij}) = \mathrm{E}\left[\mathrm{Var}(k_{ij} \mid z_{ij})\right] + \mathrm{Var}\left(\mathrm{E}[k_{ij} \mid z_{ij}]\right)$$

It should be noted that the above two terms correspond to the
first two technical variability terms if we assume xij is a fixed and
known quantity, which is a reasonable assumption for spike-ins (see
Note 1). To transfer the technical variability estimated from spike-ins
to endogenous genes, we assume that the technical variance of
spike-ins is a nonlinear function of their mean expression levels.
By fitting a curve to the mean-variance data of spike-ins (or to
variance-derived quantities such as the coefficient of variation)
using a nonlinear regression function, we can estimate the
average technical variance of each endogenous gene at the given
mean expression level. The biological variance of endogenous genes
can be estimated by subtracting the average technical variance from
the total observed variance. If spike-ins have poor quality or are not
available, we directly estimate the mean-variance trend of technical
noise from endogenous genes by assuming that most genes are
dominated by technical variability.
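
The fit-and-subtract idea described above can be sketched in base R. This is a minimal illustration, not the scran implementation: the spike-in and gene statistics are simulated, the "true" trend is an arbitrary assumed function, and loess stands in for whatever nonlinear regression a given method uses.

```r
set.seed(2)
# Hypothetical spike-in summary statistics on the log-expression scale
spike_mean <- seq(0.5, 10, length.out = 50)
tech_trend <- function(m) 3 * exp(-0.3 * m)   # assumed technical trend
spike_var  <- tech_trend(spike_mean) + rnorm(50, sd = 0.05)

# Fit a smooth nonlinear curve to the spike-in mean-variance data
fit <- loess(spike_var ~ spike_mean, span = 0.5)

# Hypothetical endogenous genes: mean and total observed variance,
# built as technical variance plus a known biological component
gene_mean  <- c(geneA = 2, geneB = 5, geneC = 8)
gene_total <- tech_trend(gene_mean) + c(0, 2, 0.5)

# Technical variance = fitted trend at each gene's mean;
# biological variance = total variance minus technical variance
gene_tech <- predict(fit, newdata = data.frame(spike_mean = gene_mean))
gene_bio  <- gene_total - gene_tech
round(cbind(total = gene_total, tech = gene_tech, bio = gene_bio), 2)
```

In this toy setting the estimated biological components should land close to the values that were added (0, 2, and 0.5), because the fitted curve recovers the assumed technical trend.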
To test whether the biological variance of kij is equal to a
specified value (usually set to 0), the ratio of the sample variance
of the normalized counts to the expected variance under the null
hypothesis is usually used as a test statistic. The test statistic under
the null hypothesis follows a chi-squared distribution if we assume
that the normalized counts follow a normal distribution. To make
this normality assumption more reasonable, some methods (e.g.,
scran) take log-transformed normalized counts as their input
expression values.
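
As a minimal sketch of this test, the textbook chi-squared test for a variance multiplies the variance ratio by the degrees of freedom, n - 1, so that the statistic follows a chi-squared distribution with n - 1 degrees of freedom under normality. The function name variance_test and the toy values are hypothetical, and the exact test used by a given package (for example, one that accounts for a design matrix) may differ.

```r
# One-sided test of H0: Var(x) equals sigma2_null, against a larger variance
variance_test <- function(x, sigma2_null) {
  df   <- length(x) - 1
  stat <- df * var(x) / sigma2_null            # scaled variance ratio
  p    <- pchisq(stat, df = df, lower.tail = FALSE)
  list(statistic = stat, df = df, p.value = p)
}

# Toy log-expression values of one gene across five cells
x   <- c(1, 2, 3, 4, 5)
res <- variance_test(x, sigma2_null = 2.5)     # null variance, e.g. the technical estimate
res$p.value
```

Here sigma2_null plays the role of the expected variance under the null hypothesis, which in the framework above would be the estimated technical variance at the gene's mean expression level.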

3.2 Identifying Highly Variable Genes with External RNA Spike-Ins

3.2.1 Data Loading and Normalization

We first load all R and Bioconductor packages we need in this
protocol, and then load the raw read count table for mESCs from
counttable_es.txt.

library(scater)
library(scran)
ct <- as.matrix(read.table("counttable_es.txt",
sep = " ",
header = T,
row.names = 1,
check.names = FALSE))

From the count matrix, we select rows whose names start with
“ENSMUSG” (Ensembl mouse gene ID) or “ERCC” (ERCC spike-in
ID), and columns corresponding to one replicate of the 2i condi-
tion (2i3). The chosen count matrix is used to create a
SingleCellExperiment object from scater which will serve as a data
container compatible with many other Bioconductor packages
including scran. The rows corresponding to spike-ins can be set
using the isSpike function from scran.

ct <- ct[grepl("^ENSMUSG|^ERCC-", rownames(ct)),
         grepl("_2i_3", colnames(ct))]
sceset <- SingleCellExperiment(assays = list(counts = ct))
# Mark the ERCC rows as spike-ins (the object created above is sceset)
isSpike(sceset, "ERCC") <- grepl("^ERCC-", rownames(sceset))
sceset

## class: SingleCellExperiment

## dim: 38653 59

## metadata(0):

## assays(1): counts

## rownames(38653): ENSMUSG00000000001 ENSMUSG00000000003 ...

## ERCC-00170 ERCC-00171

## rowData names(0):

## colnames(59): ola_mES_2i_3_10.counts ola_mES_2i_3_11.counts ...

## ola_mES_2i_3_92.counts ola_mES_2i_3_96.counts

## colData names(0):

## reducedDimNames(0):

## spikeNames(0):

The raw read count of a gene in a cell is affected by cell-specific

technical and biological factors. The cell-to-cell differences in the
efficiency of cell lysis, capture efficiency of mRNA transcripts, and
sequencing depth are known to affect the raw read counts. The
total RNA content and cell size can also differ from cell to cell,
which are biological factors affecting the amount of biological RNA
available for sequencing and the raw read count. To eliminate the
effects of these technical and biological factors, we normalize the
raw read count by dividing each count by its cell-level size factor.
Two different size factors are calculated with the computeSumFactors
and computeSpikeFactors functions from scran (see
Note 2): biological size factors using endogenous genes and
technical size factors using spike-ins. The biological size factors are
computed by the deconvolution approach to deal with the sparsity
of single-cell data [20]. Based on the assumption that a majority of
genes are not differentially expressed, cell-specific biases arising
from both technical and biological factors are normalized with
the biological size factors. In contrast, the technical size factors are
computed based on the spike-in counts to adjust for the effects of
technical factors. Since the same amount of spike-ins is added to
each cell, the cell-to-cell differences in total RNA content are not
normalized with the technical size factors (see Note 3). Using the
normalize function from scater, we normalize the raw read
counts of endogenous genes with the biological size factors, and
those of spike-ins with the technical size factors. The function
calculates log2-transformed normalized expression values after adding a
pseudo-count of 1, which are stored in logcounts or exprs of
the returned SingleCellExperiment object.

sceset <- computeSumFactors(sceset, sizes = seq(10, 50, 5))

sceset <- computeSpikeFactors(sceset, general.use = FALSE)

sceset <- normalize(sceset)
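
The arithmetic performed by this normalization step can be sketched with a toy matrix in base R. Note the assumption: the size factors below are simple library-size factors scaled to a mean of 1, used only as a stand-in; they are not the deconvolution or spike-in size factors computed by scran.

```r
# Toy count matrix: 3 genes (rows) x 2 cells (columns)
counts <- matrix(c(10, 0, 5,
                   20, 0, 10),
                 nrow = 3,
                 dimnames = list(c("g1", "g2", "g3"), c("cell1", "cell2")))

# Stand-in size factors: library sizes scaled to have mean 1 (assumption)
lib_size     <- colSums(counts)
size_factors <- lib_size / mean(lib_size)

# Divide each cell (column) by its size factor,
# then log2-transform after adding a pseudo-count of 1
norm_counts <- sweep(counts, 2, size_factors, "/")
log_counts  <- log2(norm_counts + 1)
log_counts
```

In this example cell2 was sequenced twice as deeply as cell1 but has the same composition, so the two columns become identical after dividing by the size factors.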

3.2.2 Detecting Highly Variable Genes

From the log2-transformed normalized expression values, we first
fit a curve to the mean-variance values of spike-ins with the trendVar
function of scran (see Note 4).

var.fit <- trendVar(sceset, parametric = TRUE, use.spikes = TRUE)

The fitted value of the curve at a given mean expression level is

used as an estimate of the technical component of the total variance of
an endogenous gene expressed at the same mean level. The biological
variance can be obtained by subtracting the estimated technical vari-
ance from the total observed variance for each endogenous gene.
Under the null hypothesis that the total variance is equal to the
estimated technical variance (or equivalently the biological variance
is equal to 0), we can test whether the total variance is larger than the
expected technical variance with the decomposeVar function of
scran. Highly variable genes can be defined as genes showing signifi-
cant nonzero biological variance. In this example, we consider a gene
to be highly variable if it has a biological variance greater than 0.5 and
its FDR is less than or equal to 0.05 (see Note 5).
var.out <- decomposeVar(sceset, var.fit)

hvg <- var.out[which(var.out$FDR <= 0.05 & var.out$bio > .5 ),]
nrow(hvg)

## [1] 2109
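
The selection rule itself is simple to reproduce on a toy table. The sketch below assumes that the FDR column is a Benjamini-Hochberg adjustment of the p-values, which is a common choice but is an assumption here rather than something stated in this protocol; the gene names and values are hypothetical.

```r
# Hypothetical per-gene test results
res <- data.frame(
  gene    = c("g1", "g2", "g3", "g4"),
  bio     = c(1.0, 0.2, 0.9, 2.0),        # estimated biological variance
  p.value = c(0.001, 0.01, 0.02, 0.8),
  stringsAsFactors = FALSE
)

# Benjamini-Hochberg adjustment of the p-values (assumed FDR definition)
res$FDR <- p.adjust(res$p.value, method = "BH")

# Keep genes that are both significant and above the effect-size threshold
hvg_toy <- res[res$FDR <= 0.05 & res$bio > 0.5, ]
hvg_toy$gene
```

Combining a significance threshold with a minimum biological variance avoids selecting genes whose variance is statistically but not practically above the technical trend (g2) as well as genes with large but non-significant estimates (g4).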

head(hvg)

## DataFrame with 6 rows and 6 columns
##                                mean            total              bio
##                           <numeric>        <numeric>        <numeric>
## ENSMUSG00000000001 6.87377936714626 7.63664354974175 4.94614195802085
## ENSMUSG00000000028 7.16813281638306 8.36720349968789 6.19082343482952
## ENSMUSG00000000131 8.42914208049494 3.41985542516078 2.60116398672207
## ENSMUSG00000000171 8.87511548573431 2.87241703701955  2.2966896068407
## ENSMUSG00000000278 9.14542287426345  2.1818210345152 1.71403288145084
## ENSMUSG00000000295 6.33472585927806 6.86610866121096 2.89637950173032
##                                 tech              p.value                  FDR
##                            <numeric>            <numeric>            <numeric>
## ENSMUSG00000000001   2.6905015917209 3.78117460395952e-12 1.55277820983262e-10
## ENSMUSG00000000028  2.17638006485837 3.48532179254256e-21 2.65607694945126e-19
## ENSMUSG00000000131 0.818691438438704 2.24109579972162e-24 2.01098169057812e-22
## ENSMUSG00000000171 0.575727430178854 1.83553665593285e-32 2.67094826375195e-30
## ENSMUSG00000000278 0.467788153064363 3.51389775367481e-29 4.18208059504488e-27
## ENSMUSG00000000295  3.96972915948064 0.000474348732430948  0.00884495235554632


Figure 1a shows the mean-variance plot of the log2-

transformed expression values of both endogenous genes (black
points) and spike-ins (green points) in 2i3, where the blue line
represents the fitted mean-variance trend based on the spike-ins
and red points correspond to highly variable genes.

plot(y = var.out$total, x = var.out$mean,

ylab = "Variance of log-expression", xlab = "Mean log-expression",
pch = 16,cex = 0.3)
o <- order(var.out$mean)
lines(y = var.out$tech[o], x = var.out$mean[o],
col = "dodgerblue", lwd = 2)
points(var.fit$mean, var.fit$var,
col = "green", pch = 16, cex = 0.3)

points(y = var.out$total[var.out$FDR <= 0.05 & var.out$bio >= 0.5 ],

x = var.out$mean[var.out$FDR <= 0.05 & var.out$bio >= 0.5 ],

col = "red", pch = 16, cex = 0.3)

[Figure 1: mean-variance plots, panels A and B; x-axis: Mean log-expression; y-axis: Variance of log-expression]

Fig. 1 Mean-variance plots of log2-transformed normalized expression values of endogenous genes (black
points) for 2i3 with good-quality spike-ins (a) and 2i2 with poor-quality spike-ins (b). Each green point
represents a spike-in, and the blue line corresponds to the fitted mean-variance trend based on the spike-ins
(a) or endogenous genes (b). Detected highly variable genes are colored red

If we fail to add the same amount of spike-ins to each cell or the

proportion of reads mapped to spike-ins is too low (<0.01% in 2i2
compared to 44% in 2i3), we can use an alternative approach by
directly fitting the mean-variance trend to the normalized expres-
sion values of endogenous genes under the assumption that the
technical components dominate the total variance in most genes
(Fig. 1b, see Note 6):

var.fit <- trendVar(sceset, parametric = TRUE, use.spikes = FALSE)
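
The proportion of reads mapped to spike-ins mentioned above can be computed per cell directly from a count matrix. The following is a small sketch on a made-up matrix; the object name ct_toy and its values are hypothetical, and the "^ERCC-" pattern follows the row-name convention used earlier in this subheading.

```r
# Toy count matrix: two endogenous genes and two ERCC spike-ins, two cells
ct_toy <- matrix(c(90, 10, 40, 60,
                   50, 50,  0,  0),
                 nrow = 4,
                 dimnames = list(c("ENSMUSG_a", "ENSMUSG_b",
                                   "ERCC-00002", "ERCC-00003"),
                                 c("cellA", "cellB")))

# Fraction of each cell's counts that comes from spike-in rows
is_spike   <- grepl("^ERCC-", rownames(ct_toy))
spike_prop <- colSums(ct_toy[is_spike, , drop = FALSE]) / colSums(ct_toy)
spike_prop
```

A cell such as cellB, with essentially no spike-in counts, cannot contribute to a spike-in based trend, which is the situation in which fitting the trend to endogenous genes becomes the practical option.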

3.3 Identifying Highly Variable Genes without External RNA Spike-Ins

In this section, we demonstrate how the basic pipeline described in
Subheading 3.2 can be modified to detect highly variable genes for
a large-scale scRNA-seq dataset without spike-ins. We first load the
UMI count table for mouse dentate gyrus cells from
GSE95315_10X_expression_data.tab, which is publicly available at
ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE95nnn/GSE95315/suppl/GSE95315_10X_expression_data.tab.gz.
The cell metadata containing the batch (experimental days) and cell
type for each cell is loaded from GSE95315_series_matrix.txt,
available at
ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE95nnn/GSE95315/matrix/GSE95315_series_matrix.txt.gz.

library(scater)
library(scran)
library(ggplot2)
ct <- as.matrix(read.table("GSE95315_10X_expression_data.tab",
sep = "\t",
header = T,
row.names = 1,
check.names = FALSE))
strs <- readLines("GSE95315_series_matrix.txt.gz")
dat <- read.csv(text=strs, sep = "\t",
header = T,
check.names=FALSE,
skip = 30,
nrows=length(strs) - 3 )

The count table and metadata are converted into a
SingleCellExperiment object of scater.

Annotation <- data.frame(celltype = t(dat)[2:ncol(dat), 13])
Annotation$celltype <- gsub("Neuroblast2", "Neuroblast_two",
                            Annotation$celltype)
Annotation$celltype <- gsub("cell cluster: |1|2|glia-like| ", "",
                            Annotation$celltype)
length(table(Annotation$celltype))

## [1] 22

cell.info <- data.frame(
  cell = colnames(ct),
  batch = sapply(colnames(ct), function(x) substr(x, 4, 7)),
  # Annotation (capital A) is the data frame created above
  celltype = Annotation[colnames(ct), "celltype"])
sceset <- SingleCellExperiment(
  assays = list(counts = ct), colData = cell.info)
sceset
## class: SingleCellExperiment

## dim: 14545 5454

## metadata(0):

## assays(1): counts

## rownames(14545): 0610007P14Rik 0610009B22Rik ... mt-Nd5 mt-Nd6

## rowData names(0):

## colnames(5454): 10X46_1_GCCTACACGGGAGT-1 10X46_1_AAGCACTGATGGTC-1

## ... 10X43_1_ATGAAGGAATGCCA-1 10X46_1_GGACAGGATAGCGT-1

## colData names(3): cell batch celltype

## reducedDimNames(0):

## spikeNames(0):

We normalize the UMI counts with the deconvolution method

implemented in the computeSumFactors function of scran. This
method is based on the strong assumption that most genes are not
differentially expressed across cells. However, this assumption is not
valid for large single-cell datasets composed of a mixture of different
cell types. To weaken the assumption, cells are grouped into clusters
based on their gene expression profiles with the quickCluster
function of scran. For each cluster, the deconvolution-based nor-
malization method is applied to compute the cluster-specific
biological size factors, which are then rescaled across clusters.

clusters <- quickCluster(sceset)

sceset <- computeSumFactors(sceset, cluster = clusters)
sceset <- normalize(sceset)

As spike-ins are not present in this dataset, we make an assump-
tion that the contribution of biological components to the total
variance is marginal in most genes. Based on this assumption, we fit
the mean-variance trend to the variances of log2-transformed
normalized expression values of endogenous genes by setting
use.spikes = FALSE in trendVar:

var.fit <- trendVar(sceset, method = "spline", parametric = TRUE,
                    use.spikes = FALSE, span = 0.2)

After we test whether the estimated biological variance is equal
to 0, we define highly variable genes as ones with FDR ≤ 0.05 and
biological variance ≥ 0.1.

var.out <- decomposeVar(sceset, var.fit)

hvg <- var.out[which(var.out$FDR <= 0.05 & var.out$bio >= 0.1),]

nrow(hvg)

## [1] 312

head(hvg)

## DataFrame with 6 rows and 6 columns
##                            mean             total               bio
##                       <numeric>         <numeric>         <numeric>
## 1810037I17Rik 0.680310681114725 0.908861494285629  0.22438946719247
## 2010300C02Rik 0.602083745312206 0.785529562526665 0.158613308891032
## 2900055J20Rik 0.642681674116672   0.7894952004639 0.132051146817701
## 6330403K07Rik 0.760365192522513 0.873966446743902 0.136031205136787
## Abr           0.723940378793588 0.852462429491801 0.138174667075084
## Acsbg1         0.15950685385264 0.404043372590482 0.205280822499748
##                            tech              p.value                  FDR
##                       <numeric>            <numeric>            <numeric>
## 1810037I17Rik  0.68447202709316 8.49384128900126e-55  1.2317340134449e-53
## 2010300C02Rik 0.626916253635633 9.13838303865434e-35 9.65972247799617e-34
## 2900055J20Rik 0.657444053646199  2.9741904700187e-23    2.45235829855e-22
## 6330403K07Rik 0.737935241607115 4.65272653177871e-20 3.49014478621564e-19
## Abr           0.714287762416717 8.60152901400342e-22 6.80681390145156e-21
## Acsbg1        0.198762550090734                    0                    0


In Fig. 2, we plot the mean-variance relationship of endoge-

nous genes (black points), where red points represent highly vari-
able genes, and the blue line corresponds to the fitted trend.

plot(y = var.out$total, x = var.out$mean, pch = 16, cex = 0.3,

ylab = "Variance of log-expression", xlab = "Mean log-expression")
o <- order(var.out$mean)
lines(y = var.out$tech[o], x = var.out$mean[o],
col = "dodgerblue", lwd = 2)
points(y = var.out$total[var.out$FDR <= 0.05 & var.out$bio >= 0.1],
x = var.out$mean[var.out$FDR <= 0.05 & var.out$bio >= 0.1],
col="red", pch = 16, cex = 0.3)

[Figure 2: mean-variance plot; x-axis: Mean log-expression; y-axis: Variance of log-expression]

Fig. 2 A mean-variance plot of log2-transformed normalized expression values of genes in dentate gyrus cells.
The blue line represents the fitted mean-variance trend, and highly variable genes are marked in red

Highly variable genes are informative as they vary across differ-
ent cell types or subpopulations. Thus, they can be used as an input
set of feature genes to visualize the high-dimensional single-cell
data in a two- or three-dimensional space, or to cluster cells into
groups showing similar gene expression profiles. We illustrate that
highly variable genes are useful for visualizing subpopulation struc-
ture associated with the known cell types, using t-distributed sto-
chastic neighbor embedding (t-SNE) [21] implemented in the
runTSNE function of scater. As a control, we also generate the
t-SNE plot using all genes.

sce.tsne <- runTSNE(sceset, feature_set = rownames(sceset),
                    rand_seed = 123456)
sce.tsne.hvg <- runTSNE(sceset, feature_set = rownames(hvg),
                        rand_seed = 123456)

Figure 3 shows the t-SNE plots using all genes (Fig. 3a, c) or
highly variable genes (Fig. 3b, d, see Note 7). Cell types inferred
from [16] are overlaid in Fig. 3a, b. Two experimental days
(“43_1” and “46_1”) are also overlaid in Fig. 3c, d to look for
batch effects caused by experimental days.
Fig. 3 t-SNE plots of dentate gyrus cells using all genes (a, c) or highly variable genes (b, d). Cells are
clustered by the annotated cell types (a, b), and partially by the experimental days (c, d)
# t-SNE coordinates and cell-type labels via the accessor functions
df <- data.frame(x = reducedDim(sce.tsne, "TSNE")[, 1],
                 y = reducedDim(sce.tsne, "TSNE")[, 2],
                 expression = colData(sce.tsne)$celltype)
colors <- c("#E5D8BD", "#A6CEE3", "#B3CDE3", "#FDB462", "#FED9A6",
            "#CCCCCC", "#FDC086", "#377EB8", "#FC8D62", "#FDCDAC", "#FFFF99",
            "#999999", "#E78AC3", "#7570B3", "#E7298A", "#DECBE4", "#66A61E",
            "#386CB0", "#E5C494", "#F2F2F2", "#FFED6F", "#B3B3B3")
ggplot() +
  geom_point(data = df, aes(x = x, y = y, colour = expression),
             size = 0.5) +
  scale_color_manual(values = colors) +
  ylab("Component 2") +
  xlab("Component 1") +
  # override.aes is an argument of guide_legend, not of guides
  guides(colour = guide_legend(nrow = 22, override.aes = list(size = 5)))

When we select all genes as an input set of feature genes,

granule and neuroblast cells are largely separated by their experi-
mental days. In contrast, when highly variable genes are selected,
such separation is reduced, suggesting that highly variable genes are
more useful for identifying biologically meaningful structure of
high-dimensional scRNA-seq data than all expressed genes.

4 Notes

1. The external RNA spike-ins cannot be used to estimate the

technical variability of stochastic RNA losses arising from inef-
ficient cell lysis since the spike-ins are added to cell lysates.
2. One can change the number of cells per pool by setting the
sizes argument of the computeSumFactors function. By
default, the number of cells per pool is set to a range from 20 to
100. However, an error will occur when the total number of
cells in the sample is less than this range. In this case, the
maximum pool size should be smaller than the total number
of cells as shown in Subheading 3.2. The recommended mini-
mum pool size for sparse UMI data is 20, but smaller pool size
may be possible for scRNA-seq data with high sequencing
depth. When negative size factors are obtained for some cells,
the range of pool size should be increased. Another approach is
to filter out both lowly expressed genes and poor-quality cells
with low total counts.
3. To preserve the cell-to-cell differences in total RNA content,
we can set general.use = TRUE in the computeSpikeFactors
function. In this case, both endogenous genes and spike-
ins are normalized with the technical size factors estimated
from the spike-ins.
4. If the dataset consists of multiple batches, the batch informa-
tion should be incorporated into a design matrix, which can be
passed into the trendVar function (e.g.,
design = model.matrix(~replicates)).

5. A similar approach based on the squared coefficient of variation

(CV2) of [4] is also implemented in the technicalCV2 func-
tion of scran.
6. An alternative approach, which is implemented in the DM func-
tion of scran, calculates the distance-to-median (DM) values
after fitting a running median curve to the log-transformed
CV2 of genes against to the log-transformed mean. For each
gene, the DM value is defined as the distance between the
observed CV2 and the fitted trend. Genes with high DM values
can be identified as highly variable genes.
7. The t-SNE plots can be also generated with the plotTSNE
function of scater. The code for t-SNE plots in the Subheading
3.3 is written for a better visualization.
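
As a rough illustration of the distance-to-median idea in Note 6, the following base R sketch fits a running median to the log-transformed CV² against the log-transformed mean and takes the residual as the DM value. The simulated counts, the window width k, and the helper name dm_sketch are assumptions made for illustration; this is not the scran implementation.

```r
# Hypothetical helper: distance-to-median (DM) from a genes x cells count matrix.
dm_sketch <- function(counts, k = 51) {
  mu  <- rowMeans(counts)
  cv2 <- apply(counts, 1, var) / mu^2
  keep <- mu > 0 & cv2 > 0               # logs are undefined otherwise
  lmu  <- log10(mu[keep])
  lcv2 <- log10(cv2[keep])
  ord  <- order(lmu)                     # the running median follows increasing mean
  k    <- min(k, length(ord))
  if (k %% 2 == 0) k <- k - 1            # runmed() requires an odd window width
  trend <- numeric(length(ord))
  trend[ord] <- stats::runmed(lcv2[ord], k = k)
  dm <- rep(NA_real_, nrow(counts))
  dm[keep] <- lcv2 - trend               # residual from the fitted trend
  names(dm) <- rownames(counts)
  dm
}

set.seed(1)
n_genes <- 500; n_cells <- 200
gene_mean <- rexp(n_genes, rate = 1 / 5) + 0.5
# Negative binomial counts; the first 20 genes are given extra dispersion.
size <- rep(10, n_genes); size[1:20] <- 0.3
counts <- t(sapply(seq_len(n_genes), function(g)
  rnbinom(n_cells, mu = gene_mean[g], size = size[g])))
rownames(counts) <- paste0("gene", seq_len(n_genes))

dm  <- dm_sketch(counts)
top <- names(sort(dm, decreasing = TRUE))[1:20]   # candidate highly variable genes
```

In this toy setting the genes simulated with extra dispersion should dominate the top of the DM ranking; on real data the choice of window width and cutoff would need to be examined.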

Acknowledgments

This work was supported by the National Research Foundation of

Korea funded by the Ministry of Science, ICT and Future Planning
(2017R1C1B2007843, 2017M3C7A1048448,
2017M3A9B6073099, 2017M3A9D5A01052447), and by Busi-
ness for Cooperative R&D between Industry, Academy, and
Research Institute funded by the Ministry of SMEs and Startups
(C0452791).

References
1. Tang F, Barbacioru C, Wang Y, Nordman E, Lee C, Xu N, Wang X, Bodeau J, Tuch BB, Siddiqui A, Lao K, Surani MA (2009) mRNA-Seq whole-transcriptome analysis of a single cell. Nat Methods 6(5):377–382. https://doi.org/10.1038/nmeth.1315
2. Tanay A, Regev A (2017) Scaling single-cell genomics from phenomenology to mechanism. Nature 541(7637):331–338. https://doi.org/10.1038/nature21350
3. Kolodziejczyk AA, Kim JK, Svensson V, Marioni JC, Teichmann SA (2015) The technology and biology of single-cell RNA sequencing. Mol Cell 58(4):610–620. https://doi.org/10.1016/j.molcel.2015.04.005
4. Brennecke P, Anders S, Kim JK, Kolodziejczyk AA, Zhang X, Proserpio V, Baying B, Benes V, Teichmann SA, Marioni JC, Heisler MG (2013) Accounting for technical noise in single-cell RNA-seq experiments. Nat Methods 10(11):1093–1095. https://doi.org/10.1038/nmeth.2645
5. Kivioja T, Vaharautio A, Karlsson K, Bonke M, Enge M, Linnarsson S, Taipale J (2011) Counting absolute numbers of molecules using unique molecular identifiers. Nat Methods 9(1):72–74. https://doi.org/10.1038/nmeth.1778
6. Islam S, Zeisel A, Joost S, La Manno G, Zajac P, Kasper M, Lonnerberg P, Linnarsson S (2014) Quantitative single-cell RNA-seq with unique molecular identifiers. Nat Methods 11(2):163–166. https://doi.org/10.1038/nmeth.2772
7. Macosko EZ, Basu A, Satija R, Nemesh J, Shekhar K, Goldman M, Tirosh I, Bialas AR, Kamitaki N, Martersteck EM, Trombetta JJ, Weitz DA, Sanes JR, Shalek AK, Regev A, McCarroll SA (2015) Highly parallel genome-wide expression profiling of individual cells using Nanoliter droplets. Cell 161(5):1202–1214. https://doi.org/10.1016/j.cell.2015.05.002
8. Klein AM, Mazutis L, Akartuna I, Tallapragada N, Veres A, Li V, Peshkin L, Weitz DA, Kirschner MW (2015) Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell 161(5):1187–1201. https://doi.org/10.1016/j.cell.2015.04.044
9. Kim JK, Marioni JC (2013) Inferring the kinetics of stochastic gene expression from single-cell RNA-sequencing data. Genome Biol 14(1):R7. https://doi.org/10.1186/gb-2013-14-1-r7
10. Stegle O, Teichmann SA, Marioni JC (2015) Computational and analytical challenges in single-cell transcriptomics. Nat Rev Genet 16(3):133–145. https://doi.org/10.1038/nrg3833
11. Ilicic T, Kim JK, Kolodziejczyk AA, Bagger FO, McCarthy DJ, Marioni JC, Teichmann SA (2016) Classification of low quality cells from single-cell RNA-seq data. Genome Biol 17:29. https://doi.org/10.1186/s13059-016-0888-1
12. Trapnell C, Cacchiarelli D, Grimsby J, Pokharel P, Li S, Morse M, Lennon NJ, Livak KJ, Mikkelsen TS, Rinn JL (2014) The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat Biotechnol 32(4):381–386. https://doi.org/10.1038/nbt.2859
13. McCarthy DJ, Campbell KR, Lun ATL, Wills QF (2017) Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R. Bioinformatics 33(8):1179–1186. https://doi.org/10.1093/bioinformatics/btw777
14. Lun ATL, McCarthy DJ, Marioni JC (2016) A step-by-step workflow for low-level analysis of single-cell RNA-seq data with bioconductor. F1000Res 5:2122. https://doi.org/10.12688/f1000research.9501.2
15. Kolodziejczyk AA, Kim JK, Tsang JC, Ilicic T, Henriksson J, Natarajan KN, Tuck AC, Gao X, Buhler M, Liu P, Marioni JC, Teichmann SA (2015) Single cell RNA-sequencing of pluripotent states unlocks modular transcriptional variation. Cell Stem Cell 17(4):471–485. https://doi.org/10.1016/j.stem.2015.09.011
16. Hochgerner H, Zeisel A, Lonnerberg P, Linnarsson S (2018) Conserved properties of dentate gyrus neurogenesis across postnatal development revealed by single-cell RNA sequencing. Nat Neurosci 21(2):290–299. https://doi.org/10.1038/s41593-017-0056-2
17. Kim JK, Kolodziejczyk AA, Ilicic T, Teichmann SA, Marioni JC (2015) Characterizing noise structure in single-cell RNA-seq distinguishes genuine from technical stochastic allelic expression. Nat Commun 6:8687. https://doi.org/10.1038/ncomms9687
18. Bowsher CG, Swain PS (2012) Identifying sources of variation and the flow of information in biochemical networks. Proc Natl Acad Sci U S A 109(20):E1320–E1328. https://doi.org/10.1073/pnas.1119407109
19. Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y (2008) RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res 18(9):1509–1517. https://doi.org/10.1101/gr.079558.108
20. Lun ATL, Bach K, Marioni JC (2016) Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol 17:75. https://doi.org/10.1186/s13059-016-0947-7
21. Van der Maaten L, Hinton GE (2008) Visualizing Data using t-SNE. J Mach Learn Res 9:2579–2605

Chapter 4

Identification of Cell Types from Single-Cell Transcriptomic Data
Karthik Shekhar and Vilas Menon

Abstract
Rapid advances in single-cell RNA-sequencing (scRNA-seq) technology have now
made it possible to profile genome-wide expression in single cells at low cost and high throughput. There is
substantial ongoing effort to use scRNA-seq measurements to identify the “cell types” that form compo-
nents of a complex tissue, akin to taxonomizing species in ecology. Cell type classification from scRNA-seq
data involves the application of computational tools rooted in dimensionality reduction and clustering, and
statistical analysis to identify molecular signatures that are unique to each type. As datasets continue to grow
in size and complexity, computational challenges abound, requiring analytical methods to be scalable,
flexible, and robust. Moreover, careful attention needs to be paid to experimental biases and statistical
challenges that are unique to these measurements to avoid artifacts. This chapter introduces these topics in
the context of cell-type identification, and outlines an instructive step-by-step example bioinformatic
pipeline for researchers entering this field.

Key words Single-cell RNA-sequencing, Transcriptomic classification, Cell-type identification, Cell
taxonomy, Clustering, Unsupervised machine learning, Cross-species comparison of cell-types

1 Introduction

The human body contains approximately 40 trillion cells, which
exhibit a breathtaking diversity of form and function [1]. Classifying
these cells into “types” is increasingly viewed as a foundational
requirement to gain a detailed understanding of how tissues func-
tion and interact, and to uncover specific mechanisms that underlie
pathological states [2]. Provisionally, cells of a particular type share
a common identity defined by multiple, measurable properties
pertaining to tissue location, function, signaling properties, mor-
phology, electrophysiological response, molecular composition,
and physicochemical interaction with other cell types (see below).
Knowledge of what cell types exist and what features distinguish
them will (a) facilitate genetic access to specific types so they can be
labeled and manipulated in model organisms and culture systems,

Guo-Cheng Yuan (ed.), Computational Methods for Single-Cell Data Analysis, Methods in Molecular Biology, vol. 1935,
https://doi.org/10.1007/978-1-4939-9057-3_4, © Springer Science+Business Media, LLC, part of Springer Nature 2019

(b) provide a framework to investigate the staggering cellular
heterogeneity that abounds in organisms, (c) provide mechanistic
insight into the generation of this heterogeneity during early devel-
opment, (d) provide a framework for rationally improving in vitro-
derived cell types, (e) facilitate cross-species comparisons [3], and
(f) implicate roles for specific cell types and their interactions [4] in
complex diseases [5, 6].
Although the genomes of complex mammals contain ~30,000
genes (and their multiple isoforms), the expression patterns of these
genes are not all independent of each other. Gene regulatory pro-
cesses induce correlations between the expression levels of genes,
which in turn results in a “modular” structure of the transcriptome
[7]. One consequence of this modularity is that the molecular states
of cells occupy a low dimensional subspace (often referred to as a
“manifold”) within the full space of gene expression. Advances in
single-cell RNA-sequencing (scRNA-seq) technology have enabled
cell types to be defined using the transcriptomic state of thousands
of individual cells [8–10]. In addition, the development of single-
nucleus profiling techniques has allowed for thorough investiga-
tions of frozen and banked tissue, including challenging tissues
such as adult human brain sections [11, 12]. A flurry of recent
work has shown that unbiased classification of single-cell transcrip-
tomes using computational methods rooted in clustering and
dimensionality reduction not only recovers classically defined sub-
sets of cells, but also enables discovery of novel types with unknown
functional roles [13–15]. Our goal is to introduce the reader to the
conceptual [16] and computational [17] challenges of scRNA-seq
data analysis, followed by a description of a basic practical workflow
of scRNA-seq analysis using the R statistical language.

1.1 What Is a Cell Type?

While every cell is unique, the experience of biologists over many years
has suggested that cells can be organized into groups based on
shared features that are quantifiable. This categorization makes
possible systematic and reproducible analyses of complex tissues,
similar to the concept of “species,” which greatly simplifies the
diversity of organisms into an interpretable taxonomy, while not
denying the individuality of any single member [18]. Features used
to define cell types include lineage, location, morphology, activity,
interactions with other cell types, epigenetic state, responsiveness to
certain signals, and molecular composition (including mRNA and
protein levels) [16].
scRNA-seq-based cell classification involves partitioning the
data into “clusters” of single cells, wherein each cluster is defined
by a unique gene expression “signature” relative to other clusters,
and therefore, represents a putative cell type. It must be noted,
however, that a computationally defined cluster may not necessarily
correspond 1:1 to a cell type, as the molecular state of the cell
assayed by scRNA-seq may not necessarily reflect all of the features
noted above. Moreover, certain molecular attributes are more transient
than others during the lifetime of a cell, necessitating a distinction
between a cell’s type (its principal identity) and its current
“state” (e.g., temporary changes in the firing rate of a neuron during
“up” and “down” states, or different levels of secretory activity of
endocrine cells). scRNA-seq may resolve different “states” of the
same cell type if their transcriptional signatures are sufficiently
distinct, and collapse two distinct, but closely related types if the
molecules that specified their identities during early development
are no longer expressed during the stage of the experiment.
Thus, even when restricted to the molecular state, the differ-
ence between cell “types” and “states” is not resolvable through
RNA-seq alone, and may require examination in other modalities,
such as those that capture information about the cell’s epigenetic
state or its dynamical responses. Taken together, these caveats
warrant caution in the interpretation of scRNA-seq data, especially
in the context of identifying cell types exclusively from transcrip-
tomic information. The notion of cell types is being constantly
refined through ongoing work as part of large-scale projects such
as the Human Cell Atlas [2] and the BRAIN initiative [19].

1.2 A Brief Overview of scRNA-Seq

scRNA-seq is not a single method, but a suite of protocols, each
with its strengths and limitations [20]. Currently, every scRNA-seq
protocol consists of three steps (Fig. 1): (1) single-cell capture and
barcoding, (2) library preparation, and (3) sequencing. Current
protocols isolate single cells by tissue dissociation, followed by
either fluorescence-activated cell sorting (FACS) into separate
wells on a plate or capture of individual cells in microfluidic chambers,
microwells, or individual droplets. Prior to single-cell capture,
the dissociated cells can optionally be taken through a sorting step
using FACS or magnetic-activated cell sorting (MACS) to enrich or
deplete cells expressing a specific combination of markers. Library
preparation involves reverse transcribing mRNA into cDNA and
amplifying it, using either polymerase chain reaction (PCR) or
in vitro transcription (IVT). Recently developed protocols tag tran-
scripts during the capture stage (step 1 above) with unique molec-
ular identifiers (UMIs), which are random nucleotide sequences
[21]. Every captured transcript is, in principle, tagged with a dis-
tinct UMI, which enables downstream correction of amplification
biases. The amplified cDNA is then fragmented, followed by the
addition of molecular adapters at the end of amplicon fragments
that allow for high-throughput sequencing. Libraries can either
retain the full length of every transcript or tag either the 3′ or the
5′ end of each mRNA; the choice is informed by further considerations.
Sequencing is generally highly multiplexed, and can either
be single-end or paired-end depending on upstream choices. An
important consideration can be the depth of sequencing per cell,
which is often related to the number of cells profiled [22].
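
To make the role of UMIs concrete, the sketch below collapses reads that share the same cell barcode, gene, and UMI into a single molecule, so that amplification copies do not inflate the counts. The toy read table and its column names are assumptions; real pipelines also have to handle sequencing errors in barcodes and UMIs, which this sketch ignores.

```r
# Toy aligned reads: each row is one sequenced read (hypothetical values).
reads <- data.frame(
  cell = c("ACCGT", "ACCGT", "ACCGT", "CGGTT", "CGGTT", "TGTCG"),
  gene = c("G1",    "G1",    "G2",    "G1",    "G1",    "G2"),
  umi  = c("AAT",   "AAT",   "CGA",   "TTC",   "GGA",   "CGA"),
  stringsAsFactors = FALSE
)

# Reads sharing (cell, gene, umi) are treated as amplification copies of one molecule.
molecules <- unique(reads[, c("cell", "gene", "umi")])

# Genes x cells tables of raw read counts and of UMI (molecule) counts.
read_counts <- table(gene = reads$gene, cell = reads$cell)
umi_counts  <- table(gene = molecules$gene, cell = molecules$cell)
```

Comparing read_counts with umi_counts shows where two reads carried the same UMI and were therefore counted as one transcript.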

[Figure 1: schematic of the scRNA-seq workflow. Panels: isolation of cells into tubes/wells or droplets with oligonucleotide barcodes; lysis and reverse transcription to barcoded cDNA; pooling and amplification; next-generation sequencing; alignment, demultiplexing, and quantification into a transcript-by-cell-barcode counts table.]

Fig. 1 General experimental workflow for single-cell RNA-sequencing, as described in detail in the text,
starting from cell isolation and extending through to the generation of counts tables showing the detection of
each gene in each cell

1.3 Batch Effects in scRNA-Seq Analysis

Data-driven identification of cell types can be confounded by batch
effects, which result from minor, but systematic differences
between experimental replicates prepared either at different times,
using different reagent batches, different experimenters, or a com-
bination of the three [23]. Batch effects can result in variation in the
transcriptomic state of identical cell types across different replicates
due to technical factors; when such effects are strong, cells can
cluster by batch identity rather than biological identity. Batch
effects can also arise if in addition to transcriptional differences,
the frequencies of specific cell types are different across batches
[24, 25]. If different biological conditions of interest (e.g.,
control vs. perturbation) or different sample sources (e.g., biopsies
from cancer patients) are processed in different batches, it is statis-
tically impossible to deconvolve biological versus technical effects.
Batch effects can be mitigated through careful experimental
design involving an even distribution of different biological conditions
across experimental batches (“block design”), although this
may not always be logistically feasible if delays in sample processing
can compromise quality. In such circumstances, cell types and
molecular signals identified in a single experimental batch must be
treated with suspicion, and results should only be believed if they are
supported across multiple independent replicates, or in other data
modalities. Detecting and correcting batch effects is an ongoing
area of computational innovation, and a number of approaches
have been recently proposed [24–26].
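
The point about confounded designs can be checked directly from the sample annotation. In the sketch below, a design matrix with both condition and batch terms loses rank when each condition is processed in its own batch, so the two effects cannot be estimated separately, whereas a block design keeps full rank. The sample tables and the helper name is_estimable are hypothetical.

```r
# Hypothetical helper: TRUE if condition and batch effects are separately estimable.
is_estimable <- function(condition, batch) {
  design <- model.matrix(~ condition + batch)
  qr(design)$rank == ncol(design)    # a rank-deficient design means confounding
}

# Confounded: every control sample is in batch b1, every perturbed sample in b2.
confounded <- data.frame(
  condition = factor(c("control", "control", "perturbed", "perturbed")),
  batch     = factor(c("b1", "b1", "b2", "b2"))
)

# Block design: each batch contains both conditions.
blocked <- data.frame(
  condition = factor(c("control", "perturbed", "control", "perturbed")),
  batch     = factor(c("b1", "b1", "b2", "b2"))
)

ok_confounded <- is_estimable(confounded$condition, confounded$batch)
ok_blocked    <- is_estimable(blocked$condition, blocked$batch)
```

A check of this kind only detects perfect confounding; partially unbalanced designs pass the rank test but can still make batch correction unreliable.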
Future promising avenues of research involve the integration of
scRNA-seq data directly with other data modalities. In particular,
recent developments linking RNA-seq to spatial location (such as
FISSEQ [27] and “Spatial Transcriptomics” [28]), combined with
the advent of high-resolution and expansion microscopy, are on the
verge of collecting transcriptome-wide information at the single-
cell level in situ, without the need for cell dissociation. Besides
removing any dissociation-related biases in cell type or transcript,
integration of transcriptomics and spatial location would create
tissue-based atlases of cell types, providing an unbiased version of
highly multiplexed in situ hybridization methods [29, 30]. Similarly,
other cross-modality technologies are also at various stages of
maturation: these include linking single-cell RNA-seq with electro-
physiological measurements (Patch-Seq [31]), gene perturbations
(CRISPR-Seq and Perturb-Seq [32]), protein expression (CITE-
Seq [33]), and lineage tracing (MEMOIR [34], scGESTALT
[35]). The large-scale use of all of these technologies, as well as
others, is on the horizon, and will result in new multi-modal
classification and characterization of cell types in complex tissues.
Ultimately, the power of single-cell transcriptomics, and its asso-
ciated computational methods, will continue to progress as a key
component in generating new hypotheses about the organization,
regulation, and function of complex tissues. Despite all of these
developments, the underlying approach to scRNA-seq data analysis
for cell type identification still rests on a basic framework, described
below.

2 Methods

The following workflow (overview in Fig. 2) describes basic
computational steps for identifying molecularly distinct cell types
from single-nucleus (sn) RNA-seq data. It does not, however, cover
any of the steps relating to the preprocessing, alignment, and
quantification of raw sequencing data, which have been described
elsewhere [36, 37]. We use the R programming language (https://
www.r-project.org), which is a versatile platform for many kinds of
genomic analyses, and benefits from the availability of a wide array
of statistical and bioinformatic libraries. Over the years, a number of
software packages have been developed for single-cell transcriptomic
analysis (https://github.com/seandavi/awesome-single-cell),
and many of them are available through Bioconductor
(https://www.bioconductor.org/), an open-source archive of bioinformatic
R libraries with an active user community. This workflow predominantly
uses the Seurat package [38], an actively maintained set of
tools for scRNA-seq analysis (https://satijalab.org/seurat/).

[Figure 2: schematic of the computational workflow. Panels: counts table (transcripts by cells); normalization (distribution of expression across cells); selection of variable genes (variance of expression against mean expression, one point per gene); dimensionality reduction and clustering (Feature Axis 1 against Feature Axis 2, one point per cell); differential gene expression (expression by cluster).]

Fig. 2 Standard computational workflow for identifying transcriptomic types from single-cell RNA-sequencing
data, as described in detail in the text, starting from counts tables through to cluster assignment and
differential gene expression identification. Although not every computational approach incorporates all of
these steps in this order, most involve variations on this set of procedures

Here, we analyze single-nucleus (sn)RNA-seq data covering
human frontal cortex (FC), visual cortex (VC), and cerebellum
(CB) [39]. While the main text mostly refers to single “cells,” the
methods below and the general concepts are equally applicable to
snRNA-seq data, and also to other single-cell level measurements,
such as epigenomic and protein (e.g., mass cytometry) data
(although statistical considerations differ).
Our workflow begins with the gene expression matrix X, whose
rows correspond to genes, and whose columns represent single
cells. Entries of the matrix represent digital counts of reads or
transcripts, depending on the scRNA-seq protocol that generated
the data. Although our presentation employs a specific example
dataset, the steps below can be carried out with any gene expression
matrix (Fig. 2). The following steps are implemented in RStudio, a
free and open source integrated development environment (IDE)
for R.

2.1 Preprocessing: Read the Count Matrix and Setup the Seurat Object

1. First, we load necessary packages. utilities.R is a script that
contains some custom functions written by the authors for this
workflow (https://github.com/karthikshekhar/CellTypeMIMB).

2. We then read in the individual data matrices corresponding to
the FC, VC, and CB, downloaded from the NCBI Gene Expression
Omnibus submission of [39] (accession GSE97942). These are stored in a locally accessible
folder named Data. Since the majority of the entries of these
expression matrices are “0,” we immediately convert them to
the sparse matrix format using the Matrix package to reduce
the memory footprint.
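
The memory saving from a sparse representation can be seen on a small simulated matrix. This is a sketch only: the simulated counts stand in for the downloaded matrices, and it assumes the Matrix package named above is installed.

```r
library(Matrix)

set.seed(1)
n_genes <- 2000; n_cells <- 500
# Simulated counts in which most entries are zero, as in scRNA-seq data.
dense <- matrix(rpois(n_genes * n_cells, lambda = 0.05),
                nrow = n_genes, ncol = n_cells,
                dimnames = list(paste0("gene", seq_len(n_genes)),
                                paste0("cell", seq_len(n_cells))))

sparse <- Matrix(dense, sparse = TRUE)   # stores only the nonzero entries

frac_zero   <- mean(dense == 0)                      # fraction of zero entries
size_dense  <- as.numeric(object.size(dense))        # bytes
size_sparse <- as.numeric(object.size(sparse))       # bytes
```

The benefit grows with the fraction of zeros; for a matrix with few zeros the sparse format can be larger than the dense one.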

3. Next, we add a “tissue of origin” tag to the three tissue matrices
and bind them into a single matrix. The rows of the final matrix
correspond to the union of the genes in each of the three tissue
matrices. Genes that are missing in any matrix are assumed to
not be expressed. We use the rBind.fill function in the
Matrix.utils package to fill in the missing genes.
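
The union-and-fill logic of this step can be sketched in base R as follows. This is not the rBind.fill call used in the workflow; the helper merge_by_gene_union and the toy matrices are assumptions that only illustrate padding missing genes with zeros and tagging cells by tissue of origin.

```r
# Hypothetical helper: combine genes x cells matrices over the union of their genes.
merge_by_gene_union <- function(mats) {
  all_genes <- Reduce(union, lapply(mats, rownames))
  padded <- lapply(names(mats), function(tissue) {
    m <- mats[[tissue]]
    out <- matrix(0, nrow = length(all_genes), ncol = ncol(m),
                  dimnames = list(all_genes,
                                  paste(tissue, colnames(m), sep = "_")))
    out[rownames(m), ] <- m            # genes absent from m stay at zero
    out
  })
  do.call(cbind, padded)
}

# Toy matrices with partially overlapping gene sets (made-up counts).
fc <- matrix(c(1, 0, 2, 3), nrow = 2,
             dimnames = list(c("GAD1", "SST"), c("c1", "c2")))
cb <- matrix(c(5, 4), nrow = 2,
             dimnames = list(c("SST", "RELN"), "c1"))

combined <- merge_by_gene_union(list(FrontalCortex = fc, Cerebellum = cb))
```

Prefixing each cell name with its tissue keeps cell identifiers unique after binding and preserves the “tissue of origin” tag for later steps.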

4. Next, we initialize an S4 R object of the class Seurat. The
various downstream computations will be performed on this
object.

snd@raw.data is a slot in the Seurat object that stores the
original gene expression matrix. We can inspect the first 10 rows
(genes) and the first 10 columns (cells).

5. We then check the dimensions of the normalized expression
matrix and the number of cells from each sample. Here,
snd@ident stores the sample IDs of the cells, which correspond
to their brain region of origin.

[Figure 3: violin plots of nGene (left) and nUMI (right) per cell; x-axis: Identity (Cerebellum, FrontalCortex, VisualCortex).]

Fig. 3 Sample-wise (x-axis) distribution of the number of genes per cell (Left, y-axis) and number of UMIs (i.e.,
transcripts) per cell (Right, y-axis) depicted as violin plots. Dots represent individual cells

6. Thus, we have 23,413 genes and 34,234 cells, with 19,368
cells from the VC, 10,319 cells from the FC, and 4637 cells
from the CB. We can visualize common metrics
such as number of genes per cell (nGene) and number of
transcripts/UMIs per cell (nUMI) as “violin plots” (a fancier
version of the good old “box and whisker” plots) using the
Seurat plotting command VlnPlot (see Fig. 3).
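
The two metrics shown in Fig. 3 are simple column summaries of the count matrix. The base R sketch below computes them on a simulated matrix; the variable names nGene and nUMI mirror the labels used in the text, while the data and tissue labels are made up for illustration.

```r
set.seed(1)
n_genes <- 300
tissue <- rep(c("Cerebellum", "FrontalCortex", "VisualCortex"),
              times = c(20, 30, 50))
# Simulated genes x cells UMI counts (hypothetical data).
X <- matrix(rpois(n_genes * length(tissue), lambda = 0.3),
            nrow = n_genes,
            dimnames = list(paste0("gene", seq_len(n_genes)),
                            paste0("cell", seq_along(tissue))))

matrix_dims      <- dim(X)          # number of genes, number of cells
cells_per_tissue <- table(tissue)   # number of cells from each sample

nUMI  <- colSums(X)                 # total transcripts detected in each cell
nGene <- colSums(X > 0)             # genes with at least one transcript in each cell

# Per-sample medians; a violin plot displays the full distributions instead.
nGene_median <- tapply(nGene, tissue, median)
nUMI_median  <- tapply(nUMI, tissue, median)
```

Because every detected gene contributes at least one transcript, nGene can never exceed nUMI for a given cell, which is a quick sanity check on both vectors.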

2.2 Normalize the Data

1. Because of technical differences in cell-lysis and mRNA capture
efficiency, the count vectors of two equivalent cells can differ in
the total number of transcripts/UMIs across all genes. This
makes it necessary to normalize the data first to attenuate these
differences, which is carried out in two steps.
(a) We rescale the counts in every cell to sum to a constant
value. Here, we choose the median of the total transcripts
per cell as the scaling factor. This is often referred to as
“library-size normalization.”
(b) We apply a logarithmic transformation to the scaled
expression values such that $E \rightarrow \log(E + 1)$ (the addition
of 1 is to ensure that zeros map to zero values). This
transformation has two desirable properties:
• It shrinks values such that the data are more uniformly
spread across its range of values, which is especially
beneficial if there are outliers.
• Since $\log(A) - \log(B) = \log(A/B)$, it converts distances
along a gene-axis to log-fold change values. This has
the consequence that equal fold-differences in expression
across cells/samples are treated equally, irrespective of the
absolute expression value of the gene. This might be
especially desirable for lowly expressed genes such as
transcription factors.
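
The two normalization steps above can be sketched in a few lines of base R. The helper name normalize_log and the toy counts are assumptions, the natural logarithm is assumed for the transformation, and cells with zero total counts are assumed to have been removed beforehand.

```r
# Hypothetical helper: median library-size scaling followed by log(E + 1).
normalize_log <- function(X) {
  lib_size <- colSums(X)                       # total transcripts per cell
  scale_to <- median(lib_size)                 # common target total
  E <- sweep(X, 2, lib_size, "/") * scale_to   # every cell now sums to scale_to
  log(E + 1)                                   # zeros remain zero
}

# Toy genes x cells counts: cell2 is cell1 captured at twice the efficiency.
X <- cbind(cell1 = c(1, 0, 9, 10),
           cell2 = c(2, 0, 18, 20),
           cell3 = c(0, 5, 5, 30))
rownames(X) <- paste0("gene", 1:4)

N <- normalize_log(X)
scaled_totals <- colSums(exp(N) - 1)           # undo the log to check the totals
```

After normalization, cell1 and cell2 have identical profiles, which is the intended effect: a difference that is purely due to capture efficiency is removed.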

2.3 Feature Selection: Identify Highly Variable Genes

1. It is common in analysis of high-dimensional data to choose
features that are likely to be informative over features that
represent statistical noise, a step known as “Feature Selection.”
In scRNA-seq data, this is accomplished by choosing genes that
are “highly variable” under the assumption that variability in
most genes does not represent meaningful biology. An additional
challenge is that the level of variability in a gene is related
to its mean expression (a phenomenon known as heteroscedasticity),
which has to be explicitly accounted for. We perform
variable gene selection using a recently published Poisson-Gamma
mixture model [40], which was demonstrated to accurately
capture the statistical properties of UMI-based scRNA-seq
data (Fig. 4).

Thus, we find 1307 variable genes in the data. We refer the
reader to other variable gene selection methods, e.g., M3Drop
[41], mean-CV regression [42], or Seurat’s built-in function
FindVariableGenes.
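
To illustrate why the dependence of variability on mean expression must be accounted for, the sketch below compares each gene's observed CV with the CV expected from pure Poisson sampling at the same mean, 1/sqrt(mean), and flags genes with a large excess. This is a deliberately simplified stand-in, not the Poisson-Gamma mixture model of [40] used in the workflow; the simulated data, the helper name excess_cv, and the cutoff of 2 are assumptions.

```r
# Hypothetical helper: ratio of observed CV to the Poisson expectation, per gene.
excess_cv <- function(X) {
  mu <- rowMeans(X)
  cv <- apply(X, 1, sd) / mu
  cv_poisson <- 1 / sqrt(mu)       # for Poisson counts the variance equals the mean
  cv / cv_poisson
}

set.seed(1)
n_genes <- 400; n_cells <- 300
mu <- c(rep(15, 10), runif(n_genes - 10, min = 0.5, max = 20))
X <- t(sapply(mu, function(m) rpois(n_cells, m)))   # genes x cells
# Make the first 10 genes bimodal: silent in half of the cells.
X[1:10, 1:(n_cells / 2)] <- 0
rownames(X) <- paste0("gene", seq_len(n_genes))

ratio <- excess_cv(X)
variable_genes <- names(ratio)[ratio > 2]
```

Ranking genes by raw CV alone would favor lowly expressed genes, whose sampling noise is large; dividing by the mean-dependent expectation removes that bias in this toy setting.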

[Figure 4: scatter plot of genes with CV (counts) on the y-axis (tick labels 2, 5, 10, 20, 50); points are labeled with gene names such as NPY, CLDN5, and STAB1.]
NCALD
ARNT2 OLFM1
UNC5D
SRRM4ADGRL2
YWHAB
MYRIP
RP11−384F7.2
NELL2
PPIG
SETBP1
LL22NC03−2H8.5
EDIL3
RAPGEF5
AK5
HDAC9
GPR158
NPTN
GABRB1
PPM1L
SCN1A
MARCH1
FHIT
PEBP1
ZEB2
PRKCA
CADM1 VSNL1
TMEM132D
KCNH1
RBFOX3
TMEM132B
HIVEP2XKR4
KIRREL3
SPTBN4
PDE1A
LARGE
DNM1
KCNQ3
ARHGAP32
LRRTM3
PAK3
CDK14
CHD5 QKI
GABRG3SPARCL1
FAM19A2
LINGO2
KAZN
GRIA3 CHN1
KCTD16
SLC2A13
PRNP
GNAS
ARHGAP26
PSAP
ENO2
BTBD9
CAMK1D
DGKI
CACNA2D1
WBSCR17
CAMK2B
CELF4
ADCY2
THRB
HECW1
FBXW7
ASTN2
RASAL2
GABRB3
SPOCK1
CACNB2
CLSTN1
PRKCE
CKB
PREPL
STXBP5
NTRK3
PTPRN2
DOCK4 DGKB
HS6ST3
CDH12
LDB2
ZBTB20
NPAS3
ATP1B1
CDH18 ASIC2
LRRC4C
PCDH7
CACNA2D3
GRM7
DAB1
IQCJ−SCHIP1
CHRM3
CNKSR2
STXBP1
ZNF385B
ROCK2
AFF3
CACNB4
TANC2
HCN1
CNTN4
SYN2
ENOX1
TMEM178B
PLEKHA5
PDE10A
PTPRG
SORBS1
WWOXFRMPD4
ATP8A2
CADPS
MAN1A2
FMN2 KCND2
PRKCB
ERC1
SIPA1L1
ANKRD26
BDP1 CNTNAP5
PEX5L
MTUS2
PDE4B
MSRA
GRIN2B GRIK2
LRFN5
KHDRBS2
ARPP21
SLC8A1
ATP2B1
CACNA1C
SNAP91
NEBLGRIN2A
GABBR2
RTN3ERC2
ATP2B2
GABRB2R3HDM1
RTN1
GRM5
NCAM2
MACROD2
CACNA1B
DTNA
KIF5CDSCAM
RIMS1
SLC4A10
DCLK1
HSP90AB1
SMYD3 TENM2
KCNQ5
GRIA4
SNAP25
CALM1
CSMD3 ATRNL1
PPP3CA
MEF2C
NRCAM
SCN2A
CAMTA1
NTRK2
TRIM9 PHACTR1
KALRN
MDGA2
CACNA1AOXR1
SLC24A2
FRMD5 CTNNA2
RYR2 SNTG1
RALYL
NKAIN2
CELF2
AGBL4
ADGRL3
TCF4
KCNMA1
DMD
RTN4
LRRC7
GRIA2
PCLO
MAP2 MAP1B
PPFIA2
HSP90AA1
AUTS2
MYT1L
MAPK10 PDE4D
FGF12
CCSER1
AHI1
DPP6 PLCB1
NBEA
CNTN1
RBM25FRMD4A
JMJD1CNEGR1
FTXRGS7
OPCML
RIMS2
GPM6A
NLGN1
PPP2R2B
NTMFGF14
IL1RAPL1
CTNND2
FAM155A
ANK2
DOCK3
MAGI2
1

1e−03 1e−02 1e−01 1e+00 1e+01

Mean Counts

Fig. 4 Mean (x-axis) vs. Coefficient of variation (CV, y-axis) of genes (dots). Two
null-models of mean-CV relationship—Poisson (dashed-red line) or the Poisson-
Gamma mixture model—are also plotted

2.4 Z-Score the Data and Remove Unwanted Sources of Variation Using Linear Regression

1. Variation in scRNA-seq data that is relevant to cell identity can be masked by
many unwanted sources of variation. A common challenge is batch effects, which can
be reflected in both transcriptomic differences and cell-type compositional
differences between equivalent experimental batches. As mentioned earlier,
variations in lysis efficiency, mRNA capture, and amplification can result in
substantial differences between the transcriptomes of equivalent cells. There can
be additional sources of variation resulting from biological processes such as
cell cycle, response to dissociation, stress, and apoptosis that might dominate the
measured transcriptomic state of the cell. Correcting for such effects continues to
be an active area of research, and many sophisticated approaches have been recently
introduced [24, 25], but a comprehensive overview is beyond our scope. Here, for
demonstrative purposes, we remove variation in gene expression that is highly
correlated with library size nUMI. Seurat performs a linear fit to the expression
level of every gene using nUMI as a predictor, and returns the residuals as the
“corrected” expression values. Next, the expression values are z-scored, or
standardized, along every gene,

$$E_{ij} \leftarrow \frac{E_{ij} - \bar{E}_i}{\sigma_i}$$

Here $E_{ij}$ is the corrected expression value of gene $i$ in cell $j$, and
$\bar{E}_i$ and $\sigma_i$ are the mean and the standard deviation of gene $i$'s
expression across all cells. The transformed expression values of every gene now
have zero mean and unit standard deviation across cells.
2. Removing the effects of nUMI and z-scoring are performed
together using Seurat’s function ScaleData, which then
stores the transformed gene expression values in the slot
snd@scale.data.
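
The two operations described above (regressing out nUMI, then z-scoring) can be sketched for a single gene as follows. This is a minimal illustration in plain Python, not Seurat's implementation: the function names `regress_out` and `z_score`, the toy numbers, and the use of the sample standard deviation are all assumptions made for the example.

```python
from statistics import mean, stdev

def regress_out(expression, covariate):
    """Residuals of the least-squares fit expression ~ intercept + slope * covariate."""
    x_bar, y_bar = mean(covariate), mean(expression)
    sxx = sum((x - x_bar) ** 2 for x in covariate)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(covariate, expression))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = y_bar - slope * x_bar
    return [y - (intercept + slope * x) for x, y in zip(covariate, expression)]

def z_score(values):
    """Standardize to zero mean and unit (sample) standard deviation."""
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]

# Toy example: one gene measured in four cells (hypothetical numbers).
n_umi = [1000.0, 2000.0, 3000.0, 4000.0]
gene_expression = [1.0, 2.1, 2.9, 4.2]

residuals = regress_out(gene_expression, n_umi)  # "corrected" expression values
scaled = z_score(residuals)                      # z-scored expression values
```

In a real analysis the same two steps would be repeated for every gene, which is what ScaleData does internally according to the description above.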

2.5 The Curse of Dimensionality and Dimensionality Reduction Using PCA

1. Analysis of high-dimensional scRNA-seq data presents numerous challenges, which
are often collectively termed the “curse of dimensionality” (COD) [43]. For data
that is high-dimensional and noisy, samples from the same and different cell
subpopulations (i.e., cell types) can appear equidistant from each other, making it
difficult to distinguish variability within types and variability across types.
Usually COD is dealt with in two ways (Fig. 2). First, the number of features/genes
can be filtered to only include highly variable genes, as described in the previous
section. Second, the data can be projected to a lower dimensional subspace using an
algorithm that preserves some important properties of the original data, such as
gene-gene correlations; the choice of properties to preserve is usually informed by
the underlying biological question of interest.
There are multiple approaches to dimensionality reduction, such as principal
component analysis (PCA) [44], independent component analysis (ICA) [45],
non-negative matrix factorization (NMF) [46], autoencoders, and diffusion maps (DM)
[47]. Dimensionality reduction results in the compression of raw gene expression
data into fewer “composite” variables, each of which is a complex combination of
the original gene features, which may be linear or nonlinear depending on the
algorithm. These composite features encode the modular structure of the
transcriptome alluded to earlier, and may be interpreted as gene modules or
“metagenes,” with each metagene being defined by a weighted combination of genes.
Each cell’s observed expression profile can then be interpreted as an aggregate of
each metagene weighted by its activity in that particular cell. A situation where
multiple metagenes are active in some cells but not others can result in a
separation of cells in gene expression space. In this picture, every cell type is a
well-separated cloud of points in the reduced dimensional space, whose location is
defined by the activity patterns of gene expression modules.
2. Here, we perform Principal Component Analysis (PCA), a classical and extremely
versatile dimensionality reduction method that identifies a linear subspace that
most accurately captures the variance in the data [44]. Each individual axis of
this subspace, known as a principal vector (PV), is a linear combination of the
original genes, and the projections of the original data onto these axes are known
as principal components (PCs).

Each PV is defined by a set of weights corresponding to the genes (known as the
“loadings”). A PV is said to be “driven” by genes with high-magnitude weights
(positive or negative), and any two PVs represent independent, orthogonal
directions. The printed output of RunPCA lists the genes with the highest-magnitude
loadings (positive and negative) along the top PVs.
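
To make the relationship between loadings (PVs) and scores (PCs) concrete, here is a small sketch that finds the first principal vector of a toy cells-by-genes matrix by power iteration on the covariance matrix. It is an illustration only and is not how RunPCA is implemented; the helper names, the toy matrix, and the fixed iteration count are assumptions.

```python
import math

def center_columns(rows):
    """Subtract each gene's (column's) mean across cells (rows)."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]

def covariance(rows):
    """Gene-by-gene sample covariance of an already centered matrix."""
    n, p = len(rows), len(rows[0])
    return [[sum(r[a] * r[b] for r in rows) / (n - 1) for b in range(p)]
            for a in range(p)]

def top_principal_vector(cov, iterations=200):
    """Power iteration: converges to the unit eigenvector with the largest eigenvalue."""
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Toy data: 4 cells x 3 genes; genes 0 and 1 co-vary, gene 2 is nearly flat.
cells = [[2.0, 1.9, 0.1], [1.0, 1.1, -0.1], [-1.0, -0.9, 0.1], [-2.0, -2.1, -0.1]]

centered = center_columns(cells)
pv1 = top_principal_vector(covariance(centered))                         # loadings
pc1_scores = [sum(x * w for x, w in zip(row, pv1)) for row in centered]  # per-cell scores
```

In this toy case the first PV is “driven” by genes 0 and 1, which carry almost all of the weight, while the cells at the two ends of the expression gradient receive PC1 scores of opposite sign.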

2.6 Visualize PCA Output

1. Seurat allows multiple ways to visualize the PCA output, and these are useful to
gain biological intuition. VizPCA shows the genes with the highest absolute
loadings along any number of user-specified PVs (Fig. 5).

2. PCAPlot allows plotting the cells in a reduced dimensional space of PCs, and can
often highlight subpopulation structure (Fig. 6).

3. Figures 5 and 6 show that the cells with high values of PC1 are
oligodendrocytes, characterized by the high loadings of characteristic genes such
as Proteolipid Protein 1 (PLP1) and Myelin Basic Protein (MBP) (Fig. 5). Next,
PCHeatmap allows for easy visualization of the gene expression variation along each
PC in the data, and can be particularly useful when trying to decide which PCs to
include for further downstream analyses (Fig. 7). Both cells and genes are ordered
according to their PCA scores and loadings, respectively, along each PC. Setting
cells.use to a number plots the “extreme” cells on both ends of the spectrum. For
example, here we see that cells with low values of PC3 are astrocytes,
characterized by the expression of the transporters SLC1A2 and SLC1A3.

Fig. 5 Genes (y-axis) with the highest negative and positive loadings (x-axis) for the top two principal
components, PC1 and PC2

Fig. 6 Scatter plot showing the scores of individual cells (points) along the top two principal components, PC1
and PC2. Legend: Cerebellum, FrontalCortex, VisualCortex

Fig. 7 Heatmaps showing expression of top 15 positive and negative loading genes in individual cells along
PC1–PC12
Fig. 8 Standard deviation (y-axis) accounted for by the top 50 PCs (x-axis), used to
approximately identify the number of significant PCs based on the presence of an
“elbow.” Approximately 25 PCs are chosen for downstream analysis

While there are many formal methods to determine the number of statistically
significant PCs (e.g., see Shekhar et al., Cell, 2016 [13]), a particularly easy
and popular method is to examine the successive reduction in variance captured by
increasing PCs, and identify an “elbow” where inclusion of PCs is of marginal
utility (this is often called the “noise floor”). We do this using the Seurat
function PCElbowPlot (Fig. 8).
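
The elbow is normally judged by eye. For readers who want a reproducible starting point, one simple heuristic (an assumption of this sketch, not something PCElbowPlot computes) is to pick the PC that lies farthest below the straight line joining the first and last points of the curve. The function name `elbow_index` and the standard deviations below are hypothetical.

```python
def elbow_index(std_devs):
    """1-based index of the PC farthest below the chord joining the first and last points.

    Assumes at least two values and a decreasing curve (first value > last value).
    """
    n = len(std_devs)
    top, bottom = std_devs[0], std_devs[-1]
    best_pc, best_gap = 1, 0.0
    for i, sd in enumerate(std_devs):
        x = i / (n - 1)                     # PC rank rescaled to [0, 1]
        y = (sd - bottom) / (top - bottom)  # standard deviation rescaled to [0, 1]
        gap = 1.0 - x - y                   # distance below the chord, up to a constant
        if gap > best_gap:
            best_pc, best_gap = i + 1, gap
    return best_pc

# Hypothetical standard deviations for the first ten PCs.
pc_std_devs = [5.0, 3.5, 2.5, 1.8, 1.5, 1.45, 1.4, 1.38, 1.35, 1.33]
chosen = elbow_index(pc_std_devs)
```

Such a heuristic tends to be conservative; it is reasonable to retain somewhat more PCs than it suggests and to check that downstream clusters are stable to the choice.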

2.7 Identify Clusters

1. We choose 25 PCs based on Fig. 8. Every cell in the data is thus reduced from
~23,000 genes to 25 PCs (a ~1000-fold reduction in dimensionality!). Next, we
determine subpopulations in this data by graph-based clustering [48], using the
Seurat FindClusters function. Graph clustering has been widely used in recent
scRNA-seq papers and has many desirable properties compared to other methods such
as k-means clustering, hierarchical clustering, and density-based clustering. Here,
we first build a k-nearest neighbor graph on the data, connecting each cell to its
k nearest neighbor cells based on transcriptional similarity. The nearest neighbors
are determined based on proximity in PC space using a Euclidean distance metric.
Next, similar to the strategy employed in Levine et al. [49] and Shekhar et al.
[13], the graph edge weights are refined based on the Jaccard similarity metric,
which removes spurious edges between clusters. FindClusters implements an algorithm
that determines clusters that maximize a mathematical function known as the
“modularity” on the Jaccard-weighted k-nearest neighbor graph. The function
contains a resolution parameter that tunes the granularity of the clustering, with
increased values leading to a greater number of clusters. We use a value of 1, but
variations in this parameter need to be tested to check for robustness.
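
The graph construction described in this step can be sketched on a handful of points. The code below builds a k-nearest neighbor graph with Euclidean distances and then weights each edge by the Jaccard overlap of the two cells' neighborhoods. It illustrates the idea only and omits the modularity optimization; counting each cell as a member of its own neighborhood, the function names, and the toy coordinates are assumptions.

```python
import math

def knn(points, k):
    """For each point, the set of indices of its k nearest neighbors (Euclidean)."""
    neighbors = []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbors.append({j for _, j in dists[:k]})
    return neighbors

def jaccard_edges(neighbors):
    """Weight every k-NN edge (i, j) by |N(i) & N(j)| / |N(i) | N(j)|."""
    edges = {}
    for i, nbrs in enumerate(neighbors):
        for j in nbrs:
            a = neighbors[i] | {i}  # neighborhood of i, including i itself
            b = neighbors[j] | {j}  # neighborhood of j, including j itself
            edges[(min(i, j), max(i, j))] = len(a & b) / len(a | b)
    return edges

# Toy cells in a 2-d "PC space": two tight groups plus one cell (index 3) in between.
points = [(0, 0), (0, 1), (1, 0), (4, 5), (10, 10), (10, 11), (11, 10)]
edges = jaccard_edges(knn(points, k=2))
```

Edges inside a tight group receive a weight of 1.0, whereas the edges attaching the in-between cell to its nearest group receive only 0.5, which is how the Jaccard refinement down-weights connections that are not supported by shared neighbors.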

2. Thus, we obtain 26 clusters in the data. We can visualize the clusters using
t-distributed stochastic neighbor embedding (t-SNE) [50], a 2-d embedding of the
cells that preserves local distances (Fig. 9). The cells are colored according to
the cluster labels.

Fig. 9 Visualization of Lake et al. data using t-distributed stochastic neighbor embedding (t-SNE; axes
tSNE_1 and tSNE_2). Cells are colored according to their cluster membership
Fig. 10 Dendrogram showing transcriptional relationships between clusters (nodes)

3. Next, we arrange the clusters on a dendrogram based on the similarity of their
average transcriptomes using Seurat’s BuildClusterTree function (Fig. 10). This
helps in visualizing relationships between clusters, and also reveals subgroups of
related clusters.

4. At this point, it is important to note that whether or not we have found the
“optimal” number of clusters is open to interpretation. Importantly, the criterion
of what constitutes a cell type cluster must be independent of the algorithm’s
objective. It could be data driven, such as a minimum number of differentially
expressed genes enriched in that cluster compared to the rest, or based on the
ability of the algorithm to recover certain well-known types (i.e., ground truth).
Often, however, the validation of scRNA-seq clusters requires aligning molecular
identity to other cell modalities such as morphology, location, and function
through experimental techniques.

Here we adopt a data-driven criterion to assess cluster stability. Briefly,
Seurat’s AssessNode function trains a classifier on each binary node of the
dendrogram, and calculates the classification error for left/right clusters. We can
use this information to collapse any node that exhibits >15% classification error.

2.8 Compare Clusters with Original Cell Type Labels from Lake et al. [39]

1. Here, we see that the maximum “out of bag classification error” (OOBE) is less
than our threshold. Thus, we retain all 26 clusters. Next, we compare our
clustering result to the cluster labels published in Lake et al. [39], which
nominated 33 clusters in their analysis. While we have fewer clusters, it is
instructive to examine how they compare to Lake et al.’s results. We first read in
their cluster labels.

In these labels, Ast refers to astrocytes, End refers to endothelial cells, Ex1
refers to Excitatory neuron group 1, and so on. To compare our cluster labels
against Lake et al.’s, we plot a “confusion matrix,” where each row corresponds to
one of Lake et al.’s 33 clusters, while each column corresponds to one of our
clusters (Fig. 11). The matrix is row-normalized to depict how each cluster of Lake
et al. distributes across our clusters.
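
A row-normalized confusion matrix of this kind is straightforward to compute from two label vectors. The sketch below is a plain-Python illustration; the function name and the eight toy cells (with labels borrowed from the text purely as examples) are assumptions, not the chapter's data.

```python
from collections import Counter, defaultdict

def row_normalized_confusion(known, predicted):
    """Percentage of cells of each known label that fall into each predicted cluster."""
    counts = defaultdict(Counter)
    for known_label, cluster in zip(known, predicted):
        counts[known_label][cluster] += 1
    matrix = {}
    for known_label, row in counts.items():
        total = sum(row.values())
        matrix[known_label] = {cluster: 100.0 * n / total for cluster, n in row.items()}
    return matrix

# Toy labels for eight cells (hypothetical assignments).
known = ["Oli", "Oli", "Oli", "Oli", "Mic", "Mic", "Purk1", "Purk2"]
predicted = [25, 25, 25, 18, 21, 21, 1, 1]
confusion = row_normalized_confusion(known, predicted)
```

Because each row is normalized separately, two known labels that both map entirely to one cluster (Purk1 and Purk2 in the toy data) each show 100% in that column, which is the pattern to look for when several published types merge into a single cluster.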
Fig. 11 Transcriptional correspondence between clusters determined from the Lake et al. dataset in this study
and in the original study. Circles depict the percentage of cells of a given Lake et al. cluster (row, “Known”)
assigned to a cluster determined above (column, “Predicted”)

2. Encouragingly, we see that although our analysis workflow was agnostic to the
results reported in the original paper, many of our clusters exhibit a 1:1
correspondence with the clusters of Lake et al. For example, Cluster 21 (n = 624)
corresponds to Microglia (Mic), while Cluster 25 (n = 4058) corresponds to
Oligodendrocytes (Oli). In cases where multiple Lake et al. clusters map to one of
our clusters, these are related. For example, Purkinje cell clusters Purk1 and
Purk2 map to Cluster 1 (n = 977), while inhibitory neurons In6a and In6b map to
Cluster 6 (n = 1462). It is likely that a second round of iterative clustering
might be necessary to resolve differences between closely related types such as
In6a and In6b. While all of this is encouraging, we also note some discrepancies:
Clusters 2 (n = 390), 24 (n = 139), and 26 (n = 30) do not really correspond to any
of the Lake et al. clusters, while Clusters 18 (n = 2061) and 19 (n = 2877) appear
to map nonspecifically to many Lake et al. clusters.

3. We can visualize the cluster composition of each of the three brain regions
(Fig. 12). As can be seen, Clusters 1–4 and 26, which include Purkinje neurons and
cerebellar granule cells, are exclusive to the CB sample, while the majority of the
remaining clusters are derived from the FC and VC samples.

Fig. 12 Cluster composition of each brain region. Circles indicate the proportion of each cluster (columns)
within each region (row). Each row sums to 1

2.9 Identify Cluster-Specific Differentially Expressed Genes

1. Next, we find cluster-specific markers by performing a differential expression
(DE) analysis between each cluster and the rest using Seurat’s FindMarkers
function. FindMarkers supports the use of multiple statistical approaches for DE
(specified in the test.use parameter, see Seurat documentation). Here, we use the
Student’s t-test, as it is computationally efficient. However, we note that there
are many limitations to using the t-test for single-cell RNA-seq data, particularly
its inability to account for zero inflation. Readers should explore other methods
such as MAST and tweeDEseq supported by Seurat (for a comprehensive review on DE
methods, see Soneson and Robinson [51]).

2. The output is a data.frame object summarizing the cluster-specific markers.
Here, each row is a gene that is enriched in a cluster indicated in the column
cluster. pct.1 is the proportion of cells in the cluster that express this marker,
while pct.2 is the proportion of cells in the background that express this marker.
We can then examine the markers for any given cluster.
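
To clarify what the pct.1 and pct.2 columns summarize, the sketch below computes them for one gene together with a two-sample t statistic. It is a plain-Python illustration rather than FindMarkers itself: treating any value above zero as “expressed,” using the unequal-variance (Welch) form of the t statistic, and the toy values are all assumptions of this example.

```python
import math
from statistics import mean, variance

def marker_summary(in_cluster, background):
    """pct.1, pct.2 and a Welch t statistic for one gene (cluster vs. all other cells)."""
    pct1 = sum(v > 0 for v in in_cluster) / len(in_cluster)
    pct2 = sum(v > 0 for v in background) / len(background)
    standard_error = math.sqrt(variance(in_cluster) / len(in_cluster)
                               + variance(background) / len(background))
    t_stat = (mean(in_cluster) - mean(background)) / standard_error
    return {"pct.1": pct1, "pct.2": pct2, "t": t_stat}

# Toy expression values of one gene (hypothetical numbers).
cluster_cells = [2.0, 3.0, 0.0, 4.0]
other_cells = [0.0, 0.0, 1.0, 0.0, 0.0]
summary = marker_summary(cluster_cells, other_cells)
```

A good marker combines a high pct.1 with a low pct.2; the many zeros in the background group are also a reminder of why the t-test's assumptions are questionable for such data.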

3. As expected, the top two genes are PLP1 (Proteolipid Protein 1) and MOBP
(Myelin-Associated Oligodendrocyte Basic Protein), classical markers of
Oligodendrocytes. Next, we examine cluster 12 (an excitatory neuronal cluster),
which corresponds to Ex6a, and is marked by multiple genes including HTR2C and
NPSR1-AS1 (Fig. 13).

Fig. 13 Cluster-wise expression (y-axis) across cluster identities 1–26 (x-axis) of a pan-excitatory neuronal
marker SLC17A7 (top) and markers specific to cluster 12, HTR2C (middle) and NPSR1-AS1 (bottom)
Examining the identity of these clusters in detail is beyond the scope of this
workflow. Readers are encouraged to dig deeper, and to test variations in the
methods outlined above. We end by demonstrating two common approaches to interpret
results: (a) examining gene-set enrichments, and (b) aligning clusters to
alternative datasets.

2.10 Examine Clusters for Enrichment of Biological Processes

1. After identifying markers, we can evaluate whether cluster-specific genes are
enriched for any Gene Ontology (GO), Disease Ontology (DO), or Disease Gene Network
(DGN) gene lists or categories. Each of these calls has multiple parameters,
reflecting stringency of statistical overlap, but they are useful tools to evaluate
clusters for functional or disease relevance.

2. As an example, view the GO, DO, and DGN categories enriched for genes
distinguishing cluster 1 (Purkinje Neurons). Note that the categories are arranged
by adjusted p-value, and many are not significantly enriched.
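
The statistic behind an “adjusted p-value” for gene-set overlap can be illustrated with a small sketch. A common choice, assumed here and not confirmed by the text for the specific tools used, is a one-sided hypergeometric test for the overlap followed by Benjamini-Hochberg adjustment across categories. The function names and numbers are hypothetical.

```python
from math import comb

def hypergeom_pvalue(overlap, set_size, n_markers, n_universe):
    """P(X >= overlap) when n_markers genes are drawn from n_universe genes,
    of which set_size belong to the category."""
    upper = min(set_size, n_markers)
    total = comb(n_universe, n_markers)
    tail = sum(comb(set_size, k) * comb(n_universe - set_size, n_markers - k)
               for k in range(overlap, upper + 1))
    return tail / total

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy case: 3 of 4 marker genes fall in a 5-gene category, in a 20-gene universe.
p_overlap = hypergeom_pvalue(overlap=3, set_size=5, n_markers=4, n_universe=20)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03])
```

Because many categories are tested at once, a raw p-value that looks small can become unremarkable after adjustment, which is consistent with the observation that many listed categories are not significantly enriched.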

2.11 Compare with Mouse Cortical Cell Types

1. One of the many challenges in cell type classification studies is that of
aligning clusters across different datasets, which might include different batches,
different conditions (e.g., normal vs. disease), or even different species. Here we
attempt to map clusters from a dataset of visual cortex (VC) neurons isolated and
profiled from adult mouse using the Smart-seq method [15] to our Human CB, VC, and
FC clusters using a supervised learning algorithm. We use a multiclass
classification approach described previously [13].

First, we read in the mouse VC data, comprising 1679 cells, and create a Seurat S4
object. To match the gene IDs to the Human data, we capitalize all gene names; note
that a more exact, albeit lengthier, approach would be to match genes based on an
appropriate orthology database. We also read in the cluster assignments of each
cell. Tasic et al. identified 49 transcriptomic types, comprising 23 inhibitory, 19
excitatory, and 7 non-neuronal types [15]. We next select features to train our
classifier. We identify variable genes using Seurat’s FindVariableGenes function
(Fig. 14), which is more appropriate for Smart-seq data [40]. After expanding the
set of variable genes in the snRNA-seq data using NB.var.genes, we compute the
common variable genes to train a multi-class classifier.
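
The gene-matching shortcut described above (capitalizing mouse symbols and intersecting the two variable-gene sets) can be sketched in a few lines. The helper name and the short gene lists are illustrative assumptions, not the variable genes actually selected in this workflow.

```python
def to_human_style(symbols):
    """Approximate mouse-to-human symbol matching by upper-casing (no orthology lookup)."""
    return [s.upper() for s in symbols]

# Hypothetical variable-gene lists for the two datasets.
mouse_variable = to_human_style(["Plp1", "Mbp", "Gad1", "Slc17a7", "Car2"])
human_variable = ["PLP1", "MBP", "GAD1", "SLC17A7", "HTR2C"]

# Features shared by both datasets, used to train the classifier.
common_variable = sorted(set(mouse_variable) & set(human_variable))
```

Capitalization silently drops genes whose symbols differ between species (for example, mouse Car2 corresponds to human CA2), which is one reason an orthology table gives a more exact match.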
[Figure 14 plot residue: gene symbols labeling individual points, y-axis “Dispersion”; point labels omitted]
POU3F2
SRGAP2
SLC48A1
IL6ST
ZFP407
ARHGAP25
CD83 DRAM2
HDAC9
ABCD2
NSDHL
ZFP868
TANK STT3A
CSMD3
PAPSS1
PNRC2
PIGS
GPD2
MT_BC055066
ASRGL1
CHST10
FAM102B
ERCC5
MCM4
CML1
PIK3R5
FGF1 LPP ILK DEGS1
KLF10
BEND4
NXPH1
GRIK2
TOX
CNTN4
KCNIP4
IGSF3 LIFR
WDR1
LMO3 ECE1
EPHA4
SIRPA
NECAB1
SCG3 PDP1
SV2B
NTRK2 STX1A
D0H4S114
RASGRP1 DBC1HMGCS1CHGB
SCHIP1SPNB2
SNCA
PRKCB
PTPN13
ADAMTS4LAMA4
RIPK1
ZFP820
CEBPA
ACAT3
GMFG
PPFIBP2GHR
KIF18A
CCL9
HTRA3
GM216
WIPF1
PCDHGA2
MC4R
PNPLA2
CACNG4
MVP
H2−DMA
ELOVL2
UNC5B
PRKD3
PIK3R6
2810459M11RIK
B3GNT5
RPS6KA1
COL27A1
PARP3
GPAM
KANK3
NBEAL2
APOBEC3
UACA
CCDC18
GSTM7
NT5DC1
EHD4
STARD8
FUT10
INSC
CHD7
D2HGDH
HSPBAP1
ARHGAP18
TBC1D4
LDLR
SCARA3
SOX7
RAPGEF3
KCNK13
CYP27A1
CENPN
CREB5
ITGA8
NEDD9
NAALAD2
NHLRC1
FPGS
GLIPR1
CHRNA3
UHRF1
CRLF3
LSP1
CYP39A1
KLHL25
IGFBP7
SGMS2
TMEM106A
TJAP1 PTPRO
PION
TPRNANKRD28
ST18
LLGL1
HTR2C
CNN3
RNF103
EFNA1
PLIN2
LIMD1
UHRF1BP1
SRD5A1
COL4A2
AA387883
GPC3
BST2
SLFN2
SLN
ST3GAL4
MRE11A CRH
CASP2
KIF23
SUSD5
CHAT
PCDHGA5
OLIG2
VANGL1
ARHGAP6 SMOX
PDS5A
DCDC2A
1700110I01RIK
SOCS3
NXN
2510039O18RIK
CENPO
PCDHGB7
RAMP2
INHBA
GEM
EBI3
9530051G07RIK
5031410I06RIK
DPY19L4
ZFP438
ARHGAP19BCAR3
LMO2
LECT1
SLC12A4
ALDH4A1
TOP3A
PRTG
SMOC1
MKI67
COL23A1 TOM1L1
FANCC
UTP14B
CAPRIN2
TRIM45
SERPINA3N
VAV3
AMOTL2
TPK1
HSPA1A
VEGFC
SRGAP1
ST6GALNAC3
FTSJD1
E030010A14RIK
ST6GAL1
RAD54B NRM
ZFP566
ARRDC1
ADAMTS6
E130114P18RIK AS3MT
TXLNA
ZFP52
PLIN3
SULF1
RECQL5
ADAM17
CHD1L
EDNRA
CDC42EP1
PCDHA7
PCDHGA4
LTBR
LEF1 ESYT2
RCBTB2
EYA1
CPNE3
PAN2KLHL4
PPP1R15A
PEX5L
SIAH1A
PCDHB9
PLXDC2
STK40
PGGT1B
NAAA
SGPP2
MAP3K3
TNFRSF11B CLIC4
GPT
FIBIN
LTBP4
CARHSP1
GCNT4
FGD5
CASP7
SCARF1
NPHP3
KLK8
ZFHX4CTH
TPM2
GULP1
LAYN
EGFL6
SCD1
AW551984
PRR5L
DDR2
CCNB2
SHCBP1
PAPSS2
DCLRE1B
ARPC1B
FZD6 PCCA
RHOQ
GNA13
DHX33PTPRG
PPP2R1BHIP1
RPTOR
CXCL14
DYNLT1C
PRKD1 JUN
QRFPR
NEDD1
TRIM59
TMEM37
CASP1ABL1
LSS
POLK
ZFP934
SLC9A9
PIGN
6430706D22RIK
TICAM1 WFS1
TIFA
BBS10
LRIG1
S1PR3
MT_AF093677
FZD1
SLC9A3R2
CSRP1
SERPINH1
COL9A3
TMEM125
MBNL3
BCL2L11
SLC7A10
ZFP709
AGTURGCP
FAT4
PELI2
B3GNTL1HIC2
ABCD4
NEK3
SLC22A4 ZFP873
TNFAIP8L3
FKBP7
HN1L
PCDHB19
2610034M16RIK WSCD1
FOXO1
SVIL
C030030A07RIK
POLR3E
CTSK
HMOX1
CNTD1
CALCA
CSF1
FAM69C
CCBL2
ECM1 AK3
ZFP110
AMOTL1
CD2AP
ZFP51
TMEM168
GEMIN5
HCRTR2
GALNT3
KIRREL2
GSTK1
MOB3B
LOC545261
TNPO1
SPATA17
SERHL
CYP4F13
SYCP2
SCARB2
MAFK
SEC16B
ARHGEF10
MUSTN1 HSDL2
CAST ENPP4
ZFP143
ANO4
DECR1
CYBASC3 UST
PRCP CLDN12
ADAMTS3
PCDHB20
D3ERTD751E
1500010J02RIK
XLR
CYP20A1
ABTB2 ZFP866
TNR NELL1
NBEAL1
FAM60A
ERLIN1
HEATR1
9930111J21RIK1
ZFP870
ATP7A
FAT1
GPT2
PAQR8 RNF138
TTI1
IL18
GM5141
GM5595
CDH4
PRR14
TSPAN14
NET1
GCNT2
SLC44A2
SLC19A1
TMEM39A
GM10635
PLEKHF1
C2CD2
PCDHGA7
RFX1
NR1H3
D930014E17RIK
KLF15
MUM1L1
SREBF1 VWC2L
GUSB
PCDHGA11
MID1IP1
CISH EGFR
BMPER
ZFP429
ABHD1
ZBTB39
NINJ2MCM7
MCM5
PCSK7 CD84
DDHD1 TRIB2
ECH1
EXOC6B
ZFP433
SLC5A6
ZFP87
SEMA3E
GABRB1
USP24
PCDH18
PID1 IGF1R
TMEM209
TMEM132C
SLC20A2 VPS13B
ADI1
IDUA
ST8SIA4 ZMYM6
2700078E11RIK
ARHGEF6
PFKFB4
SOX21
IL6RA NQO2
SLC35F5
MT_BC081549
MED12
CCDC114
EZH2
ENGASE B4GALT4
NDST2
NKD1 FNBP1
APH1B
CDK19
AKT2
SH3GLB1
SMAD4 CRK
INTS4
RALGPS2
ZFP788 DGKB
SFXN5
HEXA
NPY1R
JMJD1C
ZFP292
LGALS8
ZFP719
ZFP948
SORCS1
ZFP192 TLE4 STT3B
ULK2
COL19A1
ZFP57
REV3L
ZFP869
GFPT2
CDH9
MYNN
MAFB
LRCH3
GIT2 GM11549
FOXN3
CCDC50
LIMCH1
SNX1
P4HB
BPGM
FGF13
NR1D2
ZFP280D
4933407H18RIK
ZFP740
SLC46A1 CHD9
ZFP617
KCNV1 PDE4B
SEMA3A
DUSP14
CTNNB1
CALU
ALCAM
SLC2A13
CACNA2D1
TPP1
ZMAT4 HNRNPF
KPNA2
ADK LRRTM4
PTCHD1
RPN2
MAL2KCNA1
FOXP1
TPM3 PRDX1
SPOCK3
PTPRD
PAM
SLC39A10
CTSL
NELL2 SYNE1
CPLX1SEPT11
THRB
NRXN3 ARPP19
GPM6B SPARCL1
INPP5FSERPINI1
PCDHGA6
CYP4F16
OSTN
PARP12
ETAA1
CHRNA2
PRR18
RDH5 SYNE2
MYLIP
SNX29
SEMA5B
IFIT1
SMC4
TCEANC
FMO5
SOSTDC1
PCDHGA8
GM9079
CCL28
CLEC4A2
CNKSR3
DHX58
ZBTB5
FXYD1
AGTRAP
SLC25A29
HUS1
RMST
PPAPDC1A
CD55
ADAMTS10
ACY3
DERA
CACHD1 ZFP810
PLAGL1
NUF2
GM16523
GM4636
TTC7IRF9
1700066M21RIK
OAF
3110062M04RIK MIOS
FARP1
TM6SF1
MIF4GD
ANGPTL1
IFI35
TSGA14
SLC41A1
HIST2H2BE
1700028K03RIK
RHPN2
DNAHC7B
SARDH
PARP9ELK3
BC046404
ZFYVE21
FBLN5
PXMP2 ELMOD2
RASSF2
PGF
ZFP273
HYAL2
TJP2
IDH2
MSH6
PCDHGB4
ZFP493
CCND1
LRP10
ANKRD6
PABPC4L
TTC32 RPRM
TM7SF3
C1RL PHC2
PLXNA3
PLEKHG3
C330027C09RIK
AMIGO3
B3GNT2
TUSC5
LGALS4
SNX9
GM10052
MAP3K14
MCM6
2900005J15RIK
SERTAD2
CCDC46 ZFP606
OSBPL7
GM16515
CCDC129
CCNYL1
CLEC1A
4930503L19RIK
MOCS1
STK32A
CLN5ZFP518A
CSGALNACT1
WNT7A
AGPHD1
PHXR4
DOCK6
GSTT2
4930523C07RIK
GM10220
ADAM4
LAMA3
TESK2
RNPEPL1 PUS10
STAT2
CBFB
PDE1C
MEPCE
RDH12
SRPK3
PDLIM4
RHBDD1
SLC16A6 ZFP119B
MTRR
FOXP2
LRRC16APCDH19
LMF2
ATP8A2
PVRL4
HTR4
POMT1
GFRA1
TSC22D4
ZSCAN22
4930422G04RIK
DENND2A
FBXW17
HMGB2
MYT1
SMURF1
ZFP710
PIK3CD
NECAP2
MKS1
MARVELD2 SGK3
NMRAL1
RETSATF8
PCDHAC1
MEX3C
CBL
MAP6D1
DDX21
PIGG
AW146154
ELOVL7
ZBTB20
FAM101B
PLEKHF2
AGTR2 NARG2
EFSEDEM2
SERP1
CD52
FBLN2
SOX10
RBP1
CACNG5
CDCA4
TEP1
SFXN2
1700029I01RIK
FAM82A1
NADSYN1
RRAS2
NTF3
2610019F03RIK
NTS
DAP
DLL1
PHLDB1
SFRP1 CRAT
ALG9
CTBS
AA987161
NKIRAS2
CDC42EP2
TRP53I11
ABHD2
ITGB4 AI854517
FMNL2
GM10509
CD46
ANKRD13A
PTH2R
WT1
TGFB2
AEBP1
PLA2G16
SORCS2
4732471D19RIK
1190002H23RIK
ZFP82
CR2
SMAD7
PCDHB16
LGI4
FAM122B
OBSL1
ABCC10
RASL11A
ONECUT2
CLDN1
TENC1
PTGES
RCOR1
KLF11
SLC30A1
ZFP775
SRPX2
ZFP595
CD24A
UNC119B
D330045A20RIK
COL24A1
TOPBP1
RTEL1
TBC1D8B
ZFP28
NR2E1
CDK18
ARID5A
TFCP2L1
HECA
ENO3
PBX3
STC1
TRIM68
BAG3
PTGDR
UNC13C
RECK
MYD88
9030617O03RIK
CDCA8
BC025920
EML3
CCDC89
KCTD14 FCHO2
F630111L10RIK
TUBB2B
BC017643
UBOX5
RCBTB1
NXPH2
ABHD14B LXN
TET1
ZFP14
IER3
ENTPD5
ZFP808
B3GALT5 P4HA2
GOLPH3LZFP26
CD300A
PCDHB17
4930452B06RIK
A130040M12RIK
4632415K11RIK
NFKBIZ
C030019I05RIK SLC22A23
SLC1A4
SIN3A
MAS1
CPNE1
PPFIA4
PRAMEF8
PIGZ
CYP7B1
HSPB8RGS9
KDELC1
GEMIN4
ASCC2
PREX1
PCDHGA1
PFKFB1 PPP2R3A
CTNS
HPS3
TBC1D5
G2E3
ATR
TMPO
GLRA2
RASSF5
1110020G09RIK
PCDHGA10ATXN1LIER5
CCNG2 MDM2
SP1
INTS2
2610008E11RIK
RDM1
PARS2
PER2PXDN
TMTC4
MAP4K5
6720401G13RIK
SCN7A STAT1
FKBP15 GPR83
H2−D1
INPP5B
GNG5ACO1
MITF
SGPL1
EIF2C4
MYCL1
ZFP189
ITGA9USP38
OSBPL11 MLH3
BCL2L13
SLC30A7
BCOR
HIST1H2BC
TSPAN15
MSL3
ZFP280C
KNDC1
4931406P16RIK
DPY19L3
9130023H24RIK
TGDS
PCDHA9
SNX22
CTDSP1 ACPL2
NDOR1
STX2
CCDC77SYT6
TANC1
1110031I02RIK
ZFP472
TMEM200A
6530401N04RIK
EFNB3
WDR89
ZFP85−RS1
GLP2R
9430015G10RIK
ZFP963
RNF182
PCDHA6
PCDHA4
CRISPLD1
GPR82
PCDHA1 ACOX3
NUP188
PCDHA10 CASC3
ZFP772
OXTR
SEPHS1
SNX18BICD2
SLC25A37
NUMA1
1810041L15RIK
SCGN
HIST1H1C
SLC26A11 SMAD5
PTCHD2
CTDSP2
WDR5BSTXBP4
MED12L
GALNT14
CTNND1
QSER1
EGR2
USP35
ZFP182
AU041133
XRCC3
2310042E22RIK
GLB1LZFP623
TRP53BP2
1700084C01RIK
ZBTB1
PXN
SLC35C1
FBXO30
GM3414 TRPS1
SUV39H1
0610007P08RIKMAD1L1
MAST4
RELA
DDX19B
PAQR7
SLC7A1
TSTD2
2810055G20RIK
SLC35B2
INSR
ZFP7
ACVR1C
MEI1 GGH
8430427H17RIK
ARHGAP42 KRBA1
ZDHHC15
GATAD2A
ELF1
6030446N20RIKTBCCD1
WDR20A
1700049G17RIK
ZFP811FGD4
MEIS2 RRM1
ZFP157
PANK3
CLN8
TTC30B
PDLIM1MTHFD1
IBTK
PCDHGC4
TSPAN9
CCDC80
TMEM229B
LAG3
HTR3BPOLA1
B130024G19RIK
VAMP5 POLR1A
MYBBP1A
DCC
ZFP748
ABHD5
ALG12
ZBED4 ATM
TRAF7
LIMS1
PPPDE2
RYBP CNTN6
ADD3
ASPH
PARM1
DOCK9
CNTNAP4
CLN3
AGPS
CORO1B
SOX5
ZFP84
PCSK5
ST3GAL6
FGFR1
LRRN1
PHLPP1
PPPDE1
PLA2G15
CIDEB OMA1
1810014B01RIK IPO8
KIRREL3
OSTF1
DLG5 RB1 BTBD11
SULF2
EXPH5
PIGQ
GRIK1
MARS
SLMAP
IFI27L1
WNK1
RAP1B
ZFP35
ZFP874A
PDE5A
ATXN7 BAZ2B
GLCE
FAM102A
RBPJ RDX
HDAC6
PHF8
CHUKSESN1
SLC39A9
INTS12
RFX7
TIPARP
INTS3
PHRF1
PDK1
ORMDL1
9830147E19RIK
CHRNA4
PLD5 GCDH 2900092D14RIK
PLCXD3
ARID2
HDAC1
TBRG3
2610203C20RIK
ALDH3A2
PHKB
DHX40
GM3002
PDGFB P4HA1
GNG4 PICALM
B3GALT2
TNKS2
RAB8BFMNL1
TRPM7SHROOM2
TCP11L2
LRRN3
ADCYAP1R1
NTRK3
C
TAF9B
CSNK1G1
TXNDC16
ABCD3
ADRBK2
ATP2B4 RBBP9
SLC35A4
DSCR3
GPKOWBIRC2
ZFP763 CDR2L
FAM196A
SLC3A2
OSBPL9
ARHGAP12
ECI2 PAK7
SEZ6 ADORA1
GUCY1A3
NEFM AP3M1
HADHB
AW549877 FLRT3
NTNAP2 NDN
NRSN2
CAPZA1
SC5D MACF1
BAIAP2
ABAT
BCAT1
TGOLN1
ARHGAP5 EDIL3
CAMK2N1
PLK2
PTP4A1
OLFM1 PGM2L1
TMEM130TACC1
ASNS
LRRC4C
FAM81A
MKL2
DIP2B
CHL1
SKIL DCLK1 STMN1
TUBB4A
MEF2C
FREM1
NUSAP1
SLC18A3
AKR1B8AFAP1L1
EFEMP2
A530054K11RIK
TMEM86A
AQP5MKL1
PLCD1
GDAP10
ERBB3CCNF
ASH2L
COL6A1
POLM
ZDHHC7
PDIA5
NUBP1
ARHGAP17
CECR5
FAM40B
STK32B
FAM78A
DPYDSAMD10
KBTBD8
PCDHGB6
MYO1D
PAK4
ADAMTSL1
TNFSF12
STARD13
CD44
SNCAIP
RHOC
HOMEZ
RNF152
ZFP276
GM410
KRCC1
ABCC9
KCNRG MMAB
FAP
SLC35F2
2310042D19RIKTTN
FDXR
MOB3A
ZFP458
FNDC3B
CDC42EP4
OPRK1
KCTD5
WNT3
CML3
HS1BP3
PTPRK
TOR3A
PCDHB8
RXRG
CASP6
PCDHB10
SFRP4
WNT5B
EFCAB7
POLHISPD
POMT2
FAM84B
CCDC62
ZDHHC14
PDZRN4
LY6G6E
D19ERTD386E
SLC25A35
2810046L04RIK
OLFML2B
ADAM1A
RRM2
PCDHGB2
SLC10A3
1700003M07RIK
SPSB4
9930012K11RIK
COL14A1
CCDC3ID4
SKP2
RXFP1
TRPC6
ZFP54
RNASET2A
MARCH8
PAFAH2
AP1S2
XKR8
ZFP647
OPLAH
WNT7B
ARHGAP24
TMEM179B
DHDH
ZFP677
SLC16A13
4933426M11RIK
APOOL
PCDHB3
MBOAT2
THNSL2
STARD5
FAM117A
IGHMBP2
C030018K13RIK
FAM111A
DDAH2
CENPJ
BMP2K
CCNJL
GDF10
CEP152
SHMT1PCNT
CHST15
GPR133
GDPD5
CENPL
FEM1C
EHHADH
CAPN6
DDR1
CLGN
PKNOX1
4930534B04RIK
RNF130
NUDT7
SLC35A3
SHC1
RTTN
CTDP1
ZFP389
DNASE1L3
ZFP964RPAP1
MICAL1
ZFP551
ATAD5
MASTL
PCDHB15
MT_AK140300
1700029J07RIK
CERCAM
TMEM194B
JMJD7
LRRC47
NOS1
ENDOU
F2R
ZFAND2A
CREG1
GM10336
FOXRED2
ITGB8
RASD1
MIA1
GLT25D2
DNAHC5
COL8A1 PML
CAD
TRIM67
KIF6
HRH2
HAUS5
RNF17
THAP6
TMCO7
4933424B01RIK
PCDHGB8
D630037F22RIK
SLC4A2
EFHC2
D14ERTD668E VILL
STAT4
KIF13B
SRD5A3
BAAT1
CTGF
MAPK7
SLC35E4
SHROOM1
SMARCD2
PLAUR
MORN1
GPR137B
BRWD3
SLC9A8
ZFP455
CD74GSE1
DEM1
OBFC2A
PUS7
NCAPD2
2810002D19RIK
CIB1
4831440E17RIK
ARHGEF1
ZFP568
GOLIM4
ZFP708
PCDHB13
BUB1B
PCDHB6
RAD54L2
ZC3H7B
PRMT6
FADS2
GRID2
NHLRC3STAR COBRA1
DUSP18
LYSMD3
ATP11C
CDK20
DNM3OS
GAB2
3110056O03RIK
RABEP2
ZFP335TAPBP
STX17
AGA
CORT
USP54
HDAC10
GRAMD1A
HEPACAM ZFP108
TMTC3
FOSB
GTF3C5
THBS2
ZFP287
ZXDB
ISYNA1
ZFP641
2610020C07RIK
NAGLU
TNFRSF22
GPR115 PWP2
PCDHGA12
TACR1
MTTP
WDR41
KIF22
OPTN
THSD7B
FAM164C
RSC1A1
PCDHA11 JRK
ABCB10 MKNK1
TOE1
ACBD3
GABPA
CTNNAL1
ZFP354A
D230025D16RIK
PHKG1
TPH2
RXFP3
IQUB
E130304F04RIK
TSPAN11
ANGPT2
ELMO3
SH3PXD2B
WDR25
CALR3 RCN1
H2−M3PRKX
USP21
POU2F1
PPM1D
TPST2
3110001I22RIK
RGNEF
PCDHA8
GNB1L
SIX4
PALB2
EPB4.1L4ASLAIN1
MCAT
ADIPOR2
MOSPD2
DHX38
KCTD9
OSGIN2
EPHB3
WIF1
NOC3L
CD6
D16ERTD472E
SYT9
QTRTD1
KDM1B
SLC6A9
GLIS2
ATG4D
DDX59 POLL
NAIP5
MRAP2
MFAP2
TTC21B
D930015E06RIK
ALMS1
4933407K13RIK
C330006K01RIK
ADRA2B POT1A
NPAS3
TJP1
HRH1
SOX6
RGS19
CPNE9
ATF7IP
4930471M23RIK
ANKIB1
STARD3
NAT10
LARS2 MLST8
ZFP961
FEZ2
LYPD1
PLCL1
GXYLT1
DQX1
ALDH9A1
PSTPIP2
IL11RA1
FAM59A
4833442J19RIK
SLC6A11 USP1
XRCC2
PCDHB12
SRRD
ANKRD44
SCEL
1700013N18RIK
ABCB1BFAM160B1
ATPAF1
ZSWIM5
PCDHA12
PPP1R18
5930403L14RIK
GAS1 ASB1 CDH7
LYPD6BINO80D
USP40
BMPR2
MTFR1
NMI
MDK
LAMB1 SLC41A2
MCCC2
ZFP58
B630005N14RIK
TGM2
C030037D09RIK
GM20257
TRAF4
GM12942 PRPF4
JAK2
AU040320
ZFP592
PGBD1
ZSWIM3 TRAPPC8
RG9MTD2 PKD1
ANKRD27
SAAL1
ZFP142
C1QL3
PRKDC
H2−T9
STK3
SALL2
CEP192
1810026J23RIK
DHX35 XRCC6
MAP3K2
ADCK3
ZFP119A
PPP1R26 CCNL1
GPR176
PRRG3
PPOX
D15ERTD621E
TCF12
SHISA6
E4F1
NSUN3
VWA1TRIM26
TMPRSS7
PEO1
KCTD21
6720489N17RIK
D11WSU47E RNF180
SUOX
S100A13
C330018D20RIK
TOR1AIP1
NRARP
PROKR2 DCP1A
MANBA
CCDC111
RAD51
ZFPM2
1300010F03RIKATG14
PIR
SERTAD4
NAPEPLD
2010015L04RIK
SKIV2L
VGLL4
ZFP94
PCDHGA3
GPR64
SCML2
IGSF10
TNNI3K
DHRS11
SLC13A5
CAR12
FHOD1 UBE2L6
MET
PTPDC1
PUS7L
ZDHHC24
TMEM164
LENG1MTSS1L
SLC52A2
PCDHB4
SLC4A11 STEAP2
FRMD6
EML6
DUSP10
TPMT
LRRC3
TBC1D22A
TEAD1
ZBTB24
LRSAM1
A930017M01RIK
CCDC14 DLC1FLNB
MUM1
PRPS1L3
PCK2
DHCR7
OPHN1
ZFP882
LRP12
HERC4
ZFP958
NUP50
ARFGEF2
LACE1
SH3KBP1
ERO1LB
CCDC88C ACACA
RBMS1 AKAP2
CSNK1G3
ZXDC
POLI
XPO5
GALNT7 UBN1
PCX
LNPEP
NF2
SSH3
ZFP760 NKAP
TM9SF1
GLT25D1
TUBGCP3
PTPN9
KAT2B
XRCC4
MVK
TRMT2B
AASDH
ZFP128
SH3GL1
FASTKD5
BNIP2
CDC25B
IFNAR2
ZFP280B
MTMR12
SLC31A2
FAM135A
NPL
DFFB
PCDHB14
LRRC14
PCDHB5
DCTD
ARHGEF15
HBEGF
CTNNA3
KLHL1
RPS6KA6
PABPC5
BC051142
XLR3A
STARD4 ODZ1
TYW1 M
POLD3
SIPA1L2
LIG1
2810021J22RIK TAB1
KDM1A
ANKRD49
DSTYK
RBM33
SERPINB6A
KCNH5 RBP4
LSG1
MED7
CYB5D2 SLK
ZFP180
FAM53C
BBS9
ASCC3
KBTBD7
1810030O07RIK
E130307A14RIK
MOB3C TPBG
CDC42SE1
ERCC8
ZFP804B
TAF5L FRS2
NNT
ZFP30
SLC16A7
ORMDL2
CARTPT
DEPTOR
ZHX3
DNAHC1
ATP10D MSH2
RNF219
PPP1R15B
ECI1
PARD3
ECHDC2
MYH8IRAK2
SLC43A2
OSBPL5
DDX4 MBTPS2
D9ERTD402E
D430042O09RIKIGFN1
ADCK2
ITPKC
BMP1
DTX3L
ASXL3
MFSD1
PACSIN3MVD TSHZ1
ZNFX1
ZFP759 PDP2
GABRA5
ACTL6A
CSK
DSE
BCAT2
ATAD2B
WDPCP
ANKRD39
C1QTNF1
4930525G20RIK
CDHR1 HR RLF
TMEM177
CLCN7
ZFP939
GTPBP10
ZBTB10 IPMK
UBE2Z
CTHRC1
BAG2
SENP3
TAB3
FAM26E GRIK3
RDH11
COG3
RAI14
ADCY5
STK38
2610301G19RIK
MAGEL2 ZCCHC6
GALNTL4
C730002L08RIK
KLC4
CCDC93
CDO1
POLD2
SENP1 TAZ
VHL NR4A1
ZZZ3
ZIK1
ZSCAN21
NADK
DCBLD2
ZFP738
ASXL1
NEBL
MMS19
RNFT1
UBR7
FTSJ3
CEP97
MGAT4A
2010106G01RIK
PGPEP1
ARRDC2
PLOD3
EPB4.1L5
1110028C15RIK STAT3
OTUD7B
ZFP101DTX4
PCDHAC2
GPR107
ZFP661
TMEM186
ARFIP1
GANC
LEKR1
BCAR1
SARS2
NAGPA
ANTXR1
PCDHGB1RET
SPSB1
D830031N03RIK AACS
1810031K17RIK
LGI3
CYTH3
XPCFGGY
MUS81CEBPG
COX4NB
APAF1
CBLB
HS6ST2
PPP2R5A
CCDC137
LRWD1
DACH2 ZFP81
HERC6
PEX12 LPL
KCTD4
PIP4K2A
CNTN2 SHISA9
ZBTB41
STYX
NUP93
KAT2AMID2
TRHDE
LATS1
WDR24
ZFP945
DHX32
LINGO2
IRGM1
NOL9
KLHL20
FAM199X
ARSB
DDX26B
POGK
ZFP712
UNC13B
SDC3
3110052M02RIK
SMAD3
GPR137B−PS CPD
FPGT
ZMPSTE24
CBLN4
HMGXB3
LCORL
PNPLA3
BMPR1A
ZKSCAN14
C230081A13RIK
SLC22A5 ICK
LZIC
MYO1B
CCDC21
ZFP236
PCDHA3 PTPRM
ACADM
ZFP846
GNAI3
COX6A2
ST7L
ZFP39
SETDB1
SPICE1
EIF2C1 RASA2
NUP153
FAM172A
RHOUTRUB2
THADA
NPY5R
AKR1B10
SSTR1
LAMC1
ZFP53 TTC17
TEX10
WWP2
GNAI2
NUCB2
DERL1
MAN1C1
ZFP346
RBAK
CCDC123
SLC25A13
OSBPL10 ACADLSTX16
ATP11ATRMT2A YIPF6
ERLIN2
INO80C MED23
FOXN2
CTDSPL
DGCR14
SLC33A1
RNF19A
CPNE2
SERPINB8 TRPV2
MFSD8
PUS3
CRYL1F AH
ATMIN
CSNK1D
MCC
YOD1
ZFP867
TDRD7 EFTUD1
CPNE7
PIGC
TTLL5
TTF1
GM13298
MAN2C1NRIP1
ZFP111
BC057079
ZFP93
ZXDA
ACY1LIG3
HAUS3
TTC37
CEP68
HDAC8
5031439G07RIK
KCTD10
RHOBTB3
PRKG2
DGKA USP42
ZFX
NCAPD3
TRIT1 SKI
EDC4
CCNL2
LPIN1
PIGH
CEP120
MYST3
RNF139
CWF19L2
ADNP2
TRIM12C
NPNTBMI1
BC017647
ANKS1 NEURL1B
FADS1
TAF1A
MAGEE2
GTF2E1
C630043F03RIK KCNS3
LETM1
ABCC1ATP9B
RAB23
RFWD3
PIAS3
ACVR1
ALG8
DBT
ABCA8B
GPR125
FBN1 ANEA
ARL4C
ZDHHC21
DHRS7
DSCAM
VPS8
YIPF2
GLT8D2
PIBF1
FBXL4
LRP6
AVL9
XPO6
MTX3
TMEM62 CHSY1
SIDT1
PDZD2
CTBP2
KIF21B
GGCX
ANKRD10
FLII
ZFP758
ZFP940
FAM189A1
GCN1L1
EXT2 VRK1
DGCR8 RPA1
ZFP37
WDR13
NPTX1
TMOD3
ZFP715
XRN1 CDH10
TMEM43
NR2C2
LPIN2
ZFP563
REEP3
GRM8 CPNE4
HLTF
GTF3C2
KCNK2
MGAT4C SGCE
RNF2
ODZ3
GLCCI1
TSPAN6
TBC1D19
CEP350
2510009E07RIK
SEPN1
DNM2
KLF3 CTSF
MTMR1
RAB36 ROCK1
ZFP2
NETO2
MPP5 CDC23
CAMK1G
INPP5K
IKBKG
ITCH
HSD17B4
USPL1
PLEKHA1
SIAE
AP3B1
CHMP1B STAM2
BC018507
PPFIA1
PIGM
SEC24B
PLEKHA8
CD302
PIGY
SMC2 RNF185
MUDENG
CHD4
ZFP160
CWC22 LDB2
FHL1
SYBU
ARID4B
DENND5A
TEF
RND3
ZMYM3
D6WSU116E
KLHL23
CDH11
ZC3H7A
ZFP281
6330578E17RIK
GM5643
KCND3 FOSL2
SCN3B
FBXO38 NEFL
PHIP
USP4
PCDH17
UNC5D
TMEM132B
ITGAV
2310046O06RIK
PDZRN3
HIBCH
PTPN12
2010321M09RIK
FRAS1 DPY19L1
CACNA2D3
STARD7
GABRA2
ANKRD29 MAN1A2
GPR158
PTK2
ABLIM1 TMED10
CORO1A
9530091C08RIK
SUN1
FAT3
AGPAT4
PRKCC
GOSR2
IPCEF1
GANAB
G6PDX
CREB1
USP28 GM3893
CASD1
WDR6
GNPDA1 IARS
NR3C1
USP29
NAMPT
OGFRL1 CLK4
EPHA7
ZDHHC20 CPLX2
CAR10
NR1D1
SLC1A1 NAV1BTBD10
OAT
COPB1
BTBD3
IPW
IP6K1
RAB27B
MKLN1 LIN7A
ITM2B
HERPUD1
RBM39 GAA
SLC20A1
PSD3
AHCYL1
ZFP871
CLDN25
MT_AK153847 GAP43
HPCA
PCDH9
PCSK2
MT_AK140457
CACNG3
PRMT2
SLC8A1
ATP8A1
GPRASP2
NCOA4
HPCAL4
SESTD1 LGI1 OGT
CLK1
GRIA3
RASGRF1
NRAS
SCN1A
NAP1L5 VMP1
AKAP5 ATP2B1
MT_AK159262
HSPA5 PAK1H3F3B
MT_AK165865
FAM175A
GPR1
PCDHB11
SERTAD3
MAN2B2
GM5475
DNAHC9
KCNK10
PPAP2C
PLEKHA7UNK
DENND1A
GM16973
SLC10A7
2700081O15RIK
GTPBP1
MAML1
HCRTR1TOPORS
ZFP212
FHL4
APEX2
GSG1L
ACCS
EVI2B
SOX8
INTS5
PTPN21
SYDE1HPS4
TXNDC5
BC026590
VPS37C
ZFP454
COL4A3BP
4930453N24RIK
MFSD7B
FAM35A
HECTD2
LGR5
N4BP2 CRKL
PRDM4
MAML3
MATN2
E130309D14RIK
DSC2
AKAP10
PLD2TELO2
SMCR8
SLC16A14
RPS6KA4
VAC14
MED9
MRO
MCM8
RFX5
MTHFR
NGB
IFIH1
MTMR14
8430406I07RIK
1190007F08RIK
SDR42E1
CCNK
KLHL14
MUTYH
ZFP773
FAM184B
ISOC1
SEC24D
VMAC
AP1G2
BTN2A2
IFI30
NFXL1
NINJ1
THBS3
DDX31
GSTM6
ZKSCAN4
KLHL15
RFC5
PHACTR4
TMEM8
NMBR
GM10767
LEPREL4
ZFP652
SHKBP1
DHFR
ANGEL1
MOXD1
ZSCAN12
MAMDC2
1300014I06RIK CDH6
2700049A03RIK
AMBRA1
PVRL3
CRYZ
A930013F10RIK
RACGAP1
TMEM163
IL17RA BMP3
PTPRT
PIP5K1B
ATL3
WRN
CHRM2ELL
TMEM159
PRKAA1
PKD2
PARP11
KDM6A
SYTL5
DPP4
TWSG1
LRRC1
FAM160A1SLC35D1
RSPRY1
FAM76B
SPECC1L
RPUSD4
DIS3
TMEM63C
PPP1R12C
GPC6
MTR DNAJC3
NRF1
MON2
TK2
FAM118A
LMF1
CDKAL1
FNDC4
AGFG1
FAIM
NUDT22
ZSCAN18
PPP4R1L−PS
1700020O03RIK
SMURF2
4933400F21RIK
GPR126
ARL13B
GAL3ST1
GINS3
E230029C05RIK
4430402I18RIK FLCN
CHST12
FAM116A
NUP62
NR3C2
3110057O12RIK
AHCTF1
EIF3F
PMS1
CNTLN
NEK6
ZKSCAN3
SPAG5
SLC19A2
ZBTB34ANO10
TPPP3
IPP
RFFL
NR2C1
GALNS
STK24
MYO9B
1810010H24RIK
NKAIN1
GRTP1
RAB34
GPR151YES1
SMUG1
HHATL
TTC30A1
SLAIN2
1110008L16RIK
PYCR2
SLC16A2
DZIP1L
ZCWPW1
D19WSU162E
LRRC51
RAVER2
TMEM88
PTBP1
DUSP16
CREB3L1
MYLK3
TACR3
GRIN3A
CNTNAP5B
IL15RA
PGAM2
NFATC3
TRIM65
ST6GALNAC4
NR6A1
PCDHB22
BC065397
SESN3
EPHX3
BC027231
OTOF
LRRC61
FAM188B
MAP3K6
ZFP862
NUDT15
HIST2H2AA1
ARHGAP27
5930438M14
SOWAHC
TRABD
9130019O22RIK
ZBTB7B
CC2D1B
MMGT2 ACP6
KCNS2 GNG7
ESRRG
RINT1
ZFP426
ZFP956
LSM14A
MBTPS1
ZFP654
MMACHC
1110038D17RIK
ZFP770
CMYA5
MAPKAP1
ANKRD23 IFT122
PARVA
MBTD1
GTPBP2
CD2BP2
OTUB2
SLC25A40
BOC
CC2D2A
NEK4
ABCB6
METTL15
GM13251
PDE7A PNPLA6
ERI1
PIK3R4
LANCL3
1700052N19RIK
D1ERTD622E ACTN4
HADHA
LMBR1
PCYT1A
FAM3A
MFAP3L
RTN4IP1
SLC35A2
CLPB
HOMER2
ULK1
ZFP532
ALKBH1
TMEM64
FIG4
PRKAR2A
ZFP583
PDCD7
SLC25A24 PHKA2
LAMC2
1110034G24RIK
DLX1AS
PAK2
LRCH1
TNS3 SIL1
VEZF1
KDM2A
EPC2
WDR11
POLR3A
MLXIP
SPTY2D1
GEMIN8
NAA40ZFP97
RAMP3
PKN2
RSBN1
CSAD
2210404J11RIK
RFXANK
ZFP442
KCTD13
ZFP629
ERRFI1FASN
USP3
LRBA
DPAGT1
CABLES2
IQCG
KLHL26
ANKRD50
ZFP449
CRYBG3
ZFP334
RARB
CCDC157
DCAF15
PRDM5
HMBOX1
FUT11 CLP1
TTI2
SLC7A5
HS3ST4TLN2
EIF2C3
NT5C2
1300001I01RIK
CASP8AP2
CPSF3L
DPH1
UBAC2
NUP133
CDKL4
TMEM161A
6430584L05TSPYL3
RCAN3
AGPAT3 SNN
SUPT7L
1600012H06RIK
RYR3
EEPD1
ARNTL
GAS8
CD320 NAA30
ZDBF2
USP53
DEAF1
ASAP2
LCA5
E2F6 ABCA2
DIABLO
VPS4B
PLEKHA6
TMED8
NAA16
SPECC1
DOLK
EXT1
3110047P20RIK
ZCCHC16
TRDMT1 CCNG1
SRSF6
ZFP397
ABHD13
POLR1B
ALDH18A1
CERK
AP4B1SOS2
FEM1A
DDX27
FAM78B ZFP790
CPSF1
KLF12
INTS9
EML4
PIK3C2A
PITRM1
GNPAT
LTV1
IKBKB
THTPA
PEX5
LRRCC1
AHCYL2
ERMP1
DIS3L
CHD1
BC031353
KLHDC8B
ZFP786
ZFP354B NDST3
4932415G12RIK
C030048B08RIK
GNE
DIAP2
RBM41
9530077C05RIK
A130010J15RIKHNRPLL
DKKL1
SPRYD4
IGDCC4 HCCS
CHML
FRMD5
FAM40A
NUP54
KCNB2
TMEM88B
ZFP112
2610002I17RIK
CARS2 AMY1
TRIM62DR1
SLC8A3RPA2
FBXL20
FAM171A1
TFDP2
GBE1
INF2
TBCD
SLC35B3
NUP107
ZFP597
SPAG6
BC052040
RDH10
ENHO
WRAP53
TBL2
E430025E21RIK
1110007C09RIK TRAF2
APEH
PARP8
POLR2A
GALNTL1
AIFM2
TMTC2 GORAB
FBXO31
POMGNT1
UBIAD1
THOC6
D10WSU102E NUP35
EP300
TBC1D9B
FBXO32
SLC35F1
RRM2B
POLR3H
ZFP369 SLIT2
ZFP560COG2
PRDM2
ZFP804A
ZFP275
TRAFD1
HSD17B7
OXSR1
CIDEA
TMEM214
TGFB3
PUS1 RRAGC
C030039L03RIKC87436
ZDHHC16
SFMBT1
ERI2
PCYT1B
ENOX1
CCDC130
SUMF2 SRSF1
EIF4ENIF1
TM9SF2
POR
MFAP3
TOM1
ZBTB43
VPS13D
RAP1A
1600021P15RIK
RAD17
2700050L05RIK CDC7
NGLY1
DDX18 TGS1
GFRA2
GLB1
STARD3NL ZEB2
FAM120B
TMEM109
ZFP60
FNDC3A
PXMP3
HSPA13
PGM2
3110043O21RIK
ZC3H12C
EDEM3
ITSN2
SLC29A1
CAPN7 HSPA2
ADCY2
ENDOD1 SYTL2
NUDT4
LAPTM4B
6530418L21RIK
SP4 PWP1
RAPGEF6
SURF4
SBF2
BBS2ZFP40
VTI1A
ZFP952
NPLOC4
INTS8
NEK9
ILVBL ZMYM5
SNRNP200
PABPC1
EXOSC10
VPS39
0610031J06RIK
OPRL1
DTL
RPAP2 CDK4 PACSIN2
MLL3
ARFGAP2 PIK3R1
HERC3
ADD1
IPO5 KLHL24
PDIA4
NCSTN
SORT1 FRY
SORBS1
REPS2 HMGCR
MYSM1 SPRY2
TPPP
TMEM170B
PPP3CA
FBXW7
FTL1
MAPRE1 GRIA1 CD200 LRRC17
GABRA1
4
0.
0

0.2
0.6
−2
−4

0 1 2 3 4 5
Average expression

Fig. 14 Identification of highly variable genes using Seurat’s FindVariableGenes function (see documentation
for details). This is an appropriate strategy for feature selection on scRNA-seq data that do not contain UMIs

2. Next, we train a Random Forest (RF) model [52] on the
snRNA-seq data and use it to assign cluster labels to the mouse
VC data. Given a cell, the classifier maps it to one of 26 clusters.
To account for scale differences between the snRNA-seq (3′-biased,
UMI-based) and Smart-seq (full-length, non-UMI-based) data, we
standardize the two datasets (z-score values along each gene).
After training the classifier on the snRNA-seq data, we apply it
to each cell from the mouse VC data and assign the cell to one
of the 26 snRNA-seq clusters.

3. How do the cluster assignments compare with the cluster labels
obtained from Tasic et al. [15]? Note that the latter labels were
not used in any way, either to construct the classifier or to
influence the cluster assignment of cells. It is therefore
interesting to see whether there is any correspondence between
mouse cortical cell types and their assigned “Human” type
based on an unbiased classifier. We examine the confusion
matrix, as before (Fig. 15).

The rows correspond to the Tasic et al. clusters, while each
column corresponds to an snRNA-seq cluster. The matrix is
row-normalized, such that each row adds up to 100%. First, we
see that clusters 1–4 and 26, which are of cerebellar origin, receive
very few matches from the mouse VC data, which largely map to
human clusters originating from VC and FC samples. Among
non-neuronal cells, we see that mouse astrocytes and oligodendrocytes
map to clusters 23 and 25, which are human astrocytes and
oligodendrocytes, respectively. Inhibitory neuronal groups expressing
parvalbumin (Pvalb), somatostatin (Sst), and vasoactive intestinal
peptide (Vip) map to clusters 6, 5, and 8, respectively.
Examining the expression of these markers in the snRNA-seq data
validates the RF cluster assignments (Fig. 16). Thus, despite the
fact that these two data sets differ in species (human vs. mouse), cell
fraction profiled (cytoplasmic vs. nucleus-only), profiling method
(Smart-Seq vs. droplet-based sequencing), and clustering method
(gene clustering vs. PCA-based methods vs. PCA-Louvain clustering),
the overall results are comparable and interpretable, suggesting
that the transcriptomic space these cells occupy is being
appropriately parsed into subtypes.
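The per-gene standardization from step 2 and the row-normalized confusion matrix described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the chapter's R/Seurat workflow: the function names (zscore_genes, row_normalized_confusion), the toy expression values, and the toy labels are all hypothetical, the classifier itself is omitted, and the use of the population standard deviation is an assumption.

```python
from collections import Counter
from statistics import mean, pstdev


def zscore_genes(cells):
    """Standardize each gene (column) to mean 0 and unit SD across cells (rows)."""
    columns = list(zip(*cells))                    # one tuple of values per gene
    means = [mean(col) for col in columns]
    # Assumption: population SD; a constant gene gets SD 1.0 to avoid dividing by zero.
    sds = [pstdev(col) or 1.0 for col in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in cells]


def row_normalized_confusion(known, predicted):
    """Percentage of cells from each known cluster assigned to each predicted cluster."""
    counts = Counter(zip(known, predicted))
    row_totals = Counter(known)
    rows, cols = sorted(set(known)), sorted(set(predicted))
    return {r: {c: 100.0 * counts[(r, c)] / row_totals[r] for c in cols} for r in rows}


# Toy data (hypothetical): 4 cells x 2 genes measured on very different scales.
umi_counts = [[0, 2], [1, 4], [2, 6], [3, 8]]
full_length = [[10, 500], [20, 700], [30, 900], [40, 1100]]
z_umi = zscore_genes(umi_counts)
z_full = zscore_genes(full_length)

# Toy labels (hypothetical): published cluster names vs. classifier output.
known = ["Pvalb", "Pvalb", "Sst", "Sst", "Sst", "Vip"]
predicted = [6, 6, 5, 5, 8, 8]
confusion = row_normalized_confusion(known, predicted)
for name, row in confusion.items():
    print(name, {cluster: round(pct, 1) for cluster, pct in row.items()})
```

After standardization the two toy datasets are on the same scale even though their raw values differ by orders of magnitude, which is the point of z-scoring before applying a classifier trained on one platform to data from another. Each row of the confusion matrix sums to 100, so rows with different numbers of cells can be compared directly.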

[Figure 15 image: confusion matrix with rows labeled "Known" (Tasic et al. clusters), columns labeled "Predicted" (snRNA-seq clusters), and a "Percentage" legend from 0 to 100.]

Fig. 15 Transcriptional correspondence between mouse cortical clusters reported in Tasic et al. [15] (rows)
and those in this study (columns). Representation as in Fig. 11

This concludes the basic workflow. We can save files from the
analysis as follows.

[Figure 16 image: three panels (PVALB, SST, VIP) showing expression level by cluster; x-axis: Identity (clusters 1–26).]

Fig. 16 Expression of three classic markers known to distinguish inhibitory neuronal categories: PVALB (top),
SST (middle), and VIP (bottom)

Acknowledgments

K. S. would like to acknowledge support from NIH
1K99EY028625-01, the Klarman Cell Observatory, and the laboratory
of Dr. Aviv Regev at the Broad Institute. We would like to
gratefully acknowledge critical feedback from Drs. Inbal Benhar
and Jose Ordovas-Montanes.

References
1. Vickaryous MK, Hall BK (2006) Human cell type diversity, evolution, development, and classification with special reference to cells derived from the neural crest. Biol Rev Camb Philos Soc 81(3):425–455
2. Regev A et al (2017) The human cell atlas. Elife:6
3. Tosches MA et al (2018) Evolution of pallium, hippocampus, and cortical cell types revealed by single-cell transcriptomics in reptiles. Science 360(6391):881–888
4. Boisset JC et al (2018) Mapping the physical network of cellular interactions. Nat Methods
5. Tanay A, Regev A (2017) Scaling single-cell genomics from phenomenology to mechanism. Nature 541(7637):331–338
6. Trapnell C (2015) Defining cell types and states with single-cell genomics. Genome Res 25(10):1491–1498
7. Cleary B et al (2017) Efficient generation of transcriptomic profiles by random composite measurements. Cell 171(6):1424–1436.e18
8. Klein AM et al (2015) Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell 161(5):1187–1201
9. Macosko EZ et al (2015) Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell 161(5):1202–1214
10. Zheng GX et al (2017) Massively parallel digital transcriptional profiling of single cells. Nat Commun 8:14049
11. Habib N et al (2016) Div-Seq: single-nucleus RNA-Seq reveals dynamics of rare adult newborn neurons. Science 353(6302):925–928
12. Lake BB et al (2016) Neuronal subtypes and diversity revealed by single-nucleus RNA sequencing of the human brain. Science 352(6293):1586–1590
13. Shekhar K et al (2016) Comprehensive classification of retinal bipolar neurons by single-cell transcriptomics. Cell 166(5):1308–1323.e30
14. Villani A-C et al (2017) Single-cell RNA-seq reveals new types of human blood dendritic cells, monocytes, and progenitors. Science 356(6335):eaah4573
15. Tasic B et al (2016) Adult mouse cortical cell taxonomy revealed by single cell transcriptomics. Nat Neurosci 19(2):335–346
16. Zeng H, Sanes JR (2017) Neuronal cell-type classification: challenges, opportunities and the path forward. Nat Rev Neurosci 18(9):530
17. Stegle O, Teichmann SA, Marioni JC (2015) Computational and analytical challenges in single-cell transcriptomics. Nat Rev Genet 16(3):133
18. Arendt D (2008) The evolution of cell types in animals: emerging principles from molecular studies. Nat Rev Genet 9(11):868–882
19. Ecker JR et al (2017) The BRAIN initiative cell census consortium: lessons learned toward generating a comprehensive BRAIN cell atlas. Neuron 96(3):542–557
20. Kolodziejczyk AA et al (2015) The technology and biology of single-cell RNA sequencing. Mol Cell 58(4):610–620
21. Islam S et al (2014) Quantitative single-cell RNA-seq with unique molecular identifiers. Nat Methods 11(2):163
22. Menon V (2017) Clustering single cells: a review of approaches on high- and low-depth single-cell RNA-seq data. Brief Funct Genomics
23. Hicks SC, Teng M, Irizarry RA (2015) On the widespread and critical impact of systematic bias and batch effects in single-cell RNA-Seq data. bioRxiv:025528
24. Butler A et al (2018) Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat Biotechnol 36(5):411
25. Haghverdi L et al (2018) Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat Biotechnol 36:421–427
26. Lopez R et al (2018) Bayesian inference for a generative model of transcriptome profiles from single-cell RNA sequencing. bioRxiv:292037
27. Lee JH et al (2014) Highly multiplexed subcellular RNA sequencing in situ. Science 343(6177):1360–1363
28. Stahl PL et al (2016) Visualization and analysis of gene expression in tissue sections by spatial transcriptomics. Science 353(6294):78–82
29. Chen KH et al (2015) Spatially resolved, highly multiplexed RNA profiling in single cells. Science 348(6233):aaa6090
30. Lubeck E et al (2014) Single-cell in situ RNA profiling by sequential hybridization. Nat Methods 11(4):360
31. Fuzik J et al (2016) Integration of electrophysiological recordings with single-cell RNA-seq data identifies neuronal subtypes. Nat Biotechnol 34(2):175
32. Dixit A et al (2016) Perturb-Seq: dissecting molecular circuits with scalable single-cell RNA profiling of pooled genetic screens. Cell 167(7):1853–1866.e17
33. Stoeckius M et al (2017) Simultaneous epitope and transcriptome measurement in single cells. Nat Methods 14(9):865
34. Frieda KL et al (2017) Synthetic recording and in situ readout of lineage information in single cells. Nature 541(7635):107–111
35. Raj B et al (2018) Simultaneous single-cell profiling of lineages and cell types in the vertebrate brain. Nat Biotechnol 36(5):442–450
36. Pertea M et al (2016) Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. Nat Protoc 11(9):1650
37. Villani AC, Shekhar K (2017) Single-cell RNA sequencing of human T cells. Methods Mol Biol 1514:203–239
38. Satija R et al (2015) Spatial reconstruction of single-cell gene expression data. Nat Biotechnol 33(5):495–502
39. Lake BB et al (2018) Integrative single-cell analysis of transcriptional and epigenetic states in the human adult brain. Nat Biotechnol 36(1):70–80
40. Pandey S et al (2018) Comprehensive identification and spatial mapping of Habenular neuronal types using single-cell RNA-Seq. Curr Biol 28(7):1052–1065.e7
41. Andrews TS, Hemberg M (2017) Identifying cell populations with scRNASeq. Mol Asp Med
42. Brennecke P et al (2013) Accounting for technical noise in single-cell RNA-seq experiments. Nat Methods 10(11):1093
43. Keogh E, Mueen A (2017) Curse of dimensionality. In: Encyclopedia of machine learning and data mining. Springer, pp 314–315
44. Hotelling H (1933) Analysis of a complex of statistical variables into principal components. J Educ Psychol 24(6):417
45. Hyvärinen A, Karhunen J, Oja E (2004) Independent component analysis, vol 46. Wiley, New York
46. Lee DD, Seung HS (2001) Algorithms for non-negative matrix factorization. In: Leen TK, Dietterich TG, Tresp V (eds) Advances in neural information processing systems, vol 13. MIT, Cambridge, UK
47. Haghverdi L et al (2016) Diffusion pseudotime robustly reconstructs lineage branching. Nat Methods 13(10):845
48. Lancichinetti A, Fortunato S (2009) Community detection algorithms: a comparative analysis. Phys Rev E Stat Nonlinear Soft Matter Phys 80(5 Pt 2):056117
49. Levine JH et al (2015) Data-driven phenotypic dissection of AML reveals progenitor-like cells that correlate with prognosis. Cell 162(1):184–197
50. van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9(Nov):2579–2605
51. Soneson C, Robinson MD (2018) Bias, robustness and scalability in single-cell differential expression analysis. Nat Methods 15(4):255
52. Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Chapter 5

Rare Cell Type Detection

Lan Jiang

Abstract
High-throughput single-cell technologies have great potential to discover new cell types. Here, we present
a novel computational method, called GiniClust (Jiang et al., Genome Biol 17(1):144, 2016), to overcome
the challenge of detecting rare cell types that are distinct from a large population.

Key words Clustering, Single-cell analysis, RNA-seq, qPCR, Gini index, Rare cell type

1 Introduction

The cell is the basic unit of structure and function in life; however,
our knowledge of cell types remains largely incomplete. Interestingly,
in many developmental and disease contexts, rare cell types often play
an important role even though they make up only a small proportion of
the cell population. For example, stem cells that give rise to newborn
neurons in the adult brain are critical for reversing neurodegenerative
diseases [1], and drug-resistant cells are a key barrier to curing
cancer [2].
Genome-wide gene expression profiles are now widely accepted as a way
to define cell types. Recent technical advances in massively parallel
single-cell RNA-seq provide an unprecedented opportunity to discover
cell types that were previously unrecognized because of their rarity.
Although increasing the number of cells profiled in a single-cell
transcriptome assay raises the chance that rare cells are sampled,
computational methods tailored to detect them remain in high demand.
One of the major challenges is to identify genes that are associated
with rare cell types without prior biological knowledge. GiniClust [3]
adapts the Gini index, an informative statistical measure widely used
in the social sciences, to select rare cell-type-associated genes. It is
implemented in Python and R and can be applied to datasets originating
from different platforms, such as multiplex qPCR data, traditional
single-cell RNA-seq, or UMI-based single-cell RNA-seq, e.g., inDrops,
Drop-seq, and 10x Genomics.

Guo-Cheng Yuan (ed.), Computational Methods for Single-Cell Data Analysis, Methods in Molecular Biology, vol. 1935,
https://doi.org/10.1007/978-1-4939-9057-3_5, © Springer Science+Business Media, LLC, part of Springer Nature 2019
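To make the role of the Gini index more concrete, the following is a minimal Python sketch of the standard Gini index applied to one gene's expression values across cells. It is an illustration only, not GiniClust's own R code; the function name gini_index and the toy values are assumptions made for this example.

```python
def gini_index(values):
    """Gini index of non-negative values: 0 = evenly spread, close to 1 = concentrated."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard rank-weighted formula on the sorted values (ranks start at 1).
    weighted = sum((rank + 1) * x for rank, x in enumerate(xs))
    return (2.0 * weighted) / (n * total) - (n + 1.0) / n


# Toy data (hypothetical): one gene expressed evenly, one expressed in 3 of 100 cells.
evenly_expressed = [10] * 100
rare_marker = [0] * 97 + [50] * 3
print(gini_index(evenly_expressed), gini_index(rare_marker))
```

A gene expressed at the same level in every cell has a Gini index of 0, whereas a gene expressed in only a handful of cells has a value close to 1, which is why a high Gini index is a plausible signature of a rare cell-type marker.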

2 Materials

2.1 Operating System

Linux or macOS is the preferred operating system. On Windows,
Cygwin is highly recommended to build a Linux-like environment.
It can be downloaded from https://cygwin.com/install.html. Internet
access is necessary for the installation of GiniClust and its
dependencies.

2.2 Python Packages

The graphical user interface of GiniClust relies on wxPython, a
Python wrapper for the cross-platform wxWidgets API. Please
ensure that you have Python 2.7 in your environment. In addition,
GiniClust relies on the following libraries:
1. Gooey.
2. Setuptools.
These packages should be installed or upgraded automatically
via a pip installation. For instance, to install Gooey, proceed as
follows:
1. Start a terminal session.
2. Run $ pip install Gooey --upgrade

If in doubt, please check that these libraries were installed properly
by trying to import them or some of their modules in your
Python interpreter: >>> import gooey, pkg_resources
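The import check above can also be scripted. The sketch below is written for Python 3 (it uses importlib.util.find_spec, which is not available under the Python 2.7 interpreter that the chapter requires, where a try/except ImportError around the import would be the equivalent). The helper name missing_modules and the module list are assumptions for illustration.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Module names assumed from the text: Gooey, Setuptools (pkg_resources), wxPython (wx).
print("Missing modules:", missing_modules(["gooey", "pkg_resources", "wx"]))
```

An empty list means every listed module can be found by the interpreter that ran the check.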

2.3 R Packages

As for the R code at the core of much of GiniClust's computations,
on Mac and Windows only the official R installation file (version
3.2.1 or higher) is supported and tested. Other installation methods,
such as brew, may lead to runtime errors. In addition, some users
might experience issues installing another of GiniClust's
dependencies, the MAST R package (see Note 1).

2.4 Input Files

The input file is a gene expression matrix in comma-separated value
(csv) format. Specifically, for qPCR data, each row contains the log2
gene expression levels of one gene; for RNA-seq data, each row contains
the UMI counts or raw read counts per cell (see Note 2). The first row
contains cell IDs. The first column contains unique gene names. For
example, users can take a look at one of our test datasets (stored in the
sample_data folder within GiniClust's repository):

> ExprM.RawCounts <- read.csv("Data_GBM.csv", sep = ",", header = TRUE)
> ExprM.RawCounts[1:4, 1:4]
#             MGH26 MGH26.1 MGH26.2 MGH26.3
# 1/2-SBSRNA4     0      47       0       0
# A1BG           41      80       3       0
# A1BG-AS1        0       0       0       0
# A1CF            0       0       0       0
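Before running GiniClust, it can help to confirm that a file follows the layout just described. The following Python sketch checks a csv file for a header row of cell IDs, unique gene names in the first column, and numeric values elsewhere. It is not part of GiniClust; the function name and the assumption that the header row has a leading field above the gene-name column are made for this example.

```python
import csv
import io


def check_expression_csv(handle):
    """Return a list of layout problems found in an expression matrix csv."""
    rows = list(csv.reader(handle))
    header, body = rows[0], rows[1:]
    n_cells = len(header) - 1  # assumption: first header field sits above the gene names
    genes = [row[0] for row in body]
    problems = []
    if len(set(genes)) != len(genes):
        problems.append("duplicate gene names")
    for row in body:
        if len(row) - 1 != n_cells:
            problems.append("row %s has %d values, expected %d"
                            % (row[0], len(row) - 1, n_cells))
            continue
        try:
            [float(v) for v in row[1:]]
        except ValueError:
            problems.append("row %s has a non-numeric value" % row[0])
    return problems


# Toy file contents (hypothetical) in the layout described in the text.
example = "gene,MGH26,MGH26.1\nA1BG,41,80\nA1CF,0,0\n"
print(check_expression_csv(io.StringIO(example)))
```

An empty list indicates that no layout problem was detected; for a real file, pass an open file handle instead of the StringIO object.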

3 Methods

3.1 Run GiniClust Through Python-Based Graphical User Interface

To run GiniClust, please download the GiniClust GitHub repository
from https://github.com/lanjiangboston/GiniClust/archive/master.zip,
unzip it, and move to the extracted directory so that it
becomes your current working directory. Then, in a Linux environment,
start a terminal session and enter:

$ python GiniClust.py

From an OS X or Windows environment, launch a terminal session and enter:

$ pythonw GiniClust.py

A graphical user interface opens and prompts you to choose a file to
process from your directory tree, to specify the type of data by
choosing “qpcr” or “rna” under the “Actions” column, and to give the
name of the folder where you would like to store GiniClust's output
(see Subheading 3.4 for more information about those files). A
screenshot is provided (Fig. 1).

3.2 Run GiniClust Through R Script Main Function

Alternatively, GiniClust can be run directly as an R script from the
command-line interface. Users can run GiniClust in a terminal session
using Rscript as follows:

$ Rscript GiniClust_Main.R [options]

You can specify the following options:

• -f CHARACTER or --file=CHARACTER: input dataset file name
• -t CHARACTER or --type=CHARACTER: input dataset type, choose from ‘qPCR’ or ‘RNA-seq’
• -o CHARACTER or --out=CHARACTER: output folder name [default = results]
• -e DOUBLE or --epsilon=DOUBLE: DBSCAN epsilon parameter, qPCR: [default = 0.25], RNA-seq: [default = 0.5]
• -m INTEGER or --minPts=INTEGER: DBSCAN minPts parameter, qPCR: [default = 5], RNA-seq: [default = 3]
• -h or --help: show help message and exit

Fig. 1 Graphical user interface of GiniClust

For example, the following command is used to analyze the
‘Data_GBM.csv’ dataset:

$ Rscript GiniClust_Main.R -f Data_GBM.csv -t RNA-seq -o GBM_results

The following command is used to analyze the ‘Data_qPCR.csv’ dataset:

$ Rscript GiniClust_Main.R -f Data_qPCR.csv -t qPCR -o qPCR_results
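When many datasets need to be processed, the command line can be assembled programmatically. The sketch below builds the argument list for the Rscript call from the options and per-platform DBSCAN defaults listed above. The helper name giniclust_command is hypothetical, and the sketch only constructs the command; launching it (for example with subprocess.run) is left to the user.

```python
# DBSCAN defaults per data type, copied from the option list in the text.
DBSCAN_DEFAULTS = {
    "qPCR": {"epsilon": 0.25, "minPts": 5},
    "RNA-seq": {"epsilon": 0.5, "minPts": 3},
}


def giniclust_command(data_file, data_type, out_folder="results",
                      epsilon=None, min_pts=None):
    """Build the argument list for `Rscript GiniClust_Main.R` (hypothetical helper)."""
    if data_type not in DBSCAN_DEFAULTS:
        raise ValueError("data_type must be 'qPCR' or 'RNA-seq'")
    defaults = DBSCAN_DEFAULTS[data_type]
    eps = defaults["epsilon"] if epsilon is None else epsilon
    pts = defaults["minPts"] if min_pts is None else min_pts
    return ["Rscript", "GiniClust_Main.R", "-f", data_file, "-t", data_type,
            "-o", out_folder, "-e", str(eps), "-m", str(pts)]


print(" ".join(giniclust_command("Data_GBM.csv", "RNA-seq", "GBM_results")))
```

Passing the DBSCAN parameters explicitly, even when they equal the defaults, records the settings used for each run.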

The default parameters for GiniClust are shown below:

minCellNum = 3
minGeneNum = 2000
expressed_cutoff = 1
log2.expr.cutoffl = 0
log2.expr.cutoffh = 20
Gini.pvalue_cutoff = 0.0001
Norm.Gini.cutoff = 1
span = 0.9
outlier_remove = 0.75
Gamma = 0.9
diff.cutoff = 1
lr.p_value_cutoff = 1e-5
CountsForNormalized = 100000
rare_p = 0.05
perplexity = 30

The default parameters usually work well for identifying rare cell
types in datasets from different platforms. However, users can always
explore different parameters by opening
“yourworkdir/Rfunction/GiniClust_parameters.R” in a text editor. After
changing the parameters, users can rerun GiniClust through the
GiniClust_Main.R script.

3.3 Run GiniClust Through R Script Step by Step

While GiniClust_Main.R is easier to use, running GiniClust step by
step gives users more control and is thus more suitable for
customized analysis. GiniClust stores intermediate files for each
step, so users can conveniently check the results, change parameters
if necessary, and rerun a single step before moving on to the next
one. Step-by-step instructions for running GiniClust are given below.

3.3.1 Loading Packages

This step first loads all the additional required packages. The first
time GiniClust is run, this step automatically installs all required R
packages, so Internet access is necessary.

> source("Rfunction/GiniClust_packages.R")

Additionally, users can load all the R functions defined by GiniClust:

> source("Rfunction/GiniClust_Preprocess.R")
> source("Rfunction/GiniClust_Filtering.R")
> source("Rfunction/GiniClust_Fitting.R")
> source("Rfunction/GiniClust_Clustering.R")
> source("Rfunction/GiniClust_tSNE.R")
> source("Rfunction/DE_MAST.R")
> source("Rfunction/DE_t_test.R")

3.3.2 Preprocessing

This step preprocesses and loads the input data. The input data for
RNA-seq should be raw read counts or UMI counts. For qPCR data, since
the input data are on the log2 scale, GiniClust transforms them back to
the linear scale so that the later steps process all data consistently.
The input variables include data.type, out.folder, and exprimentID,
while the input file is “exprimentID_rawCounts.csv.” The output
variable is “ExprM.RawCounts,” which is saved as a file named
“exprimentID_rawCounts.csv.”

> ExprM.RawCounts = GiniClust_Preprocess(data.file, data.type, out.folder, exprimentID)

3.3.3 Filtering the Preprocessed Data

GiniClust provides some basic filtering functions. The input variable
is “ExprM.RawCounts.” The output variable is “ExprM.RawCounts.filter.”
The intermediate stored file is
“exprimentID_gene.expression.matrix.RawCounts.filtered.csv.” The
parameters involved and their default values are:

> minCellNum = 3
> minGeneNum = 2000
> expressed_cutoff = 1

After modifying the parameters (see Note 3), run the code
below:

> ExprM.Results.filter = GiniClust_Filtering(ExprM.RawCounts, out.folder, exprimentID)
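The effect of the three filtering parameters can be illustrated with a small Python sketch. It is a simplified reading of the parameter descriptions, not GiniClust's R implementation: the function name filter_matrix, the use of ">=" against expressed_cutoff, and the order of the two filters (cells first, then genes) are assumptions.

```python
def filter_matrix(counts, min_cell_num=3, min_gene_num=2000, expressed_cutoff=1):
    """counts: dict mapping gene -> list of per-cell counts (same cell order for all genes).

    Returns (filtered_counts, kept_cell_indices).
    """
    n_cells = len(next(iter(counts.values())))
    # Keep cells in which at least min_gene_num genes reach the expression cutoff.
    genes_per_cell = [
        sum(1 for row in counts.values() if row[j] >= expressed_cutoff)
        for j in range(n_cells)
    ]
    kept_cells = [j for j in range(n_cells) if genes_per_cell[j] >= min_gene_num]
    # Keep genes that reach the cutoff in at least min_cell_num of the kept cells.
    filtered = {}
    for gene, row in counts.items():
        kept = [row[j] for j in kept_cells]
        if sum(1 for v in kept if v >= expressed_cutoff) >= min_cell_num:
            filtered[gene] = kept
    return filtered, kept_cells


# Toy matrix (hypothetical) with thresholds lowered to suit its size.
toy = {"g1": [5, 5, 5, 0], "g2": [1, 0, 2, 0], "g3": [0, 0, 0, 9]}
print(filter_matrix(toy, min_cell_num=2, min_gene_num=2))
```

With the toy thresholds, the two cells that express only one gene are dropped, and g3, which is no longer expressed in any remaining cell, is removed.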

3.3.4 Selecting a Subset of the Genes for Clustering

This step normalizes the Gini index by LOESS curve fitting in the
space of Gini index versus maximum expression (Gini vs. Max). The input
variable is “ExprM.RawCounts.filter.” The output variable is
“Genelist.top_pvalue” or “Genelist.HighNormGini.” The intermediate
stored file is “Gini_related_table_RNA-seq.csv” or
“Gini_related_table_qPCR.csv.” The parameters involved and their
default values are:

> log2.expr.cutoffl = 0
> log2.expr.cutoffh = 20
> Gini.pvalue_cutoff = 0.0001
> Norm.Gini.cutoff = 1
> span = 0.9
> outlier_remove = 0.75

After modifying the parameters (see Note 4), run the code
below:

> GeneList.final = GiniClust_Fitting(data.type, ExprM.RawCounts.filter, out.folder, exprimentID)
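The idea of normalizing the Gini index against a fitted trend can be sketched as follows. This is a simplified stand-in rather than GiniClust's procedure: it uses a plain LOESS-style local linear regression with tricube weights, takes the residual from the fitted curve as the normalized Gini index, and omits the outlier removal and p-value estimation controlled by outlier_remove and Gini.pvalue_cutoff. The function names and the toy data are assumptions.

```python
def loess_fit(xs, ys, span=0.9):
    """Local linear regression with tricube weights, evaluated at every x in xs."""
    n = len(xs)
    k = max(2, int(round(span * n)))  # number of neighbours used in each local fit
    fitted = []
    for x0 in xs:
        dists = sorted(abs(x - x0) for x in xs)
        h = dists[k - 1] or 1e-12  # bandwidth: distance to the k-th nearest point
        w = [max(0.0, 1 - (abs(x - x0) / h) ** 3) ** 3 for x in xs]
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


def normalized_gini(log2_max, gini, span=0.9):
    """Residual of each gene's Gini index from the fitted Gini-vs-Max trend."""
    fit = loess_fit(log2_max, gini, span)
    return [g - f for g, f in zip(gini, fit)]


# Toy data (hypothetical): a linear trend with one gene far above it.
xs = [float(i) for i in range(10)]
ys = [0.9 - 0.05 * x for x in xs]
ys[5] += 0.3
residuals = normalized_gini(xs, ys)
print(residuals.index(max(residuals)))
```

Genes whose Gini index lies well above the trend for their expression level receive the largest residuals and would be the candidates for rare cell-type-associated genes.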

3.3.5 Clustering

This step builds a cell-cell Jaccard distance matrix based on the
genes selected in Subheading 3.3.4. DBSCAN is then used to detect
clusters. The input variable is “ExprM.RawCounts.filter.” The output
variables are “cell.cell.distance,” “c_membership,”
“clustering_membership_r,” and “rare.cells.list.all.” The intermediate
files are “exprimentID_clusterID.csv” and
“exprimentID_rare_cells_list.txt.” The parameters involved and their
default values are:

> eps = 0.5

> MinPts = 3
> rare_p = 0.05

After modifying the parameters (see Note 5), run the code
below:

> Cluster.Results = GiniClust_Clustering(data.type, ExprM.RawCounts.filter, GeneList.final, eps, MinPts, out.folder, exprimentID)
> cell.cell.distance = Cluster.Results$cell_cell_dist
> c_membership = Cluster.Results$c_membership
> clustering_membership_r = Cluster.Results$clustering_membership_r
> rare.cells.list.all = Cluster.Results$rare.cell

Users can check the clustering results, for example,

> table(c_membership)
> print(rare.cells.list.all)
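The clustering step can be illustrated with a compact Python sketch. It assumes that the cell-cell distance is a Jaccard distance computed on binarized expression of the selected genes, runs a minimal DBSCAN on the resulting distance matrix, and then flags clusters whose share of cells is below rare_p. All function names and the toy cells are hypothetical, and the sketch is not a substitute for GiniClust's R code.

```python
from collections import Counter


def jaccard_distance(a, b):
    """Jaccard distance between two binary vectors (1 = gene expressed)."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 0.0 if either == 0 else 1.0 - float(both) / either


def dbscan(dist, eps, min_pts):
    """Minimal DBSCAN on a precomputed distance matrix; label 0 = unclustered."""
    n = len(dist)
    labels = [None] * n
    neighbours = [[j for j in range(n) if dist[i][j] <= eps] for i in range(n)]
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = 0  # provisionally unclustered
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == 0:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:
                queue.extend(neighbours[j])  # j is a core point: keep expanding
    return labels


def rare_clusters(labels, rare_p=0.05):
    """Cluster labels whose fraction of all cells is below rare_p (label 0 skipped)."""
    counts = Counter(label for label in labels if label != 0)
    n = len(labels)
    return sorted(c for c, k in counts.items() if float(k) / n < rare_p)


# Toy cells (hypothetical): four of one type, three of another, one singleton.
cells = [[1, 1, 1, 0, 0, 0]] * 4 + [[0, 0, 0, 1, 1, 1]] * 3 + [[1, 0, 0, 1, 0, 0]]
dist = [[jaccard_distance(a, b) for b in cells] for a in cells]
labels = dbscan(dist, eps=0.5, min_pts=3)
print(labels, rare_clusters(labels, rare_p=0.4))
```

In the toy example the threshold is raised to 0.4 only because the dataset has eight cells; with the default rare_p of 0.05 a cluster must contain fewer than 5% of all cells to be reported as rare.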

3.3.6 tSNE Visualization

A nonlinear dimension reduction technique called tSNE is used to
visualize the clustering results. The input variables are
“c_membership” and “cell.cell.distance.” The output variable is
“Rtnse_coord2.” The intermediate stored file is
“exprimentID_Rtnse_coord2.csv.” A figure is also generated and
stored in the work folder (Fig. 2). The parameters involved and their
default values are:

> perplexity = 30

After modifying the parameters (see Note 6), run the code
below:

> GiniClust_tSNE(data.type, c_membership, cell.cell.distance, perplexity, out.folder, exprimentID)

Fig. 2 An example of tSNE visualization of clustering results (axes: tSNE_1 and tSNE_2, each spanning −10 to 10; legend: singleton, cluster1, cluster2)

3.3.7 Differentially Expressed Genes

Comparing the single-cell expression data of each putative rare cell
type cluster with the largest cluster identifies differentially
expressed genes. The input variables are “ExprM.RawCounts.filter,”
“rare.cells.list.all,” and “c_membership.” The output variable is
“differential.r.” The intermediate stored files are
“RareCluster.overlap_genes.txt” and “RareCluster_lrTest.csv” or
“RareCluster.diff.gene.t-test.results.csv.” The parameters involved
and their default values are:

diff.cutoff = 1
lr.p_value_cutoff = 1e-5

After modifying the parameters (see Note 7), for RNA-seq data,
run the code below:

> DE_MAST(ExprM.RawCounts.filter, rare.cells.list.all, out.folder, exprimentID)

For qPCR data, run the code below:

> DE_t_test(ExprM.RawCounts.filter,rare.cells.list.all,c_membership,out.folder,exprimentID)
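The role of the two cutoffs can be sketched in Python. The example below computes a log2 fold change between a rare cluster and the largest cluster and then keeps genes that pass both cutoffs. It does not reproduce the MAST likelihood ratio test or the t-test; the p values in the toy input are made up, and the function names, the pseudocount, and the use of the absolute fold change are assumptions.

```python
import math


def log2_fold_change(rare_values, major_values, pseudocount=1.0):
    """log2 ratio of mean expression (rare cluster over largest cluster)."""
    mean_rare = sum(rare_values) / float(len(rare_values))
    mean_major = sum(major_values) / float(len(major_values))
    return math.log2((mean_rare + pseudocount) / (mean_major + pseudocount))


def select_de_genes(stats, diff_cutoff=1.0, p_value_cutoff=1e-5):
    """stats: dict gene -> (log2 fold change, p value). Keep genes passing both cutoffs."""
    return sorted(
        gene for gene, (lfc, p) in stats.items()
        if abs(lfc) > diff_cutoff and p < p_value_cutoff
    )


# Toy statistics (hypothetical gene names and p values).
stats = {
    "geneA": (log2_fold_change([7, 7, 7], [1, 1, 1]), 1e-8),
    "geneB": (0.4, 1e-9),
    "geneC": (3.0, 0.01),
}
print(select_de_genes(stats))
```

Raising diff_cutoff or lowering p_value_cutoff can only shrink the returned list, which matches the behaviour described in Note 7.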

3.4 Full List of Output Files and Description

The output directory specified by the user at the graphical user
interface contains the following files and directories.
Main results:
exprimentID_rare_cells_list.txt: the clusters of rare cells detected by GiniClust.
RareCluster_lrTest.csv or RareCluster.diff.gene.t-test.results.csv: differentially expressed gene results for the rare cell type cluster.
Other supporting results:
exprimentID_rawCounts.csv: the raw counts.
exprimentID_normCounts.csv: the normalized counts.
exprimentID_gene.expression.matrix.RawCounts.filtered.csv: the raw counts after filtering.
exprimentID_gene.expression.matrix.normCounts.filtered.csv: the normalized counts after filtering.
Gini_related_table_RNA-seq.csv: the Gini index table for RNA-seq data.
Gini_related_table_qPCR.csv: the Gini index table for qPCR data.
exprimentID_clusterID.csv: the clustering result; the first column contains cell IDs and the second column contains the cluster assigned to each cell.
exprimentID_Rtnse_coord2.csv: coordinates of cells in the tSNE plot.
exprimentID_bi-directional.GiniIndexTable.csv: for qPCR data, the table of bidirectional Gini index.
RareCluster.overlap_genes.txt: genes that overlap between the selected high Gini genes and the DE genes in the rare cluster.
Sub-folder ‘figures’:
exprimentID_histogram of Normalized.Gini.Socre.pdf: histogram of estimated p-values based on a normal distribution approximation for genes.
exprimentID_smoothScatter_pvalue_gene.pdf: the smoothScatter plot in which the red points are the high Gini genes selected according to the specified cutoff.
exprimentID_tsne_plot.pdf: tSNE plot for cells.
exprimentID_RareCluster_diff_gene_overlap.pdf: Venn diagram of differentially expressed genes and high Gini genes.
exprimentID_RareCluster_overlapgene_rawCounts_bar_plot.genename.pdf: bar plot of the rare cluster and the major cluster for the overlap genes.

3.5 Further Reading

It is worth noting that while GiniClust is powerful for detecting
rare cell type clusters, it is not sensitive for distinguishing major
cell types. This limitation can be partially resolved by an updated
version of GiniClust called GiniClust2. It uses a novel cluster-aware
weighted consensus clustering algorithm to combine GiniClust and
Fano-based k-means clustering results, maximizing the strengths of
these individual clustering methods in detecting rare and common
clusters, respectively [4].

4 Notes

1. If users have a problem installing the MAST package, please
visit the website (https://github.com/RGLab/MAST) for
detailed instructions. We recommend that users upgrade the
MAST package to the newest version. If you are using an old
version, you may need to replace the file DE_MAST.R with
https://github.com/lanjiangboston/GiniClust/blob/master/Archive/DE_MAST.R.
2. GiniClust may not work with log2-transformed RNA-seq data.
We suggest that users use featureCounts from
http://subread.sourceforge.net/ [5] or htseq-count from
http://www-huber.embl.de/users/anders/HTSeq/doc/counting.html [6]
to get raw read counts.
3. minCellNum is the minimum number of cells for rare cell
clusters. It is highly recommended that this value be set to 3 or
larger. However, a larger value of this parameter means less
sensitivity. minGeneNum is used for filtering out cells that may
not express enough genes. The default value of 2000 is consistent
with a recent report on scRNA-seq that used smFISH as a guide [7].
4. “log2.expr.cutoffl” and “log2.expr.cutoffh” define the range
of gene expression. “Gini.pvalue_cutoff” controls how many
genes are finally chosen. “Norm.Gini.cutoff” controls whether
high Gini genes are chosen based on p-value or not (default = 1).
“span” and “outlier_remove” are parameters used in LOESS
fitting.
5. “eps” and “MinPts” are the “epsilon” and “minimum points”
parameters for DBSCAN, respectively. The distance of any point to
its nearest core point of the same cluster is less than “epsilon.”
Larger values of “MinPts” are usually better for noisy datasets
and yield more significant clusters, but they lose sensitivity for
rare cell clusters. “rare_p” defines what is called a rare cell
cluster. For example, rare_p = 0.05 means that only a cluster that
contributes less than 5% of the whole population is called a rare
cell cluster.
6. “perplexity” is a parameter for tSNE. The results of tSNE are
fairly robust to different perplexity values. The most appropriate
value depends on the density and size of the data. Typical values
for the perplexity range between 5 and 50.
7. “diff.cutoff” and “lr.p_value_cutoff” are used to filter
differentially expressed genes based on log2 fold change and p value
during MAST analysis, respectively. Increasing the value of
“diff.cutoff” or decreasing the value of “lr.p_value_cutoff”
will decrease the number of differentially expressed genes.

References
1. Habib N, Li Y, Heidenreich M, Swiech L, Avraham-Davidi I, Trombetta JJ, Hession C, Zhang F, Regev A (2016) Div-Seq: single-nucleus RNA-Seq reveals dynamics of rare adult newborn neurons. Science 353(6302):925–928. https://doi.org/10.1126/science.aad7038
2. Sharma SV, Lee DY, Li B, Quinlan MP, Takahashi F, Maheswaran S, McDermott U, Azizian N, Zou L, Fischbach MA, Wong KK, Brandstetter K, Wittner B, Ramaswamy S, Classon M, Settleman J (2010) A chromatin-mediated reversible drug-tolerant state in cancer cell subpopulations. Cell 141(1):69–80. https://doi.org/10.1016/j.cell.2010.02.027
3. Jiang L, Chen H, Pinello L, Yuan GC (2016) GiniClust: detecting rare cell types from single-cell gene expression data with Gini index. Genome Biol 17(1):144. https://doi.org/10.1186/s13059-016-1010-4
4. Tsoucas D, Yuan GC (2018) GiniClust2: a cluster-aware, weighted ensemble clustering method for cell-type detection. Genome Biol 19(1):58. https://doi.org/10.1186/s13059-018-1431-3
5. Liao Y, Smyth GK, Shi W (2014) featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 30(7):923–930. https://doi.org/10.1093/bioinformatics/btt656
6. Anders S, Pyl PT, Huber W (2015) HTSeq--a Python framework to work with high-throughput sequencing data. Bioinformatics 31(2):166–169. https://doi.org/10.1093/bioinformatics/btu638
7. Torre E, Dueck H, Shaffer S, Gospocic J, Gupte R, Bonasio R, Kim J, Murray J, Raj A (2018) Rare cell detection by single-cell RNA sequencing as guided by single-molecule RNA FISH. Cell Syst 6(2):171–179.e175. https://doi.org/10.1016/j.cels.2018.01.014
Chapter 6

scMCA: A Tool to Define Mouse Cell Types Based on Single-Cell Digital Expression
Huiyu Sun, Yincong Zhou, Lijiang Fei, Haide Chen, and Guoji Guo

Abstract
For decades, people have been trying to define cell type with the combination of expressed genes. The
choice of the limited number of genes for the classification limits the precision of this system. Here, we build
a “single-cell Mouse Cell Atlas (scMCA) analysis” pipeline based on scRNA-seq datasets covering all mouse
cell types. We build the scMCA reference and then use the tool “scMCA” to match single-cell digital
expression to its closest cell type.

Key words scMCA, Mouse Cell Atlas, scRNA-seq

1 Introduction

Single-cell RNA sequencing (scRNA-seq) is a powerful tool for
transcriptome analysis at the single-cell level. It can reveal
the gene expression status of individual cells and capture rare
populations that are difficult to detect with conventional bulk
RNA-seq data. Recently, Han and colleagues [1] performed
scRNA-seq on about 400,000 single cells from 51 mouse tissues,
organs, and cell cultures. The study provides an initial draft of the
mouse cell atlas (MCA) with a comprehensive database that contains
the transcriptional characteristics of almost all major cell types in
mouse. It is therefore possible to construct a transcriptomic
reference to map and define unknown cell types.
Here, we established a database of more than 800 mouse cell
types using scRNA-seq data from MCA [1] together with other
studies [2–6]. After selecting differentially expressed genes and
grouping the cell type clusters, we use the averaged expression values
of each group to construct a cell-type reference. A pipeline
called scMCA analysis was established to identify cell types at the
single-cell level. The pipeline can also be extended to identify cell
types from bulk RNA-seq data.

Huiyu Sun, Yincong Zhou, Lijiang Fei, and Haide Chen contributed equally to this work.

Guo-Cheng Yuan (ed.), Computational Methods for Single-Cell Data Analysis, Methods in Molecular Biology, vol. 1935,
https://doi.org/10.1007/978-1-4939-9057-3_6, © Springer Science+Business Media, LLC, part of Springer Nature 2019

2 Materials

2.1 scRNA-Seq Datasets Resource

We use scRNA-seq data with more than 800 mouse cell clusters
from different tissues to build our scMCA reference. The datasets
used to build the reference are listed in Table 1; details of the
datasets are available via the MCA website
http://bis.zju.edu.cn/MCA/gallery.html.

2.2 Software and Algorithms

You can use the MCA website or the R package scMCA to define the
mouse cell types in your data. scMCA on the MCA website is
available at http://bis.zju.edu.cn/MCA/blast.html. We have
tested it on Firefox, Chrome, and Safari. For large digital gene
expression (DGE) data (DGE files larger than 200 MB), the R package
can be used instead; it is available on our GitHub page:
https://github.com/ggjlab/scMCA. The software and algorithms used
for scMCA are listed in Table 2.

Table 1
The datasets used for building scMCA reference

Datasets | Reference
Datasets of microwell-seq data (more than 40 tissues) | Han et al. [1]
Dataset of Arc-ME (the complex of the arcuate hypothalamus and median eminence) | Campbell et al. [2]
Dataset of pancreatic islets | Baron et al. [3]
Dataset of lung mesenchyme | Zepp et al. [4]
Dataset of whole mouse E8.25 embryos | Ibarra-Soria et al. [5]
Dataset of retina | Macosko et al. [6]

Table 2
The packages used for scMCA

Packages | Source
pheatmap | https://cran.r-project.org/web/packages/pheatmap/index.html
shiny | http://shiny.rstudio.com/
dplyr | https://cran.r-project.org/web/packages/dplyr/index.html

3 Methods

3.1 Establish the Reference with Different scRNA-Seq Datasets (See Note 1)

3.1.1 Normalization of Different scRNA-Seq Datasets

Normalization is needed before downstream analyses. The DGE is
processed by the following formula:
E = Count*100,000/sum(Count)
E: the normalized expression values;
Count: the raw UMI counts;
sum(Count): the sum of all counts in one cell.
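The normalization formula can be written as a short Python function. This is a minimal sketch of the formula as stated, applied to the counts of a single cell; the function name and the handling of an all-zero cell are assumptions.

```python
def normalize_cell(counts, scale=100000):
    """E = Count * 100,000 / sum(Count) for the counts of one cell."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)  # assumption: an empty cell stays all zero
    return [c * scale / float(total) for c in counts]


# Toy cell (hypothetical) with three genes and ten UMIs in total.
print(normalize_cell([1, 3, 6]))
```

After this step the values of every cell sum to 100,000, so cells sequenced to different depths become comparable.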

3.1.2 Define Reference Values of Cell Clusters

By default, we randomly choose 100 cells from each cell cluster (all
cells are chosen if the cluster has fewer than 100) three times to
obtain averaged gene expression values (see Note 2), and then convert
the values of each sample to integers (see Note 3). The average of the
three representative values is used as the cell cluster value in the
reference.
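The sampling and averaging procedure can be sketched in Python as follows. The sketch assumes that "taking the integer" means truncation and that sampling is done without replacement; the function name, the seed argument, and the toy cluster are hypothetical.

```python
import random


def cluster_reference(cells, n_cells=100, n_repeats=3, seed=0):
    """cells: list of per-cell expression vectors (equal length) for one cluster."""
    rng = random.Random(seed)
    n_genes = len(cells[0])
    samples = []
    for _ in range(n_repeats):
        # Use every cell when the cluster is small, otherwise draw n_cells at random.
        chosen = cells if len(cells) <= n_cells else rng.sample(cells, n_cells)
        means = [sum(cell[g] for cell in chosen) / float(len(chosen))
                 for g in range(n_genes)]
        samples.append([int(m) for m in means])  # assumption: truncate to integer
    # Average the integer-valued representative samples gene by gene.
    return [sum(s[g] for s in samples) / float(n_repeats) for g in range(n_genes)]


# Toy cluster (hypothetical) with two cells and two genes.
print(cluster_reference([[10, 0], [20, 4]]))
```

For clusters with more than 100 cells the three samples differ, so the final average smooths over the sampling variation.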

3.1.3 Feature Gene Selection

We used the integer values of the three representative averaged
datasets to perform differential gene expression analysis. The top 10
genes (avg_logFC > 1) of each cell cluster were merged to make the
reference feature gene list (3028 feature genes). The differential
gene expression analysis was done with the Wilcoxon rank sum test
using the R package Seurat.

3.1.4 Get the MCA Cell Type Reference

The representative values of more than 800 cell clusters with 3028
feature genes were used as the MCA cell type reference, and the
reference was log-transformed before calculating scMCA scores.

3.2 Using scMCA on MCA Website

3.2.1 Submit RNA-Seq DGE File to scMCA Website

After uploading the scRNA-seq or bulk RNA-seq DGE file, click the
“scMCA” button to run the scMCA pipeline (Fig. 1). The uploaded file
should be a DGE matrix in .txt or .csv format, with each row
representing a gene and each column representing a cell (or sample).
The numerical expression values can be raw counts, FPKM, or RPKM.
After submission, the DGE matrix is log-normalized (see Note 4) and
the feature gene expression matrix is extracted (see Note 5).

3.2.2 Get Defined Results from scMCA Website

Pearson correlation coefficients of the extracted expression matrix
against the MCA cell type reference are calculated, and the correlation
coefficients are used as the scMCA scores in the scMCA pipeline. The
scMCA results are shown as an interactive heatmap, in which each row
is a defined cell type, each column is a query cell, and the color of
each block indicates the strength of the correlation (Fig. 2). One can
also download the result table in csv format, which records the scMCA
scores between query cells (cells to be identified) and reference cell
types.

Fig. 1 Data submission interface of scMCA on MCA website. You can upload a DGE file in txt or csv format by
clicking the “Add files” button and run the scMCA pipeline by clicking the “scMCA” button

Fig. 2 The scMCA results on the website. The result is shown as an interactive heatmap, in which each row is
a defined cell type, each column is a query cell, and the colors of the blocks indicate the strength of the
correlations. You can also click the “result from scMCA” button to download the file that records the scMCA
scores between query cells and cell types. The uploaded example dataset of distal lung epithelium is from
Treutlein et al. [7]

Fig. 3 The scMCA results of the R package. You can choose option buttons on the left of the web interface to
adjust the results. The uploaded example dataset of distal lung epithelium is from Treutlein et al. [7]
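The scoring step can be sketched in Python: a Pearson correlation between one query cell and each reference cell type over the same feature genes, followed by ranking. This illustrates the computation described in the text and is not the scMCA source code; the function names, the treatment of constant vectors, and the toy reference are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = float(len(x))
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: treat a constant vector as uncorrelated
    return sxy / math.sqrt(sxx * syy)


def best_matches(query, reference, top_n=3):
    """reference: dict cell type -> expression vector over the same feature genes."""
    scores = {name: pearson(query, vec) for name, vec in reference.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Toy reference (hypothetical cell types and values) over four feature genes.
reference = {
    "type A": [5, 0, 1, 0],
    "type B": [0, 5, 0, 1],
    "type C": [1, 1, 5, 5],
}
print(best_matches([4, 0, 1, 0], reference, top_n=1))
```

The cell type with the highest score is reported as the closest match for the query cell, and the full score table corresponds to the downloadable csv result.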

3.3 Using R Package scMCA

3.3.1 Installation of R Package scMCA

The scMCA package is hosted on GitHub. It can be conveniently
installed via “devtools” by typing devtools::install_github("ggjlab/scMCA").

3.3.2 Usage of R Package scMCA

The scMCA package has two main functions, scMCA and scMCA_vis.
scMCA calculates the Pearson correlation coefficient between each
query cell and each cell type. scMCA_vis visualizes the result
returned by scMCA. To use scMCA, take the following steps:
1. Load the scRNA-seq or bulk RNA-seq DGE file into the R
environment.
2. Set the number of most relevant cell types for each query
cell. The corresponding parameter is “number_plot” in the
function scMCA. You can type “?scMCA” in R for more
information.
3. Execute scMCA and use scMCA_vis to view the results.
scMCA_vis opens a web page on localhost that displays the
results of scMCA (Fig. 3).
For more details, you can find the instructions for the R package
scMCA on the GitHub page.

4 Notes

1. The normalization step is unnecessary if you use the default
mouse cell atlas reference.
2. We integrated data from different sources, clustered them into
894 cell types, and determined the average expression in each
cluster for the transcriptome references. In high-throughput
scRNA-seq experiments, such as Microwell-seq, Drop-seq,
and 10x Genomics, sequencing depth is usually sacrificed;
the average number of detectable genes per cell is about
1000. After adding an averaging step, we can obtain about
10,000 genes in one cluster.
3. The number of detected genes differs between cell types.
To reduce such variation, we convert the average expression
values from 100 randomly sampled cells to integers for each
cell type.
4. Log-normalization of the RNA-seq DGE matrix is obtained by the
following formula:
E = log(Count*100,000/sum(Count) + 1)
E: the log-normalized expression values;
Count: the raw UMI counts (or FPKM or RPKM values);
sum(Count): the sum of all counts in one cell.
5. Get the feature gene expression matrix of the RNA-seq data. The
expression matrix of the 3028 feature genes is extracted from the
log-normalized RNA-seq expression matrix. If a feature gene is
not detected in the submitted data, its expression value is
considered to be zero.
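Notes 4 and 5 can be illustrated together with a short Python sketch that log-normalizes one cell and then extracts a feature vector, filling undetected feature genes with zero. The natural logarithm is assumed because the text does not state the base; the function names and gene names are hypothetical.

```python
import math


def log_normalize(counts, scale=100000):
    """E = log(Count * 100,000 / sum(Count) + 1) for one cell (dict gene -> count)."""
    total = float(sum(counts.values()))
    if total == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: math.log(c * scale / total + 1) for gene, c in counts.items()}


def feature_vector(expr, feature_genes):
    """Expression of the feature genes in order; undetected genes are set to zero."""
    return [expr.get(gene, 0.0) for gene in feature_genes]


# Toy cell (hypothetical genes); "GeneZ" is a feature gene missing from the data.
cell = log_normalize({"GeneX": 1, "GeneY": 9})
print(feature_vector(cell, ["GeneX", "GeneZ"]))
```

Fixing the order of the feature genes in this way guarantees that every query vector lines up with the reference before correlations are computed.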

References
1. Han X, Wang R, Zhou Y et al (2018) Mapping the mouse cell atlas by microwell-Seq. Cell 172:1091–107.e17
2. Campbell JN, Macosko EZ, Fenselau H et al (2017) A molecular census of arcuate hypothalamus and median eminence cell types. Nat Neurosci 20:484–496
3. Baron M, Veres A, Wolock SL et al (2016) A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. Cell Syst 3:346–60.e4
4. Zepp JA, Zacharias WJ, Frank DB et al (2017) Distinct Mesenchymal lineages and niches promote epithelial self-renewal and myofibrogenesis in the lung. Cell 170:1134–48.e10
5. Ibarra-Soria X, Jawaid W, Pijuan-Sala B et al (2018) Defining murine organogenesis at single-cell resolution reveals a role for the leukotriene pathway in regulating blood progenitor formation. Nat Cell Biol 20:127–134
6. Macosko EZ, Basu A, Satija R et al (2015) Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell 161:1202–1214
7. Treutlein B, Brownfield DG, Wu AR et al (2014) Reconstructing lineage hierarchies of the distal lung epithelium using single-cell RNA-seq. Nature 509:371–375
Chapter 7

Differential Pathway Analysis

Jean Fan

Abstract
Integrating prior knowledge of pathway-level information can enhance power and facilitate interpretation
of gene expression data analyses. Here, we provide a practical demonstration of the value of gene set or
pathway enrichment testing and extend such techniques to identify and characterize transcriptional sub-
populations from single-cell RNA-sequencing data using pathway and gene set overdispersion analysis
(PAGODA).

Key words Single cell, Pathway, Gene set enrichment analysis, Differential expression analysis,
Clustering

1 Introduction

Identifying genes that exhibit significant differences among two or
more biological states, conditions, or cell types is integral to
understanding the putative molecular bases for phenotypic variation.
Determining whether individual genes exhibit significant expres-
sion differences between conditions can be achieved using differen-
tial gene expression analysis [1]. However, when gene expression
data are noisy and biological signals are weak, testing individual
genes for differences may not provide any statistically significant
results. In particular for single-cell RNA-seq data, such a differen-
tial expression analysis is often complicated by high levels of tech-
nical noise and intrinsic biological stochasticity in the data. As such,
application of previous differential expression analysis approaches
developed for bulk RNA-seq may not always be suitable [2]. While
methods for differential expression analysis specifically tailored to
single-cell RNA-seq data have been developed [3, 4], alternatively
grouping genes into biologically-relevant modules such as path-
ways may greatly enhance statistical power and improve our ability
to identify true biological signal [5, 6]. In this chapter, we will
discuss how to take a pathway-informed approach to differential

Guo-Cheng Yuan (ed.), Computational Methods for Single-Cell Data Analysis, Methods in Molecular Biology, vol. 1935,
https://doi.org/10.1007/978-1-4939-9057-3_7, © Springer Science+Business Media, LLC, part of Springer Nature 2019


expression analysis and apply it to single-cell RNA-seq data to

identify and characterize transcriptional subpopulations.
Gene set or pathway enrichment analysis is a computational
approach that determines whether an a priori defined set of genes
such as a pathway shows statistically significant, concordant differ-
ences between two biological states. Gene set or pathway enrich-
ment analysis is particularly powerful when genes individually do
not exhibit a statistically significant difference between two
biological states, but, when grouped together, show statistically
significant concordant differences. For example, when performing
differential expression analysis, one common approach is to use a
significance cutoff to identify a limited number of the most inter-
esting genes for further research and interpretation. Gene set or
pathway enrichment analysis takes an alternative approach by focus-
ing on cumulative expression changes of multiple genes as a group,
thus shifting the focus from individual genes to groups of genes. By
looking at several genes at once, such an approach can identify gene
sets or pathways that have several genes each change a small
amount, but in a coordinated way, which may reach statistical
significance even when individual gene expression changes are
quite small and insufficiently significant.
There are many different methods to perform gene set or
pathway enrichment analysis, from hypergeometric distribution
tests [7, 8] to permutation-based approaches [5]. One popular
method, aptly named Gene Set Enrichment Analysis (GSEA) [5],
tests for enrichment using a permutation-based approach. In
GSEA, genes are first ranked, for example by a measure of each
gene’s differential expression between the two conditions.
Then the entire ranked list is used to assess how the genes of each
gene set are distributed across the ranked list by walking down the
ranked list of genes, increasing a running-sum statistic when a gene
belongs to the set, and decreasing it when the gene does not
(Fig. 1). The enrichment score is the maximum deviation from
zero encountered during the walk. The score reflects the degree
to which the genes in a gene set are overrepresented at the top or
bottom of the entire ranked list of genes. A set that is not enriched
will have its genes spread more or less uniformly through the
ranked list. An enriched set, on the other hand, will have a larger
portion of its genes at one or the other end of the ranked list. The
extent of enrichment is captured mathematically as the score statis-
tic. The statistical significance of the score can then be estimated
using permutation, whereby enrichment scores are computed for
random gene sets of the same size as the tested gene set. This
randomization is repeated many times to produce an empirical
null distribution of scores. The nominal p-value estimates the sta-
tistical significance of a single gene set’s score based on the
permutation-generated null distribution.

[Figure 1: schematic of a gene set (gene1 to gene7) located within a ranked gene universe (upregulated to downregulated genes), with the running-sum statistic plotted along the gene ranking and the score marked at its maximum deviation from zero]

Fig. 1 A standard gene set enrichment plot. Genes in the gene universe are ranked according to a differential
expression statistic from most upregulated to most downregulated. A running-sum statistic then traverses the
ranked list and increments the enrichment score statistic upon reaching a gene within the gene set of interest
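The running-sum walk and permutation test described above can be sketched in base R. This is a minimal, unweighted illustration under simplifying assumptions (fixed step sizes for hits and misses, random gene sets of equal size as the null); it is not the liger or GSEA implementation, and the names enrichment_score, stat, and gene_set are hypothetical.

```r
# Toy sketch of an unweighted running-sum enrichment score.
# Assumption: steps of +1/n_hit for set members and -1/n_miss otherwise.
enrichment_score <- function(stat, gene_set) {
  ranked <- names(sort(stat, decreasing = TRUE))  # most upregulated first
  hit <- ranked %in% gene_set
  n_hit <- sum(hit)
  n_miss <- length(ranked) - n_hit
  step <- ifelse(hit, 1 / n_hit, -1 / n_miss)
  running <- cumsum(step)
  running[which.max(abs(running))]  # signed maximum deviation from zero
}

# simulated differential expression statistic for 1000 hypothetical genes
set.seed(1)
stat <- rnorm(1000)
names(stat) <- paste0("gene", seq_along(stat))
gene_set <- paste0("gene", 1:20)
stat[gene_set] <- stat[gene_set] + 1  # small coordinated shift

es_obs <- enrichment_score(stat, gene_set)

# empirical null: scores of random gene sets of the same size
n_perm <- 1000
es_null <- replicate(
  n_perm,
  enrichment_score(stat, sample(names(stat), length(gene_set)))
)
p_nominal <- (sum(abs(es_null) >= abs(es_obs)) + 1) / (n_perm + 1)
print(c(score = es_obs, p = p_nominal))
```

Adding 1 to both the numerator and denominator of the nominal p-value keeps it strictly above zero, which is a common convention for permutation tests rather than something prescribed by the chapter.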

2 Materials

All programming will be done using the R statistical programming

language [9].

2.1 Liger R Package

In Subheading 3.1, we will perform gene set enrichment analysis using the Lightweight Iterative Gene set Enrichment in R (liger) package, an R implementation of the GSEA algorithm [5]. liger can be installed from CRAN using the following command in R:
install.packages("liger")

2.2 Scde R Package

In Subheadings 3.2 and 3.3, we will perform pathway and gene set overdispersion analysis (PAGODA) using the Single Cell Differential Expression (scde) package. scde can be installed from Bioconductor using the following command in R:
# try http:// if https:// URLs are not supported
# note: on R >= 3.5, biocLite() has been superseded by
# BiocManager::install("scde") (after install.packages("BiocManager"))
source("https://bioconductor.org/biocLite.R")
biocLite("scde")

3 Methods

3.1 Enhancing Statistical Power by Incorporating Pathway-Level Information

To demonstrate the utility of gene set or pathway enrichment analysis, we will use a simulated dataset. Specifically, we will simulate a weak differential expression within a known gene set between two biological samples. We will show that while differential expression analysis is not able to pick up these genes as significantly differentially expressed, a gene set enrichment analysis will be able to pick up significant enrichment.
1. First, we will load the liger package.

library(liger)
2. Load a gene set based on Gene Ontology (GO) terms.
# load gene set
data("org.Hs.GO2Symbol.list")

We can inspect the newly loaded org.Hs.GO2Symbol.list object. It is a list named by GO IDs; each element contains the HUGO symbols of the human genes within that gene set. Alternative gene sets such as MSigDB [10] or KEGG, or even custom gene sets, can be formatted in the same manner and used instead.
head(org.Hs.GO2Symbol.list)
## $`GO:0000002`
## [1] "AKT3" "C10orf2" "DNA2" "LIG3" "MEF2A" "MGME1"
## [7] "MPV17" "OPA1" "PID1" "PRIMPOL" "SLC25A33" "SLC25A36"
## [13] "SLC25A4" "STOML2" "TYMP"
...
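As a minimal sketch of the format described above, a custom collection of gene sets can be written as a named list of character vectors. The set names below are hypothetical, and the gene symbols are simply borrowed from the GO:0000002 output shown above for illustration.

```r
# Hypothetical custom gene sets in the same named-list format
custom.gs.list <- list(
  "CUSTOM:setA" = c("AKT3", "DNA2", "LIG3", "OPA1"),
  "CUSTOM:setB" = c("MPV17", "TYMP", "STOML2")
)
# gene universe implied by these sets
custom.universe <- unique(unlist(custom.gs.list))
str(custom.gs.list)
```

A list built this way could presumably be passed wherever org.Hs.GO2Symbol.list is used in this chapter, provided the symbols match the row names of the expression matrix.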

3. To simulate a weak differential expression within a known gene set between two biological samples, we will first simulate random gene expression for 100 cells. We will create a matrix containing all genes and simulate gene expression by drawing from a normal distribution with mean = 0 and sd = 3. Alternatively, your own normalized single-cell expression data can be substituted in at this step.
# set random seed to ensure reproducibility
set.seed(0)
# get universe of genes
universe <- unique(unlist(org.Hs.GO2Symbol.list))
# make random data
Nsamples <- 100 # 100 cells
Mgenes <- length(universe)
mat <- matrix(rnorm(Mgenes*Nsamples, mean=0, sd=3), Mgenes, Nsamples)
rownames(mat) <- universe

Next, we will pick a gene set.

# get genes in gene set GO:0000002

gs <- org.Hs.GO2Symbol.list[["GO:0000002"]]
# genes
print(gs)
## [1] "AKT3" "C10orf2" "DNA2" "LIG3" "MEF2A" "MGME1"
## [7] "MPV17" "OPA1" "PID1" "PRIMPOL" "SLC25A33" "SLC25A36"
## [13] "SLC25A4" "STOML2" "TYMP"
We will split our 100 cells into 2 groups to represent two
biologically different states.

# two biological states (groups)

group <- factor(c(rep(1, Nsamples/2), rep(2, Nsamples/2)))
names(group) <- colnames(mat) <- paste0('sample', 1:Nsamples)

Now, we will simulate upregulation of genes from our selected gene set in cells belonging to group 1. For these cells, expression of the gene set genes, rather than being drawn from a normal distribution with mean = 0 and sd = 3, will instead be drawn with an increased mean = 2.25 and sd = 5. We will also set negative values to zero to keep simulated expression values interpretable.

# simulate upregulation of gene set in group 1

mat[gs, group==1] <- rnorm(length(gs)*sum(group==1), mean=2.25, sd=5)
# make more realistic; can't have negative gene expression
mat[mat < 0] <- 0

4. Now we can visualize the expression of our simulated upregu-

lated genes along with 50 other non-upregulated genes using a
heatmap (Fig. 2). We will also color the column side bar of our
heatmap using the cell group labels, with group 1 cells labeled
in red, and group 2 cells labeled in blue. Similarly, we will color
the row side bar in green if the gene is within our selected gene
set and black if not.
# we can visualize this weak differential expression in a heatmap
# visualize weakly differentially expressed genes and another 50 genes
# (setdiff avoids repeating gene set members, which come first in universe)
vi <- c(gs, head(setdiff(universe, gs), 50))
# label supposedly differentially expressed genes
heatmap(mat[vi,], Rowv=NA, Colv=NA, scale="none",
        col=colorRampPalette(c("blue", "white", "red"))(100),
        RowSideColors = c('black', 'green')[as.factor(vi %in% gs)],
        ColSideColors = c('red', 'blue')[group])

[Figure 2: heatmap with genes as rows and cells as columns; color key for expression magnitude runs from lower through average to higher]

Fig. 2 Gene expression heatmap for select simulated genes. Rows are genes and columns are cells. Gene
expression is colored using a color ramp from blue to white to red, with highly expressed genes colored in red
and lowly expressed genes in blue. Column side bar is colored using the cell group labels, with group 1 cells
labeled in red, and group 2 cells labeled in blue. Row side bar is colored in green if the gene is within our
selected gene set and black if not

Although we simulated the genes marked in green to be upregulated in the group 1 cells (red) compared to the group 2 cells (blue), it is difficult to tell visually which genes are differentially expressed.
5. We can also quantify the extent of the differential expression
between our two biological states using a t-test.
# run differential expression analysis using simple t-test
vals.info <- lapply(1:nrow(mat), function(i) {
pv <- t.test(
mat[i, group==1],
mat[i, group==2]
)
return(list(val=pv$statistic, p=pv$p.value))
})
vals <- unlist(lapply(vals.info, function(x) x$val))
p <- unlist(lapply(vals.info, function(x) x$p))
names(p) <- names(vals) <- rownames(mat)

6. Because we are testing many genes, we have to apply multiple-

testing correction. We will use a Bonferroni correction [11].

[Figure 3: barplot of -log10(p-value) (y-axis, 0.0 to 3.0) for genes DNA2, OPA1, STOML2, MGME1, AKT3, C10orf2, LIG3, MEF2A, MPV17, PID1, PRIMPOL, SLC25A33, SLC25A36, SLC25A4, and TYMP]
Fig. 3 Differential expression analysis results for select simulated genes. Barplot shows -log10(p-value) for each gene. Red line shows the p = 0.05 significance threshold. Note that none of the tested genes passes the significance threshold

p.adj <- p.adjust(p, method="bonferroni") # multiple-testing correction

names(p.adj) <- rownames(mat)
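To make the correction concrete, the sketch below applies the Bonferroni adjustment to a handful of hypothetical p-values and checks it against the manual calculation, in which each p-value is multiplied by the number of tests and capped at 1. The values are made up for illustration.

```r
# Hypothetical raw p-values for five tests
p.raw <- c(0.001, 0.004, 0.03, 0.2, 0.5)
p.bonf <- p.adjust(p.raw, method = "bonferroni")
# equivalent by hand: multiply by the number of tests and cap at 1
p.manual <- pmin(1, p.raw * length(p.raw))
print(rbind(p.raw, p.bonf, p.manual))
```

Because the multiplier is the number of tests, the same raw p-value becomes harder to call significant as more genes are tested, which is part of why weak per-gene signals can fail to pass the threshold.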

7. We can now visualize the final -log10(p-values) using a barplot (Fig. 3). We will use a red line to indicate the common p < 0.05 significance threshold. Significant genes should have bars that pass the red line.

barplot(sort(-log10(p.adj[gs]), decreasing=TRUE), ylim=c(0, 3), las=2)

abline(h = -log10(0.05), col="red")
Unfortunately, none of the genes, including those we
simulated to be differentially expressed, were actually picked up as
significantly differentially expressed after multiple-testing correc-
tion (with corrected p-values <0.05). In a real-world situation, we
may be tempted to end our analysis here and conclude that since
nothing is significantly differentially expressed between the two
biological states there is no significant difference.
However, we can still perform gene set or pathway enrichment
analysis on a priori defined gene sets to look for statistically signifi-
cant concordant differences.
8. We will perform such analyses using liger for 10 Gene Ontol-
ogy gene sets in org.Hs.GO2Symbol.list, including
GO:0000002.

# run iterative bulk gsea on our true gene set and 9 other gene sets as test
gseaVals <- iterative.bulk.gsea(
  values = vals,
  set.list = org.Hs.GO2Symbol.list[1:10],
  rank=TRUE)
## initial: [1e+02 - 3] [1e+03 - 1] [1e+04 - 1] done
print(gseaVals)
## p.val q.val sscore edge
## GO:0000002 0.00009999 0.00059994 2.5584741 2.0848842
## GO:0000003 0.66336634 0.66336634 0.4924230 0.3374948
## GO:0000012 0.11888112 0.25774226 -0.9737758 -0.1256842
## GO:0000014 0.24752475 0.36831683 0.6518057 -0.5915193
## GO:0000018 0.30693069 0.36831683 -0.7279604 0.9366813
## GO:0000022 0.12887113 0.25774226 0.9455950 -0.4223886

9. We can then identify significantly enriched gene sets as those

with a q-value <0.05.

# identify significantly enriched gene sets

gseaSig <- rownames(gseaVals[gseaVals$q.val < 0.05,])
print(gseaSig)
## [1] "GO:0000002"

Indeed, we recover GO:0000002 as a significantly enriched

gene set!
10. We can visualize a standard gene set enrichment plot for this
gene set (Fig. 4).

# look at plots
for(i in seq_along(gseaSig)) {
gs <- org.Hs.GO2Symbol.list[[gseaSig[i]]]
gsea(values=vals, geneset=gs, mc.cores=1, plot=TRUE, rank=TRUE)
}
So, although no individual gene was found to be statistically
significantly differentially expressed between our two biological
states, gene set and pathway enrichment analysis identified a signifi-
cantly enriched gene set, GO:0000002, which is exactly the gene
set that we simulated to show concordant differences. By looking
for coordinated changes in genes within these a priori defined gene
sets, we are able to increase our statistical power to identify differ-
ences between our two biological states.

[Figure 4: two-panel enrichment plot. Top panel: running score, annotated "P value < 1e-04". Bottom panel: ranked values, annotated "edge value = 2.1"]

Fig. 4 Gene set enrichment plot for gene set GO:0000002 demonstrates significant enrichment as simulated

3.2 Applying a Pathway-Integrated Approach with Pathway and Gene Set Overdispersion Analysis

Gene set testing with methods such as liger can be used for differential expression analysis to increase statistical power and uncover likely functional interpretations. However, such testing requires knowledge of biological conditions or subpopulations for comparison. To identify these transcriptionally distinct subpopulations, a similar rationale can be applied in single-cell RNA-seq data analysis. Highly variable genes may partition cells into transcriptionally distinct subpopulations but carry considerable uncertainty, as observed variability in gene expression may be the result of technical artifacts such as drop-outs. Yet whereas variability in the expression of a single gene may be noisy, coordinated upregulation of many genes within a gene set or pathway in the same subset of cells could provide a prominent signature to distinguish subpopulations.
Pathway And Gene set Over-Dispersion Analysis (PAGODA)
[6] looks for coordinated expression variability of genes in both
annotated pathways and automatically detected “de novo” gene
sets. PAGODA then uses this gene set and pathway-level informa-
tion to cluster cells into transcriptional subpopulations.
Briefly, PAGODA first estimates the effective sequencing
depth, drop-out rate, and amplification noise of each cell using a
previously described mixture-model approach with minor enhance-
ments. Using these models, the observed expression variance of
each gene is renormalized on the basis of the expected genome-

wide variance at the appropriate expression magnitude. PAGODA

then examines an extensive panel of gene sets to identify those
showing a statistically significant excess of coordinated variability.
Gene sets can include annotated pathways, such as Gene Ontology
(GO) categories, as well as clusters of transcriptionally correlated
genes found in a given data set (“de novo” gene sets). The prevalent
transcriptional signature of each gene set is captured by its first
principal component (PC), with weighted PCA used to adjust for
technical noise. If the amount of variance explained by the first PC of a given gene set is significantly higher than expected, the gene set is considered to be “overdispersed.” PCs from the resulting significantly overdispersed gene sets are combined to form a single “aspect” of heterogeneity to provide a nonredundant view of transcriptional heterogeneity to users through an interactive web browser interface.
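The idea of testing whether the first PC of a gene set explains more variance than expected can be illustrated with a simplified base R sketch. It makes two loud assumptions that differ from PAGODA: it uses ordinary PCA (prcomp) instead of weighted PCA, and it uses random gene sets of the same size as the background instead of PAGODA's genome-wide variance model. All object names are hypothetical.

```r
# Simplified stand-in for an overdispersion test (not the scde implementation)
set.seed(0)
n_cells <- 60
n_genes <- 500
expr <- matrix(
  rnorm(n_genes * n_cells), n_genes, n_cells,
  dimnames = list(paste0("g", seq_len(n_genes)),
                  paste0("cell", seq_len(n_cells)))
)
gene_set <- paste0("g", 1:15)
# coordinated shift of the gene set in the first half of the cells
expr[gene_set, 1:30] <- expr[gene_set, 1:30] + 1

# variance captured by the first principal component of a genes x cells matrix
pc1_variance <- function(m) {
  pca <- prcomp(t(m), center = TRUE, scale. = FALSE)
  pca$sdev[1]^2
}

obs <- pc1_variance(expr[gene_set, ])
# background: PC1 variance of random gene sets of the same size
null <- replicate(
  200,
  pc1_variance(expr[sample(rownames(expr), length(gene_set)), ])
)
z <- (obs - mean(null)) / sd(null)
print(c(observed = obs, expected = mean(null), z = z))
```

A gene set whose members shift together in a subset of cells concentrates variance along one direction, so its first PC stands out against the background even when each gene's individual shift is modest.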
To demonstrate the utility of PAGODA, we will continue our exploration of our simulated dataset. Note, to run PAGODA using your own single-cell RNA-seq data, see Subheading 3.3 for step-by-step instructions on how to go from gene expression counts to the variance-normalized gene expression matrix that serves as input to the pathway overdispersion testing step.
1. We will first show how unbiased hierarchical clustering on our
simulated raw data fails to cluster our true groups together
(Fig. 5).

Fig. 5 Gene expression heatmap with cells grouped by hierarchical clustering shows inconsistency with cell
group labels. Rows are genes and columns are cells. Gene expression is colored using a color ramp from blue
to white to red, with highly expressed genes colored in red and lowly expressed genes in blue. Column side bar
is colored using the cell group labels, with group 1 cells labeled in red, and group 2 cells labeled in blue. Row
side bar is colored in green if the gene is within our selected gene set and black if not

# just cluster by all genes

hc <- hclust(dist(t(mat)))
heatmap(mat[vi,], Rowv=NA, Colv=as.dendrogram(hc), scale="none",
col=colorRampPalette(c("blue", "white", "red"))(100),
RowSideColors = c('black', 'green')[as.factor(vi %in% gs)],
ColSideColors = c('red', 'blue')[group])

The cells are ordered by unbiased hierarchical clustering and we

do not see any segregation of our two cell group labels. However,
we can integrate pathway-level information to enhance our signal
and enable proper separation of our two simulated cell groups.
2. To run PAGODA, we will load the scde package and convert our simulated data into the appropriate format. Note, to run PAGODA using your own single-cell RNA-seq data, additional functions are available for error-modeling and normalization from gene expression counts (see Subheading 3.3).

library(scde)

# format data to pipe into PAGODA

varinfo <- list()
varinfo$mat <- mat
matw <- matrix(1, nrow(mat), ncol(mat)) # equal weighting
rownames(matw) <- rownames(mat)
colnames(matw) <- colnames(mat)
varinfo$matw <- matw

3. We will compute PCs for a set of 10 pathways and cluster based

on such pathway-level expression.

go.env <- list2env(org.Hs.GO2Symbol.list[1:10]) # just use first 10 pathways
# test pathways for overdispersion
pwpca <- pagoda.pathway.wPCA(varinfo, go.env, n.components = 1, n.cores = 1)
df <- pagoda.top.aspects(pwpca, return.table = TRUE, plot = FALSE,
                         z.score = 1.96)
head(df)
## name npc n score z adj.z sh.z adj.sh.z
## 1 GO:0000002 1 15 212.82838 152.43638 152.42917 NA NA
## 3 GO:0000018 1 34 31.30615 50.76668 50.75870 NA NA
## 2 GO:0000003 1 300 10.29349 50.41152 50.41152 NA NA

tam <- pagoda.top.aspects(pwpca, z.score = qnorm(0.01/2, lower.tail = FALSE))

[Figure 6: heatmap with rows labeled #PC1# GO:0000018, #PC1# GO:0000003, and #PC1# GO:0000002]

Fig. 6 Pathway expression heatmap with cells grouped by hierarchical clustering shows consistency with cell
group labels. Rows are pathways and columns are cells. Pathway expression, summarized by the first
principal component (PC1) of gene expressions for genes within the pathway, is colored using a color ramp
from blue to white to red. Column side bar is colored using the cell group labels, with group 1 cells labeled in
red, and group 2 cells labeled in blue

4. Now, we can cluster our cells based on their pathway-level

expression patterns using hierarchical clustering and visualize
the results as a heatmap (Fig. 6).

# unbiased clustering on pathway information

hc2 <- hclust(dist(t(tam$xv)))
heatmap(tam$xv, Rowv=NA, Colv=as.dendrogram(hc2), scale="none",
col=colorRampPalette(c("blue", "white", "red"))(100),
ColSideColors = c('red', 'blue')[group], mar=c(5,15))

And indeed, we can see that the pathway-integrated clustering

better separates our two simulated groups.
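Beyond visual inspection, agreement between derived clusters and known labels can be quantified with a cross-tabulation. The sketch below is self-contained and uses a hypothetical one-dimensional pathway-level score for 40 toy cells; in the workflow above, the same table() call could presumably be applied to cutree(hc2, 2) and group.

```r
# Toy data: 40 cells in two known groups, separated along one pathway-level score
set.seed(2)
group <- factor(rep(c(1, 2), each = 20))
pathway.score <- c(rnorm(20, mean = 2), rnorm(20, mean = -2))

# hierarchical clustering on the pathway-level score, cut into two clusters
hc.toy <- hclust(dist(pathway.score))
cluster <- cutree(hc.toy, k = 2)

# rows: derived clusters; columns: known groups
tab <- table(cluster, group)
print(tab)

# fraction of cells that fall in the majority group of their cluster
purity <- sum(apply(tab, 1, max)) / sum(tab)
print(purity)
```

A purity near 1 indicates that the clusters recover the known groups, whereas a value near 0.5 indicates no better than chance for two equally sized groups.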

3.3 Pathway and Gene Set Overdispersion Analysis with Single-Cell RNA-Seq Data

For a more realistic demonstration, we will analyze single-cell RNA-seq data from Pollen et al. [12]. The error models PAGODA uses are based on count-based processes, and therefore the input data will be a matrix of read counts.
1. We can load the read count table and cell group annotations using the data("pollen") call. The columns are cells and the rows are genes. Some additional filters are also applied to remove poor cells and non-detected genes. Your own single-cell RNA-seq data can be substituted at this step as well.

library(scde)
data(pollen)
# remove poor cells and genes
cd <- clean.counts(pollen)
# check the final dimensions of the read count matrix
dim(cd)
## [1] 11310 64

For visualizations later, we will translate group and sample

source data from the original publication [12] into color codes.

# extract the group label between "Hi_" and the last underscore
x <- gsub("^Hi_(.*)_.*", "\\1", colnames(cd))
l2cols <- c("coral4", "olivedrab3", "skyblue2", "slateblue3")[
  as.integer(factor(x, levels = c("NPC", "GW16", "GW21", "GW21+3")))]
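To see how group labels are translated into colors, the sketch below runs the same kind of pattern match on a few hypothetical cell names. It assumes column names of the form Hi_<group>_<index>; the example names are invented and the actual names in the pollen dataset should be checked with colnames(cd).

```r
# Hypothetical column names following the assumed "Hi_<group>_<index>" pattern
cell.names <- c("Hi_NPC_1", "Hi_GW16_12", "Hi_GW21+3_7", "Hi_GW21_3")

# greedy capture of everything between "Hi_" and the last underscore
grp <- gsub("^Hi_(.*)_.*", "\\1", cell.names)
print(grp)

# map each group label to a color through its factor level index
cols <- c("coral4", "olivedrab3", "skyblue2", "slateblue3")[
  as.integer(factor(grp, levels = c("NPC", "GW16", "GW21", "GW21+3")))
]
print(cols)
```

Any cell whose extracted label is not among the listed levels would receive NA as its color, so inspecting table(grp) before plotting is a reasonable sanity check.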

2. Next, we’ll construct error models for individual cells. Here, we use a k-nearest neighbor model fitting procedure implemented by the knn.error.models() method. This is a relatively noisy dataset (non-UMI), so we raise the min.count.threshold to 2 (minimum number of reads for the gene to be initially classified as a non-failed measurement), requiring at least 5 non-failed measurements per gene. We’re providing a rough guess to the complexity of the population, by fitting the error models based on 1/4 of most similar cells (i.e., guessing there might be ~4 subpopulations). Note, this step takes a considerable amount of time unless multiple cores are used. We highly recommend use of multiple cores. You can check the number of available cores using detectCores() from the parallel package.

knn <- knn.error.models(cd, k = ncol(cd)/4, n.cores = 1,

min.count.threshold = 2, min.nonfailed = 5, max.model.plots = 10)
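One way to choose a value for the n.cores argument is sketched below. Leaving one core free is a common convention rather than a requirement of scde, and the variable names are hypothetical.

```r
library(parallel)

n.avail <- detectCores()          # may return NA on some platforms
if (is.na(n.avail)) n.avail <- 1  # fall back to a single core
# leave one core free for the rest of the system
n.cores <- max(1, n.avail - 1)
print(n.cores)
```

The resulting value could then be supplied as n.cores in place of 1 in the calls shown in this section.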

3. In order to accurately quantify excess variance or overdisper-

sion, we must normalize out expected levels of technical and
intrinsic biological noise. Briefly, variance of the NB/Poisson
mixture processes derived from the error modeling step is
modeled as a chi-squared distribution using adjusted degrees
of freedom and observation weights based on the drop-out
probability of a given gene. We will normalize variance,
trimming 3 most extreme cells and limiting maximum adjusted
variance to 5.

varinfo <- pagoda.varnorm(knn, counts = cd, trim = 3/ncol(cd),

max.adj.var = 5, n.cores = 1, plot = TRUE)

4. Even with all the corrections, sequencing depth or gene cover-

age is typically still a major aspect of variability. In most studies,
we would want to control for that as a technical artifact (excep-
tions are cell mixtures where subtypes significantly differ in the
amount of total mRNA). We will control for the gene coverage
(estimated as a number of genes with nonzero magnitude per
cell) by normalizing out that aspect of cell heterogeneity:

varinfo <- pagoda.subtract.aspect(varinfo, colSums(cd[,

rownames(knn)]>0))
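The gene coverage estimate used above is simply the number of genes with a nonzero count in each cell. A minimal sketch on a hypothetical toy count matrix:

```r
# Toy read-count matrix: 4 genes (rows) x 3 cells (columns); values are hypothetical
counts.toy <- matrix(
  c(0, 5, 2,
    3, 0, 0,
    1, 1, 0,
    0, 7, 4),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("cell", 1:3))
)

# number of genes detected (nonzero count) in each cell
coverage <- colSums(counts.toy > 0)
print(coverage)
```

Comparing counts.toy > 0 first converts the counts to a logical matrix, so colSums counts detected genes rather than summing reads.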

5. As mentioned previously, in order to detect significant aspects

of heterogeneity across the population of single cells,
PAGODA identifies pathways and gene sets that exhibit statis-
tically significant excess of coordinated variability. Specifically,
for each gene set, we will test whether the amount of variance
explained by the first principal component significantly exceeds
the background expectation. We can test both predefined gene
sets as well as “de novo” gene sets whose expression profiles are
well correlated within the given dataset.
For predefined gene sets, we’ll use the GO annotations we
previously loaded from liger.
# in case you didn't load it previously, load it now
library(liger)
data("org.Hs.GO2Symbol.list")
go.env <- org.Hs.GO2Symbol.list
# remove GOs with too few or too many genes
go.env <- clean.gos(go.env)
# convert to an environment
go.env <- list2env(go.env)
Now, we can calculate weighted first principal component
magnitudes for each GO gene set in the provided environment
and evaluate the statistical significance of their overdispersion.
pwpca <- pagoda.pathway.wPCA(varinfo, go.env, n.components = 1, n.cores =
1)
df <- pagoda.top.aspects(pwpca, return.table = TRUE, plot = FALSE,
z.score = 1.96)
head(df)
## name npc n score z adj.z
## 339 GO:0003179 1 10 3.495767 11.108780 10.760218
## 338 GO:0003170 1 10 3.495767 11.108780 10.760218
## 3570 GO:0060563 1 12 3.220725 10.643172 10.297292
## 1829 GO:0030426 1 39 3.134488 14.644926 14.338584
## 1302 GO:0014009 1 10 3.105600 9.656705 9.307366
## 1830 GO:0030427 1 40 3.093050 14.530866 14.223476

• The “z” column gives the Z-score of pathway overdispersion relative to the genome-wide model (a Z-score of 1.96 corresponds to a P-value of 5%, etc.).
• The “adj.z” column shows the Z-score adjusted for multiple hypothesis testing (using Benjamini-Hochberg correction).
• “score” gives the observed/expected variance ratio.
• The “sh.z” and “adj.sh.z” columns give the raw and adjusted Z-scores of “pathway cohesion,” which compares the observed PC1 magnitude to the magnitudes obtained when the observations for each gene are randomized with respect to cells. When such a Z-score is high (e.g., for GO:0008009), multiple genes within the pathway contribute to the coordinated pattern.
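The correspondence between Z-scores and P-values mentioned in the list above can be checked directly in base R. The sketch assumes two-sided P-values, consistent with the qnorm(0.01/2, lower.tail = FALSE) threshold used in this chapter; how scde derives adj.z internally is not shown here, so the Benjamini-Hochberg step below is only an illustration on hypothetical Z-scores.

```r
# two-sided P-value thresholds and the matching Z-score cutoffs
z.05 <- qnorm(0.05 / 2, lower.tail = FALSE)  # about 1.96
z.01 <- qnorm(0.01 / 2, lower.tail = FALSE)  # about 2.58
print(round(c(z.05 = z.05, z.01 = z.01), 3))

# converting hypothetical Z-scores back to two-sided P-values
z.toy <- c(1.96, 3, 5)
p.toy <- 2 * pnorm(z.toy, lower.tail = FALSE)

# Benjamini-Hochberg adjustment across the tested gene sets
p.toy.bh <- p.adjust(p.toy, method = "BH")
print(signif(rbind(p.toy, p.toy.bh), 3))
```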
6. We can also test “de novo” gene sets whose expression profiles
are well correlated within the given dataset. The following
procedure will determine “de novo” gene clusters in the data,
and build a background model for the expectation of the gene
cluster weighted principal component magnitudes. Note the
higher trim values for the clusters, as we want to avoid clusters
that are formed by outlier cells.

clpca <- pagoda.gene.clusters(varinfo, trim = 7.1/ncol(varinfo$mat),

n.clusters = 50, n.cores = 1, plot = FALSE)

7. Now the set of top aspects can be recalculated taking these de

novo gene clusters into account.

df <- pagoda.top.aspects(pwpca, clpca, return.table = TRUE, plot = FALSE,

z.score = 1.96)
head(df)
## name npc n score z adj.z
## 339 GO:0003179 1 10 3.495767 11.108780 10.760218
## 338 GO:0003170 1 10 3.495767 11.108780 10.760218
## 4334 geneCluster.8 1 307 3.397680 13.114746 12.814767
## 3570 GO:0060563 1 12 3.220725 10.643172 10.297292
## 1829 GO:0030426 1 39 3.134488 14.644926 14.338584
## 1302 GO:0014009 1 10 3.105600 9.656705 9.307366

8. To view top aspects of transcriptional heterogeneity, we will

first obtain information on all the significant aspects. We will
also determine the overall cell clustering based on this full
pathway-level information.

# get full info on the top aspects

tam <- pagoda.top.aspects(pwpca, clpca, n.cells = NULL, z.score =
qnorm(0.01/2, lower.tail = FALSE))
# determine overall cell clustering
hc <- pagoda.cluster.cells(tam, varinfo)

9. We can then reduce redundant aspects in two steps. In the first

step, we will combine pathways that are driven by the same sets
of genes. In the second step we will combine aspects that show
similar patterns (i.e., separate the same sets of cells).

tamr <- pagoda.reduce.loading.redundancy(tam, pwpca, clpca)

tamr2 <- pagoda.reduce.redundancy(tamr, distance.threshold = 0.9, trim =
0, plot = FALSE)
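The second reduction step, combining aspects that separate the same sets of cells, can be illustrated with a simplified base R sketch. It assumes correlation distance and average-linkage clustering with an arbitrary cutoff; this is not the scde implementation, and the cutoff of 0.1 used here is not equivalent to the distance.threshold argument above.

```r
# Toy aspect-by-cell score matrix: aspects A and B share a pattern, C differs
set.seed(3)
base1 <- rnorm(30)
base2 <- rnorm(30)
aspects <- rbind(
  aspectA = base1 + rnorm(30, sd = 0.1),
  aspectB = base1 + rnorm(30, sd = 0.1),
  aspectC = base2
)

# correlation distance between aspect patterns
d <- as.dist(1 - cor(t(aspects)))
# merge aspects closer than a chosen distance cutoff (0.1 here is arbitrary)
grp <- cutree(hclust(d, method = "average"), h = 0.1)
print(grp)

# represent each merged group by the mean of its member patterns
reduced <- rowsum(aspects, grp) / as.vector(table(grp))
print(dim(reduced))
```

Aspects that end up in the same group carry largely the same information about how cells are separated, so summarizing them by a single pattern gives a less redundant view.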

10. We can then view these top aspects in a heatmap (Fig. 7). Indeed, we see a correspondence between our derived cell annotations and the previously published annotations.

Fig. 7 Pathway expression heatmap for single-cell RNA-seq data from Pollen
et al. The columns are cells and the rows represent a cluster of pathways. The
row names are assigned to be the top overdispersed aspect in each cluster. The
green-to-orange color scheme shows low-to-high weighted PC scores (aspect
patterns), where generally orange indicates higher expression and green lower
expression. The column colors are cell annotations from the original publication

Fig. 8 Sample screenshot of an interactive PAGODA app

col.cols <- rbind(groups = cutree(hc, 3), l2cols)

pagoda.view.aspects(tamr2, cell.clustering = hc, box = TRUE, labCol = NA,
margins = c(0.5, 20), col.cols = rbind(col.cols))

11. To interactively browse and explore the output, we can also

create a PAGODA app (Fig. 8).

# compile a browsable app, showing top three clusters with the top color bar
app <- make.pagoda.app(tamr2, tam, varinfo, go.env, pwpca, clpca,
                       col.cols = col.cols, cell.clustering = hc, title = "NPCs")
# show app in the browser (port 1468)
show.app(app, "pollen", browse = TRUE, port = 1468)

The PAGODA app allows you to view the gene sets grouped
within each aspect (row), as well as genes underlying the detected
heterogeneity patterns. In this manner, you can interactively
explore the pathways and genes driving each identified transcrip-
tional subpopulation.

Acknowledgment

This work was supported by NIH grant F99CA222750.


References
1. Soneson C, Delorenzi M (2013) A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics 14:91. https://doi.org/10.1186/1471-2105-14-91
2. Jaakkola MK, Seyednasrollah F, Mehmood A, Elo LL (2016) Comparison of methods to detect differentially expressed genes between single-cell populations. Brief Bioinform 18:bbw057. https://doi.org/10.1093/bib/bbw057
3. Kharchenko PV, Silberstein L, Scadden DT (2014) Bayesian approach to single-cell differential expression analysis. Nat Methods 11:740–742. https://doi.org/10.1038/nmeth.2967
4. Finak G, McDavid A, Yajima M, Deng J, Gersuk V, Shalek AK, Slichter CK, Miller HW, McElrath MJ, Prlic M, Linsley PS, Gottardo R (2015) MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biol 16:278. https://doi.org/10.1186/s13059-015-0844-5
5. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 102:15545–15550. https://doi.org/10.1073/pnas.0506580102
6. Fan J, Salathia N, Liu R, Kaeser GE, Yung YC, Herman JL, Kaper F, Fan J-B, Zhang K, Chun J, Kharchenko PV (2016) Characterizing transcriptional heterogeneity through pathway and gene set overdispersion analysis. Nat Methods 13:241–244. https://doi.org/10.1038/nmeth.3734
7. Wagner F (2016) The XL-mHG test for enrichment: algorithms, bounds, and power. https://doi.org/10.7287/peerj.preprints.1962v1
8. Huang DW, Sherman BT, Lempicki RA (2009) Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res 37:1–13. https://doi.org/10.1093/nar/gkn923
9. R Core Team (2017) R: a language and environment for statistical computing
10. Liberzon A, Birger C, Thorvaldsdóttir H, Ghandi M, Mesirov JP, Tamayo P (2015) The molecular signatures database (MSigDB) hallmark gene set collection. Cell Syst 1:417–425. https://doi.org/10.1016/j.cels.2015.12.004
11. Dunnett CW (1955) A multiple comparison procedure for comparing several treatments with a control. J Am Stat Assoc 50:1096–1121. https://doi.org/10.1080/01621459.1955.10501294
12. Pollen AA, Nowakowski TJ, Shuga J, Wang X, Leyrat AA, Lui JH, Li N, Szpankowski L, Fowler B, Chen P, Ramalingam N, Sun G, Thu M, Norris M, Lebofsky R, Toppani D, Kemp DW, Wong M, Clerkson B, Jones BN, Wu S, Knutsson L, Alvarado B, Wang J, Weaver LS, May AP, Jones RC, Unger MA, Kriegstein AR, West JAA (2014) Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex. Nat Biotechnol 32:1053–1058. https://doi.org/10.1038/nbt.2967
Chapter 8

Pseudotime Reconstruction Using TSCAN

Zhicheng Ji and Hongkai Ji

Abstract
In many single-cell RNA-seq (scRNA-seq) experiments, cells represent progressively changing states along
a continuous biological process. A useful approach to analyzing data from such experiments is to computa-
tionally order cells based on their gradual transition of gene expression. The ordered cells can be viewed as
samples drawn from a pseudo-temporal trajectory. Analyzing gene expression dynamics along the pseudo-
time provides a valuable tool for reconstructing the underlying biological process and generating biological
insights. TSCAN is an R package to support in silico reconstruction of cells’ pseudotime. This chapter
introduces how to apply TSCAN to scRNA-seq data to perform pseudotime analysis.

Key words Single-cell RNA-seq, Gene expression, Pseudotime, Minimum spanning tree, Genomics,
Bioinformatics

1 Introduction

Single-cell RNA-seq (scRNA-seq) offers unprecedented power for
analyzing cells’ distinct transcriptomic profiles in a heterogeneous
cell population [1–3]. In many studies, a biological sample consists
of cells from different stages of a biological process. For example,
during cell differentiation, cells may differentiate at different speeds
and they can enter different developmental lineages. As a result,
when a sample is obtained at a particular time point of a differenti-
ation process, the sample may contain cells representing different
developmental stages and lineages. A useful approach to utilizing
scRNA-seq data obtained from such a biological sample is to com-
putationally place cells onto a pseudo-temporal trajectory based on
their progressive changes of gene expression. Such a pseudo-
temporal trajectory often reflects the underlying biological process
from which the cells are sampled. Ordering cells along this pseudo-
temporal trajectory and analyzing cells’ transcriptomic changes
along the pseudotime therefore could yield new insights into the
dynamic gene expression and regulation programs of a biological
process. This approach, also known as “pseudotime analysis,” has

Guo-Cheng Yuan (ed.), Computational Methods for Single-Cell Data Analysis, Methods in Molecular Biology, vol. 1935,
https://doi.org/10.1007/978-1-4939-9057-3_8, © Springer Science+Business Media, LLC, part of Springer Nature 2019

been emerging as a powerful tool for studying cell differentiation,
immune response, tumor progression, and many other biological
processes [1, 4–7].
A pioneering work that demonstrates the value of pseudotime
analysis in scRNA-seq experiments is [1]. In that study, a computa-
tional algorithm Monocle was proposed to construct pseudo-time
using minimum spanning tree (MST). Since then, a number of
pseudotime reconstruction methods have been developed. Com-
prehensive reviews of such methods can be found in [8, 9]. This
chapter introduces one such method, TSCAN [6], which we developed
to support pseudotime analysis. In several recent benchmark
studies, TSCAN has shown favorable performance compared
to other available methods [8, 10].
Similar to Monocle [1], TSCAN uses MST to construct
pseudo-temporal trajectory. However, instead of treating each cell
as a tree node, TSCAN first groups similar cells into clusters. It then
treats each cluster as a node and constructs an MST to connect
cluster centers. The resulting MST provides the backbone of the
pseudo-temporal path. Individual cells are then projected to the
tree backbone to determine their pseudotime and order on the path
(Fig. 1).
Unlike TSCAN, Monocle directly uses individual cells rather
than cell clusters as tree nodes to construct MST. A drawback of this
approach is the instability of the tree inference. When there are a
large number of cells, the tree space is highly complex due to the
large number of tree nodes. Random noise can easily change the
topology of the MST, making the tree inference highly variable and
unstable [6]. As a result, the MST obtained by this approach often
deviates from the true biology. Moreover, the computation also
becomes more challenging with a large number of cells. By cluster-
ing cells and treating clusters as nodes, TSCAN substantially
reduces the number of tree nodes and hence the complexity of
the tree space, making the tree inference less variable and more
stable. Similar to the bias-variance tradeoff in machine learning, the
reduced complexity of the tree space often results in improved tree
estimation. A systematic evaluation in [6] shows that by using this
cluster-based MST approach, TSCAN was better able to reconstruct
the underlying biological processes compared to Monocle.
TSCAN is freely available as an open-source R package. In the
following sections, we will introduce how TSCAN can be used to
perform pseudotime analysis step by step.

2 Materials

In order to use TSCAN, one needs to have the following software
and data.

Fig. 1 TSCAN analysis workflow. Starting from a preprocessed gene expression matrix, TSCAN constructs
cells’ pseudo-temporal ordering using the following steps: reduce dimension of the gene expression matrix,
group cells into clusters, construct MST that links the cluster centers to create the backbone of the pseudo-
temporal trajectory, and project individual cells onto tree edges to obtain pseudotime

2.1 R

R is an open-source software for statistical computing that can be
freely downloaded and installed [11].

2.2 TSCAN Package

TSCAN is an open-source R package that can be freely downloaded
and installed from Bioconductor [12] or Github [13]. After
installing R, the latest TSCAN can be installed from Github by
typing the following commands in R:

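# Install devtools if it is not already available, then install TSCAN from Github.
if (!requireNamespace("devtools", quietly = TRUE))
  install.packages("devtools")
devtools::install_github("zji90/TSCAN")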

2.3 ScRNA-Seq Data

We assume that raw scRNA-seq data have already been processed
and summarized into a matrix of normalized gene expression values
(see Note 1). In this matrix, each row represents a gene and each
column represents a cell. The matrix is stored in a tab-delimited text
file. The first row of the file stores cell names, and the first column
contains gene names.
In this chapter, we will demonstrate the pseudotime analysis
using a scRNA-seq dataset of differentiating human skeletal muscle
myoblasts [1]. The dataset contains 271 cells collected at 0, 24,
48, and 72 h after switching human myoblasts to low serum. The
log2 transformed gene expression matrix (HSMM_Y) can be
downloaded from Github [14].
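
The input layout described above (genes in rows, cells in columns, cell names in the first row, gene names in the first column) can be illustrated with a small round trip in base R. This is only a sketch with a made-up 3 x 4 matrix; the gene names, cell names, and temporary file are assumptions for illustration and are not part of the HSMM dataset.

```r
# Toy matrix of normalized expression values: 3 genes (rows) x 4 cells (columns).
# Gene and cell names are hypothetical.
expr <- matrix(c(0,   1.5, 2.0, 0,
                 3.2, 0,   0,   1.1,
                 0.7, 0.9, 1.3, 2.4),
               nrow = 3, byrow = TRUE,
               dimnames = list(paste0("gene", 1:3), paste0("cell", 1:4)))

# Write a tab-delimited file. col.names = NA adds an empty top-left field so
# that the header row has the same number of fields as the data rows.
toy_file <- tempfile(fileext = ".txt")
write.table(expr, file = toy_file, sep = "\t", quote = FALSE, col.names = NA)

# Read it back: the first column supplies the gene names (row names).
toy <- as.matrix(read.table(toy_file, as.is = TRUE, header = TRUE,
                            sep = "\t", row.names = 1))
dim(toy)
```

A file written this way has one header line of cell names followed by one line per gene, which is the layout assumed throughout the chapter.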

3 Methods

A typical pseudotime analysis using TSCAN consists of four steps:
(1) loading data and preprocessing, (2) dimension reduction and
cell clustering, (3) pseudotime reconstruction, (4) differential gene
expression analysis.

3.1 Loading Data and Preprocessing

The first step of TSCAN is to load the scRNA-seq data into R and
convert it into a matrix object. For example, the HSMM dataset,
which is stored in a tab-delimited text file, can be loaded using the
following R command (see Note 2):

library(TSCAN)
# Read the tab-delimited expression matrix; the first column holds the gene names.
data <- as.matrix(read.table("HSMM_Y.txt", as.is=TRUE, header=TRUE, sep="\t",
                             row.names=1))

Next, the data can be preprocessed in multiple ways using the
function “preprocess” provided by TSCAN. Using this function,
one can log transform the expression matrix and remove genes with
low expression or low variability. A gene with low expression can be
defined by comparing its mean expression value across cells to a
user-specified cutoff, or by comparing the percentage of cells with
nonzero expression to a cutoff. Low variability is defined by com-
paring a gene’s coefficient of variation across cells to a user-specified
cutoff. ScRNA-seq data can be sparse and contain many zero gene
expression values. To mitigate sparsity, one can also use the “pre-
process” function to group co-expressed genes into clusters. For
each gene cluster, the mean expression of all genes in the cluster will
be computed for each cell. By aggregating information from multi-
ple genes, the cluster-level expression typically is more continuous
and much less sparse. The following R command demonstrates
preprocessing using the HSMM data:

aggdata <- preprocess(data, clusternum=0.05*nrow(data), takelog=FALSE,
                      minexpr_value=0, minexpr_percent=0, cvcutoff=0)

This command removes genes that have zero expression in all
cells. It also groups genes into clusters and returns the mean gene
expression for each cluster. Here the cluster number is equal to 5%
of the total gene number. The cluster-level expression is stored in a
new matrix aggdata which will be used for subsequent pseudo-
time reconstruction.
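
To make the filtering and aggregation ideas above concrete, the following base R sketch applies the three kinds of cutoffs (mean expression, fraction of cells with nonzero expression, coefficient of variation) to simulated data and then averages genes within gene clusters. It is an illustration under stated assumptions, not the internal implementation of “preprocess”: the simulated matrix, the cutoff values, and the use of hierarchical clustering with Euclidean distance are all choices made here for the example.

```r
set.seed(1)
# Simulated expression: 200 genes x 30 cells, with many zeros.
expr <- matrix(rexp(200 * 30), nrow = 200,
               dimnames = list(paste0("g", 1:200), paste0("c", 1:30)))
expr[sample(length(expr), 3000)] <- 0
expr[1:5, ] <- 0   # five genes with no expression in any cell

# Hypothetical cutoffs, chosen only for illustration.
min_mean    <- 0.1   # minimum mean expression across cells
min_nonzero <- 0.2   # minimum fraction of cells with nonzero expression
min_cv      <- 0.5   # minimum coefficient of variation

gene_mean <- rowMeans(expr)
gene_frac <- rowMeans(expr > 0)
gene_cv   <- apply(expr, 1, sd) / gene_mean   # NaN for all-zero genes
keep <- gene_mean > min_mean & gene_frac > min_nonzero &
  !is.na(gene_cv) & gene_cv > min_cv
filtered <- expr[keep, , drop = FALSE]

# Group the remaining genes (here by hierarchical clustering) and compute,
# for every cell, the mean expression of the genes in each group.
n_clusters   <- max(1, round(0.05 * nrow(filtered)))
gene_cluster <- cutree(hclust(dist(filtered)), k = n_clusters)
agg <- rowsum(filtered, gene_cluster) / as.vector(table(gene_cluster))
dim(agg)
```

The aggregated matrix has one row per gene cluster and one column per cell, so it can be used in place of the gene-level matrix in the later steps.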

3.2 Dimension Reduction and Cell Clustering

In order to construct pseudo-temporal trajectory, TSCAN first
groups similar cells into clusters and then constructs a minimum
spanning tree (MST) to connect cluster centers. Compared to
constructing an MST to connect individual cells, using cell clusters
as tree nodes in the MST can reduce the complexity of the tree
space and improve the accuracy and stability of tree inference [6].
The default procedure used by TSCAN to cluster cells involves
two steps. In the first step, TSCAN reduces the dimension of the
input data using principal component analysis (PCA) (see Note 3).
Dimension reduction will help visualization and mitigate the curse
of dimensionality in computing cells’ distances. The number of
principal components (PCs) to keep is automatically determined
using an elbow method [6]. In the second step, model-based
clustering [15] is used to cluster cells based on the dimension-
reduced data (see Note 4). The number of clusters is automatically
determined using the Bayesian information criterion (BIC). These
two steps can be carried out using a single R function, as illustrated
below using the HSMM data:

HSMMmclust <- exprmclust(aggdata)

One can visualize the dimension-reduced data and cell clustering
results. For example, the following R function generates a
scatterplot (Fig. 2a) that visualizes the second and third PCs as
well as cell clustering:

plotmclust(HSMMmclust, x=2, y=3)

In this plot, each dot represents a cell. Different cell clusters are
represented using different colors and marker shapes. Cluster cen-
ters are marked with numbers.
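
The first step of the clustering procedure, PCA followed by an elbow rule for choosing the number of PCs, can be sketched in base R as follows. This is a minimal illustration on simulated data. The elbow heuristic shown (a continuous two-segment linear fit to the PC standard deviations) is one possible choice and is an assumption here; it should not be read as the exact rule used inside exprmclust.

```r
set.seed(1)
# Simulated data: 100 genes x 60 cells driven by 3 latent factors plus noise.
n_genes <- 100
n_cells <- 60
loadings <- matrix(rnorm(n_genes * 3), n_genes, 3)
factors  <- matrix(rnorm(3 * n_cells, sd = 3), 3, n_cells)
expr <- loadings %*% factors +
  matrix(rnorm(n_genes * n_cells), n_genes, n_cells)

# PCA with cells as observations (rows) and genes as variables (columns).
pca  <- prcomp(t(expr), scale. = TRUE)
sdev <- pca$sdev[1:20]

# Elbow heuristic: for each candidate breakpoint, fit a continuous
# two-segment line to the standard deviations and keep the breakpoint
# with the smallest residual sum of squares.
idx        <- seq_along(sdev)
candidates <- 2:10
rss <- sapply(candidates, function(b) {
  hinge <- pmax(0, idx - b)
  sum(resid(lm(sdev ~ idx + hinge))^2)
})
n_pcs <- candidates[which.min(rss)]

# Cells in the reduced space: one row per cell, one column per retained PC.
reduced <- pca$x[, seq_len(n_pcs), drop = FALSE]
dim(reduced)
```

Cell clustering would then be run on the rows of the reduced matrix rather than on the full expression profiles.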

3.3 Pseudotime Reconstruction

Once cells are clustered, TSCAN will construct an MST to connect
the cluster centers. This tree will serve as the backbone of the
pseudo-temporal trajectory. The tree may have multiple branches.
By default, TSCAN will choose the path with the largest number of
cell clusters as the main pseudo-temporal path. If two paths have
the same number of cell clusters, the path with the largest number
of cells will be chosen as the main path. For example, the main path
in Fig. 2a will be path 3-1-4.
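
The tree construction and the default path rule described above can be sketched in base R. The example below builds an MST over a few hypothetical cluster centers with Prim's algorithm and then picks the leaf-to-leaf path with the most clusters, breaking ties by the number of cells. The centers, the cluster sizes, and the helper functions mst_edges and path_between are all made up for illustration; TSCAN's own implementation may differ.

```r
# Hypothetical cluster centers in a 2-D reduced space (one row per cluster)
# and hypothetical numbers of cells per cluster.
centers <- rbind(c(0, 0), c(1, 0.2), c(2, 0), c(1, 1.5))
rownames(centers) <- 1:4
cluster_size <- c(50, 80, 60, 40)

# Prim's algorithm on the complete graph weighted by Euclidean distance.
mst_edges <- function(centers) {
  d <- as.matrix(dist(centers))
  k <- nrow(d)
  in_tree <- c(TRUE, rep(FALSE, k - 1))
  edges <- matrix(NA_integer_, nrow = k - 1, ncol = 2)
  for (step in seq_len(k - 1)) {
    w    <- d[in_tree, !in_tree, drop = FALSE]
    best <- which(w == min(w), arr.ind = TRUE)[1, ]
    from <- which(in_tree)[best[1]]
    to   <- which(!in_tree)[best[2]]
    edges[step, ] <- c(from, to)
    in_tree[to] <- TRUE
  }
  edges
}
edges <- mst_edges(centers)

# Depth-first search for the unique tree path between two nodes.
path_between <- function(edges, from, to, visited = from) {
  if (from == to) return(visited)
  nbrs <- c(edges[edges[, 1] == from, 2], edges[edges[, 2] == from, 1])
  for (nb in setdiff(nbrs, visited)) {
    p <- path_between(edges, nb, to, c(visited, nb))
    if (!is.null(p)) return(p)
  }
  NULL
}

# Candidate paths run between leaves (nodes with a single edge).
degree <- table(factor(c(edges), levels = seq_len(nrow(centers))))
leaves <- unname(which(degree == 1))
pairs  <- combn(leaves, 2)
paths  <- lapply(seq_len(ncol(pairs)), function(j)
  path_between(edges, pairs[1, j], pairs[2, j]))

# Most clusters first; ties are broken by the total number of cells.
n_clust   <- sapply(paths, length)
n_cells   <- sapply(paths, function(p) sum(cluster_size[p]))
main_path <- paths[[order(-n_clust, -n_cells)[1]]]
main_path
```

In this toy tree all three leaf-to-leaf paths contain three clusters, so the tie-breaking rule on cell numbers decides which one is selected.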
Sometimes, some branches of the tree are not of biological
interest. For example, some cell clusters and tree branches represent
contaminated cells. To help identify and remove such branches, one
can use TSCAN to visualize expression of marker genes. To dem-
onstrate, the following R commands show the expression of two
marker genes in the HSMM data: PDGFRA (Fig. 2b) and MYOG
(Fig. 2c).

# PDGFRA (ENSG00000134853.7)
plotmclust(HSMMmclust, x=2, y=3,
           markerexpr=data["ENSG00000134853.7",], showcluster=FALSE)
# MYOG (ENSG00000122180.4)
plotmclust(HSMMmclust, x=2, y=3,
           markerexpr=data["ENSG00000122180.4",], showcluster=FALSE)

This dataset contains some contaminated cells which are
marked by high expression of PDGFRA [1]. On the other hand,
MYOG is a key marker gene for human skeletal muscle cell differ-
entiation, and its expression is expected to increase as cells differen-
tiate [1]. Based on this prior knowledge, cell cluster 3 is most likely
contaminated cells because it has high PDGFRA expression. Cluster
2 has relatively low expression of MYOG, and cluster 4 has

Fig. 2 TSCAN analysis of the HSMM data. A. Scatterplot showing PC2, PC3, and cell clustering. Path of interest
is 3-1-4. B. Visualization of PDGFRA marker gene expression. C. Visualization of MYOG marker gene
expression. D. Scatterplot showing PC2, PC3, and cell clustering. Path of interest is 2-1-4

relatively high MYOG expression. Thus, one can identify cluster
2 as undifferentiated cells and cluster 4 as differentiated cells. Since
the primary interest here is to study human skeletal muscle cell
differentiation, the pseudo-temporal path of main interest should
be 2-1-4 rather than the default main path 3-1-4. Using the fol-
lowing R command, TSCAN allows one to choose a path (e.g.,
path 2-1-4) other than the default main path for visualization and
analysis (Fig. 2d):

plotmclust(HSMMmclust, x=2, y=3, MSTorder=c(2,1,4))

After obtaining the backbone of pseudo-temporal trajectory
using the cluster-based MST, one can then order cells along the
pseudo-temporal path and assign pseudotime to cells. TSCAN
orders cells in two steps. First, cell clusters are ordered based on
the pseudo-temporal path. For example, for path 2-1-4, all cells in
cluster 2 are placed before cells in cluster 1, and cells in cluster 1 are
placed before cells in cluster 4. Second, in order to order cells
within the same cluster, cells are projected onto the edge that
connects the centers of neighboring clusters, and the projection
determines cell ordering (Fig. 1, Pseudotime Assignment). For
example, cells in cluster 2 are projected onto the edge linking the
centers of cluster 2 and cluster 1. Cells with a projection closer to
the center of cluster 2 are placed before cells with a projection closer
to the center of cluster 1. Once cell ordering is determined, the
order of each cell is used as its pseudotime. As an example, the
following command can be used to reconstruct cells’ pseudo-
temporal ordering along the path 2-1-4:

HSMMTSCANorder214 <- TSCANorder(HSMMmclust, MSTorder=c(2,1,4))
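
The within-cluster ordering described above amounts to a scalar projection onto the edge between two neighboring cluster centers. The base R sketch below shows this for one cluster, using simulated cells and hypothetical center coordinates; it is meant only to illustrate the geometry and is not TSCAN's code.

```r
set.seed(1)
# Hypothetical centers of two neighboring clusters in a 2-D reduced space.
center_a <- c(0, 0)   # earlier cluster on the path
center_b <- c(4, 0)   # later cluster on the path

# Simulated cells from the earlier cluster, scattered around its center.
cells <- cbind(rnorm(10, mean = 0, sd = 1), rnorm(10, mean = 0, sd = 1))
rownames(cells) <- paste0("cell", 1:10)

# Scalar projection of each cell onto the edge direction (center_a -> center_b).
edge_dir <- (center_b - center_a) / sqrt(sum((center_b - center_a)^2))
proj <- as.vector(sweep(cells, 2, center_a) %*% edge_dir)
names(proj) <- rownames(cells)

# Cells projecting closer to center_a come first; the rank is the pseudotime.
cell_order <- names(sort(proj))
pseudotime <- setNames(seq_along(cell_order), cell_order)
pseudotime
```

For a full path, the same projection would be repeated cluster by cluster, and the per-cluster orderings concatenated in path order.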

After pseudotime reconstruction, one can visualize how gene
expression changes along pseudotime. For example, the following
command will generate a plot to show how the expression of
MYH2 changes along the pseudotemporal path 2-1-4 (Fig. 3a):

singlegeneplot(data["ENSG00000125414.13",],HSMMTSCANorder214)

[Fig. 3a plot: Expression (y-axis) against Pseudotime (x-axis); points are marked by State (clusters 1, 2, and 4)]

Fig. 3 Differential gene detection by TSCAN. (a) Scatterplot showing expression of MYH2 gene along the
pseudotemporal path 2-1-4. (b) An example of TSCAN output that shows a few top differentially expressed
genes

Since the pseudo-temporal trajectory constructed using MST
may contain multiple branches, one can use the following com-
mand to obtain the pseudo-temporal ordering of all possible paths
(see Note 5):

HSMMTSCANorderfull <- TSCANorder(HSMMmclust,listbranch = T)

3.4 Differential Gene Expression Analysis

Given cells’ pseudo-temporal ordering, one can detect genes with
significant expression changes along pseudotime. To detect such
genes, TSCAN fits a generalized additive model (GAM) for each
gene to describe the functional relationship between gene expres-
sion and pseudotime. The fitted model is compared to a null model
that assumes constant expression along pseudotime. The model
fitting is performed using the mgcv package in R [16]. A likelihood
ratio test is conducted to obtain a p-value. To account for multiple
testing, p-values are converted to false discovery rate (FDR). By
default, genes with FDR < 0.05 are reported as differential. To
demonstrate, the following command performs differential analysis
along the pseudo-temporal path 2-1-4:

diffres <- difftest(data,HSMMTSCANorder214)
# Show the first few rows of the result
head(diffres)

The returned object is a data frame that contains the p-values
and FDRs for differential tests (Fig. 3b).
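
The testing idea described in this section (a smooth model of expression along pseudotime compared against a constant-expression model, followed by FDR adjustment) can be sketched with mgcv on simulated data. The two simulated genes, the basis dimension k = 3, and the helper lrt_pvalue are assumptions made for this illustration; the sketch is not a copy of difftest and its p-values need not match those returned by TSCAN.

```r
library(mgcv)

set.seed(1)
n_cells    <- 100
pseudotime <- seq_len(n_cells)

# Two simulated genes: one that increases along pseudotime, one that is flat.
expr <- rbind(
  trending = 0.05 * pseudotime + rnorm(n_cells, sd = 0.5),
  flat     = rnorm(n_cells, mean = 2, sd = 0.5)
)

# Likelihood ratio test of a smooth GAM against an intercept-only model.
lrt_pvalue <- function(y, ptime) {
  full <- gam(y ~ s(ptime, k = 3))
  null <- gam(y ~ 1)
  stat <- as.numeric(2 * (logLik(full) - logLik(null)))
  df   <- attr(logLik(full), "df") - attr(logLik(null), "df")
  pchisq(stat, df = df, lower.tail = FALSE)
}

pval <- apply(expr, 1, lrt_pvalue, ptime = pseudotime)
qval <- p.adjust(pval, method = "fdr")   # Benjamini-Hochberg adjustment
res  <- data.frame(pval = pval, qval = qval)
res
```

Genes would then be reported as differential when their adjusted value falls below the chosen FDR cutoff, such as the 0.05 default mentioned above.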

4 Notes

1. There are a number of different technology platforms to generate
scRNA-seq data. Data from different platforms can have
very different characteristics. Currently, there is no universal
computational pipeline that can optimally process all types of
scRNA-seq data. Thus, while TSCAN provides pseudotime
analysis functions, it is users’ own responsibility to find the
most appropriate way to map sequence reads and convert the
read data into normalized gene expression matrices before
pseudotime analysis.
2. The matrix of normalized gene expression values used as
TSCAN input can also be stored in text files that use other
common delimiters. For example, instead of using a tab-delim-
ited text file, one can store the input data in a comma-separated
value (CSV) file that uses comma to separate different columns.
To load such data, one only needs to change the separator (i.e.,
the “sep=” argument) in the read.table command. For
example:

data <- as.matrix(read.table("HSMM_Y.csv", as.is=TRUE, header=TRUE, sep=",",
                             row.names=1))

3. While TSCAN uses PCA as its default dimension reduction
method, there are many other methods in R that can be used
for dimension reduction (e.g., tsne [17]). TSCAN provides
users with the flexibility to use a dimension reduction method
of their choice. In order to do so, users only need to run
dimension reduction using the R function they choose. Suppose
the dimension-reduced data are stored in matrix customdimred.
Users can supply this matrix to TSCAN for
clustering cells by the following command:

HSMMmclust <- exprmclust(customdimred, reduce=F)

4. TSCAN also provides users with the flexibility to choose how
to cluster cells. Instead of using model-based clustering, users
can supply their own clustering results derived by other clus-
tering methods such as k-means clustering or hierarchical clus-
tering. The following command accepts both customized
dimension reduction and cell clustering results:

HSMMmclust <- exprmclust(customdimred, reduce=F, cluster=customcluster)

After loading the cell clustering results, users can then proceed
with the remaining steps of pseudotime reconstruction.
5. For user’s convenience, TSCAN also provides a graphical user
interface (GUI) to perform pseudotime analysis. Most of the
functions and commands described above have corresponding
buttons in the GUI. Instead of typing the commands in R, one
can also use GUI to execute the same functions. This usually
only requires one to click a few buttons. The link to the online
version of TSCAN GUI can be found on TSCAN’s Github
page [13]. On that page, there is a video demonstrating how to
use the GUI. Since the video is quite straightforward, we will
not repeat the demonstration of GUI here. The TSCAN GUI
can also be invoked locally in R on user’s own computer using
the following command:

TSCANui()

Compared to the GUI, using the R commands described in this
chapter can give users more control of the analysis. Moreover,
these R commands can serve as building blocks of users’ own
analysis pipelines and can be easily incorporated into their
codes and programs.

References
1. Trapnell C, Cacchiarelli D, Grimsby J et al (2014) The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat Biotechnol 32(4):381–386
2. Zheng C, Zheng L, Yoo JK et al (2017) Landscape of infiltrating T cells in liver cancer revealed by single-cell sequencing. Cell 169(7):1342–1356
3. Shalek AK, Satija R, Shuga J et al (2014) Single-cell RNA-seq reveals dynamic paracrine control of cellular variation. Nature 510(7505):363–369
4. Marco E, Karp RL, Guo G et al (2014) Bifurcation analysis of single-cell gene expression data reveals epigenetic landscape. Proc Natl Acad Sci U S A 111(52):E5643–E5650
5. Shin J, Berg DA, Zhu Y et al (2015) Single-cell RNA-Seq with waterfall reveals molecular cascades underlying adult neurogenesis. Cell Stem Cell 17(3):360–372
6. Ji Z, Ji H (2016) TSCAN: pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis. Nucleic Acids Res 44(13):e117–e117
7. Haghverdi L, Buettner M, Wolf FA et al (2016) Diffusion pseudotime robustly reconstructs lineage branching. Nat Methods 13(10):845–848
8. Wouter S, Robrecht C, Helena T, et al (2018) A comparison of single-cell trajectory inference methods: towards more accurate and robust tools. bioRxiv. https://doi.org/10.1101/276907
9. Herring CA, Chen B, McKinley ET et al (2018) Single-cell computational strategies for lineage reconstruction in tissue systems. Cell Mol Gastroenterol Hepatol 5(4):539–548
10. Street K, Risso D, Fletcher RB et al (2018) Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genomics 19(1):477
11. R project. https://www.r-project.org/
12. TSCAN R package on Bioconductor. https://www.bioconductor.org/packages/release/bioc/html/TSCAN.html
13. TSCAN R package on Github. https://github.com/zji90/TSCAN
14. HSMM single-cell RNA-seq dataset. https://raw.githubusercontent.com/zji90/TSCANdata/master/HSMM_Y.txt
15. Fraley C, Raftery AE (2002) Model-based clustering, discriminant analysis, and density estimation. J Am Stat Assoc 97(458):611–631
16. Wood SN (2011) Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. J Royal Stat Soc Sec B 73(1):3–36
17. Maaten LVD, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9:2579–2605
Chapter 9

Estimating Differentiation Potency of Single Cells Using
Single-Cell Entropy (SCENT)
Weiyan Chen and Andrew E. Teschendorff

Abstract
The ability to measure molecular properties (e.g., mRNA expression) at the single-cell level is revolutioniz-
ing our understanding of cellular developmental processes and how these are altered in diseases like cancer.
The need for computational methods aimed at extracting biological knowledge from such single-cell data
has never been greater. Here, we present a detailed protocol for estimating differentiation potency of single
cells, based on our Single-Cell ENTropy (SCENT) algorithm. The estimation of differentiation potency is
based on an explicit biophysical model that integrates the RNA-Seq profile of a single cell with an
interaction network to approximate potency as the entropy of a diffusion process on the network. We
here focus on the implementation, providing a step-by-step introduction to the method and illustrating it
on a real scRNA-Seq dataset profiling human embryonic stem cells and multipotent progenitors represent-
ing the 3 main germ layers. SCENT is aimed particularly at single-cell studies trying to identify novel stem-
or-progenitor like phenotypes, and may be particularly valuable for the unbiased identification of cancer
stem cells. SCENT is implemented in R, licensed under the GNU General Public Licence v3, and freely
available from https://github.com/aet21/SCENT.

Key words Single-cell, RNA-Seq, Differentiation potency, Network, Entropy

1 Introduction

Over 50 years ago Conrad Waddington proposed an epigenetic
landscape model of cellular differentiation, in which cell-fate transi-
tions are modeled as canalization events, with stable cell states
occupying the basins or attractor states [1, 2]. A key ingredient of
this landscape is the energy potential or height [3], which correlates
with cell-potency, i.e., the number of cell-fate choices a given cell
may have. According to this landscape model of cellular differenti-
ation, human embryonic stem cells (hESCs), owing to their plur-
ipotency, occupy the highest attractor state, with terminally
differentiated cells occupying the lowest lying basins. In between
these extremes and within a specific cell-lineage may lie other
attractor states representing progenitor cells of variable degrees of

Guo-Cheng Yuan (ed.), Computational Methods for Single-Cell Data Analysis, Methods in Molecular Biology, vol. 1935,
https://doi.org/10.1007/978-1-4939-9057-3_9, © Springer Science+Business Media, LLC, part of Springer Nature 2019

potency. Although Waddington’s landscape model is an unrealistic
oversimplification, with newer proposed models [4] indicating
substantially more intercellular heterogeneity and higher-order
bifurcations, it is clear that quantifying the potency of single cells
remains a task of critical importance: quantifying single-cell
potency can not only help us understand cell-type heterogeneity
within FACS-sorted cell populations and allow explicit construc-
tion of cell differentiation landscapes [2, 5], but importantly, can
also provide us with an unbiased means of identifying novel stem
and progenitor cell phenotypes. For instance, estimating potency of
cells may be particularly important in the context of cancer, where
one may wish to identify putative cancer stem cell phenotypes,
which are thought to drive tumor growth, tumor heterogeneity,
and drug resistance [6, 7].
It is reasonable to assume that potency is encoded by the
genome-wide transcriptomic profile of the cell, and recent techno-
logical advances in single-cell RNA-Seq data generation [3, 8, 9]
are allowing us to test, for the first time, explicit biophysical models
for estimating differentiation potency of individual cells. We note
that the task of estimating differentiation potency of single cells is
different to that of most single-cell analysis algorithms proposed so
far [10–18], whose main aim is to infer cell-lineage differentiation
trajectories, typically in the context of time-course scRNA-Seq
data. Although these algorithms infer pseudotime, which in time-
course differentiation experiments can be interpreted as differenti-
ation potency, these tools do not use an explicit biophysical model
to estimate it, as they often involve a process of training or feature
selection that draws information from all cells in the study. Alterna-
tively, the tools might require prior biological knowledge (e.g.,
surface marker expression) as to which cells may define start or
root nodes. As such, most of these tools do not represent general
models and therefore are not applicable for estimating potency, for
instance in the context of non-time-course scRNA-Seq data [19] or
scRNA-Seq data from cancer tissue [6, 7], or may only be applica-
ble within a specific lineage (e.g., the hematopoietic lineage).
Indeed, so far only a few explicit biophysical models for estimating
differentiation potency of single cells have been proposed. These
are (1) StemID [20], an algorithm that approximates differentia-
tion potency by computing a genome-wide transcriptomic entropy
measuring how uniformly expressed the genes are, (2) SLICE [21],
a method that defines potency in terms of the Shannon entropy
over gene expression derived gene-ontology activity estimates, in
effect measuring the uniformity of the activation profile of different
biological processes in a cell, and (3) Signaling Entropy or Single-
Cell Entropy (SCENT) [22], which models potency in terms of the
entropy-rate of a diffusion process [23] on a signaling network,
thus measuring how efficiently signaling can diffuse over the whole
network. A recent comparative study of these methods has shown
that the Signaling Entropy measure used in SCENT currently offers
the most reliable means of estimating differentiation potency [22],
and importantly its validity has been widely demonstrated not only
on single cells but also on bulk tissue samples and cell-lines
[24, 25]. Moreover, the Signaling Entropy measure used to
approximate differentiation potency has been shown to correlate
with drug resistance [25] and clinical outcome in lung and breast
cancer [26]. For these reasons, this chapter focuses on the estima-
tion of differentiation potency via the Signaling Entropy method
used in the SCENT package. As we shall see below, we assume that
an RNA-Seq dataset (be it single cell or bulk sample) has been
normalized at the intra cell/sample and inter cell/sample levels,
meaning that batch effects or other technical confounders have
been accounted for. However, as we stress again later, the signaling
entropy calculation is fairly robust to the normalization procedure
and fairly insensitive to batch effects, as it only depends on ratios of
gene expression, not their absolute levels.
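
The general idea of using entropy to measure how uniformly expression is spread across genes can be illustrated with a few lines of base R. The sketch below computes a plain normalized Shannon entropy of an expression profile. It is not the Signaling Entropy of SCENT (which is defined on a network and is not reproduced here), nor the exact measure used by StemID or SLICE; the function normalized_entropy and the two toy profiles are assumptions for illustration. It does show two properties mentioned above in a simple setting: more uniform profiles give higher entropy, and rescaling all values by a constant leaves the result unchanged because only expression ratios enter.

```r
# Shannon entropy of an expression profile after scaling it to sum to 1,
# divided by log(number of genes) so that the result lies between 0 and 1.
normalized_entropy <- function(x) {
  p <- x / sum(x)
  p <- p[p > 0]                      # treat 0 * log(0) as 0
  -sum(p * log(p)) / log(length(x))
}

uniform_profile <- rep(5, 8)                    # all genes equally expressed
peaked_profile  <- c(100, 1, 1, 1, 1, 1, 1, 1)  # one dominant gene

normalized_entropy(uniform_profile)       # maximal uniformity
normalized_entropy(peaked_profile)        # lower entropy
normalized_entropy(10 * peaked_profile)   # unchanged by a global rescaling
```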

2 Materials

2.1 Software and Hardware Required

The estimation of differentiation potency using SCENT requires
the following hardware and software:

1. A standard desktop or laptop computer with a Windows, Mac
OS X, or Linux operating system. We recommend at least
16 GB RAM.
2. R statistical computing environment (version 3.4.3 or later)
from The Comprehensive R Archive Network
(https://cran.r-project.org/).
3. Install R and Bioconductor packages into the R environment,
including Biobase, mclust, igraph, isva, cluster, corpcor. We
also recommend installing the package parallel to speed up
computations.
4. Download and install the SCENT R package from
https://github.com/aet21/SCENT.

2.2 RNA-Seq Data

The first data input for the computation of differentiation potency
using the SCENT package is a single-cell RNA-Seq dataset. The
procedure is identical for bulk RNA-Seq data, the only difference
being in the specific preprocessing and normalization of the data.
To illustrate the method, we shall use a scRNA-Seq dataset from
Chu et al. [19], generated with the Fluidigm C1 platform. This
dataset can be downloaded from the GEO website under accession
number GSE75748, i.e., via
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE75748, and the
specific file to download is GSE75748_sc_cell_type_ec.csv.gz. It
contains scRNA-Seq profiles for

Table 1
Distribution and number of single-cell types in Chu et al. dataset

Index  Label  Cell type                    Potency          Number (after QC)
1      hESC   Human embryonic stem cell    Pluripotent      374
2      NPC    Ectoderm progenitor          Non-pluripotent  173
6      DEC    Endoderm progenitor          Non-pluripotent  138
5      EC     Mesoderm progenitor          Non-pluripotent  105
4      HFF    Human foreskin fibroblasts   Non-pluripotent  159
3      TB     Trophoblasts                 Non-pluripotent  69

1018 single cells, composed of 374 human embryonic stem cells (hESCs),
173 neural progenitor cells (NPCs), 138 definitive endoderm pro-
genitors (DECs), 105 mesoderm progenitors of endothelial cells
(ECs), 69 trophoblasts (TBs), and 159 human fibroblasts (HFFs),
as indicated in Table 1. These cells were obtained via FACS-sorting
and/or induced differentiation experiments from hESCs [19]. We
also provide a log-normalized version of the scRNA-Seq data from
the github link https://github.com/aet21/SCENT/ under filename
“sceChu.Rd”, for easy uploading into R.

2.3 User-Defined Functional Gene Network

The second required input for the SCENT algorithm is a user-
defined functional gene network, for instance, a protein-protein
interaction (PPI) network documenting the main interactions
that take place in a cell. For justification as to why a PPI network
is needed, see Notes 1 and 2. Although these networks are mere
caricatures of the underlying signaling networks, ignoring time,
spatial, and biological contexts, some features of the network may
nevertheless be informative, and as we shall see below this is indeed
the case. For instance, if protein-A is a hub (a node of very high
connectivity) and protein-B has only a few connections, then it is
likely that protein-A will have a higher connectivity than protein-B
in any particular biological context (unless of course protein-A is
absent from the cell altogether). As we shall see later, the scRNA-Seq
data will provide us with the biological context in which to generate
context or cell-type-specific networks. The specific PPI network we
use here is derived from Pathway Commons
(www.pathwaycommons.org) (downloaded in Jan. 2016), which is an
integrated resource collating together PPIs from several distinct
sources. In particular, the network is constructed by integrating
the following sources: the Human Protein Reference Database
(HPRD), the National Cancer Institute Nature Pathway Interaction
Database (NCI-PID) (http://pid.nci.nih.gov), the Interactome
(IntAct), and the Molecular Interaction Database (MINT).
Protein interactions in this network include physical stable
interactions such as those defining protein complexes, as well as
transient interactions such as posttranslational modifications and
enzymatic reactions found in signal transduction pathways. We
focused on nonredundant interactions, only included nodes with
an Entrez gene ID annotation, and focused on the maximally
connected component (see Note 3), resulting in a connected
network of 10,720 nodes (unique Entrez IDs) and 152,889
documented interactions. This network can be downloaded from
https://github.com/aet21/SCENT/ under filename ppiAsigH-PC2-17Jan2016.Rd.
Another earlier version of the network of size 8434
nodes can be found under filename hprdAsigH-13Jun12.Rd.
Here, we encode the network as an adjacency matrix of dimension
8434 × 8434, with “0” entries indicating that no interaction
or connection between the two genes has been documented, while
a “1” means that an interaction has been documented. Diagonal
entries are set to 0. Importantly, because SCENT will integrate the
network with RNA-Seq data, the row names and column names of
the network must use the same gene identifier as used in the
RNA-Seq data matrix.
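
The encoding just described can be sketched in a few lines of base R. This is a minimal illustration only: the edge list, the gene labels G1, G2, G3, G9 and the object names adj.m and exp.m are hypothetical placeholders, not part of the files distributed with SCENT, where Entrez gene IDs are used.

```r
# Hypothetical undirected interactions between placeholder genes.
edges <- data.frame(from = c("G1", "G1", "G2"),
                    to   = c("G2", "G3", "G3"),
                    stringsAsFactors = FALSE)

# Square 0/1 adjacency matrix with identical row and column names.
genes <- sort(unique(c(edges$from, edges$to)))
adj.m <- matrix(0, nrow = length(genes), ncol = length(genes),
                dimnames = list(genes, genes))
adj.m[cbind(edges$from, edges$to)] <- 1   # documented interaction
adj.m[cbind(edges$to, edges$from)] <- 1   # interactions are undirected
diag(adj.m) <- 0                          # diagonal entries are set to 0

# Toy expression matrix (genes x cells) using the same identifier type.
exp.m <- matrix(runif(8), nrow = 4,
                dimnames = list(c("G1", "G2", "G3", "G9"),
                                c("cell1", "cell2")))

# Identifiers shared by the network and the expression matrix.
common <- intersect(rownames(adj.m), rownames(exp.m))
```

If common turns out to be empty or very short for real data, the most likely cause is that the two matrices use different gene identifiers.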

3 Methods

3.1 Workflow for Signaling Entropy Calculation

The estimation of differentiation potency with SCENT consists of
three major steps: (1) checking and further processing (if required)
of the normalized scRNA-Seq data, (2) integration of the scRNA-Seq
data with the user-defined gene functional network, and (3) computation
of the Signaling Entropy Rate (denoted SR) which is used
to approximate differentiation potency of single cells (Fig. 1a).
Optionally, SCENT can also be used to quantify the heterogeneity
in potency of a single-cell population (Fig. 1b, c), and to infer
lineage relationships between the major clusters of single cells
[22]. However, because SCENT was designed mainly for the esti-
mation of differentiation potency we shall only focus on the
method for estimating it and how it can be used to infer discrete
potency states.

3.1.1 Checking and Further Preprocessing of the scRNA-Seq Data

We assume that the scRNA-Seq data have been appropriately normalized.
Given a count matrix of reads mapping to genes, we
assume that the user has run this count matrix through a single-cell
processing and quality control pipeline (see, e.g., [27]), such as
that provided by the Bioconductor package scater [28] or
R-package Seurat [29]. The end result of this is typically a
log-normalized scRNA-Seq data matrix. The log-transformation
provides a natural regularization of the data, stabilizing the variance
of highly expressed genes, and is strongly recommended for any
down-stream analyses [27], especially in the context of SCENT.

Fig. 1 (a) The estimation of differentiation potency using the signaling entropy rate requires two inputs: a user-
defined gene functional network (e.g., a PPI network) which does not depend on biological context, and a
normalized scRNA-Seq profile of a cell or sample, which thus provides the biological context. The latter profile
is overlaid onto the network to define a cell-specific stochastic matrix P with entries $p_{ij}$. From this matrix,
we can derive the invariant measure (steady-state probability) $\pi_i$, which satisfies $\pi P = \pi$, and finally the
signaling entropy rate SR is obtained as a weighted average over local signaling entropies. This allows cells to
be ordered according to their SR values, i.e., according to potency. SR can be quantified on a scale between
0 and 1 (not shown). (b) Transforming the normalized SR to the logit-scale and fitting Gaussian Mixture models
allows the identification of potency states. (c) The distribution of these potency states across cell-types can be
analyzed to identify novel cellular phenotypes that differ in potency. For instance, this strategy could be used
to identify cells primed for differentiation within a multipotent or pluripotent cell population

In addition, we must also take care of the smallest values in the
normalized data matrix, because negative and 0 values are not
allowed in SCENT. The reason why SCENT requires
non-negatively valued data is that it estimates potency from a
stochastic matrix on the network where the weights on the edges
of the network represent signaling probabilities (which by necessity
must be zero or positive). Usually, application of the scater pipeline
will not result in negative values but will result in 0 values for those
genes with zero counts. The reason why zero expression values
must be further excluded is because the construction of the sto-
chastic matrix involves ratios of gene expression values, and ratios
will be undefined if the denominator is zero. Thus, for a count
matrix $c_{ij}$ with rows labeling genes and columns labeling samples/
cells, we may use a transformation of the form
$\log_2\left(\frac{c_{ij}}{tc_j}\, sf + 1.1\right)$,
where $tc_j$ is the total number of counts in cell j and $sf$ is a global scale
factor representing library complexity. The above transformation
guarantees that values are always strictly positive. Finally, we point
out that another potential pitfall when preparing the scRNA-Seq
data matrix for SCENT is to remove genes that are not expressed
across all or a relatively high proportion of the cells (see also Note
4). Removing such nonexpressed genes is of course a very popular
procedure in scRNA-Seq data analysis, but that is because such
genes are uninformative or do not explain interesting variation
across cells. However, in the context of estimating differentiation
potency with SCENT, it is important to note that this estimation is
done for each cell individually, i.e., independently of all other cells
(see Note 5). If genes are truly not expressed in a given cell or in a
given condition, or across a range of different conditions, then
removing them could potentially lead to loss of information with
regards to the signaling entropy calculation, as their removal may
result in a much smaller network (see Note 4). Thus, for the specific
task of estimating differentiation potency via SCENT, we do not
recommend removing any genes based on zero or low expression
levels across all or most cells.
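
As a minimal sketch of the transformation discussed above, the following base R code applies a log2 transform with an offset of 1.1 to a toy count matrix. The function name log_normalize, the toy counts, and the default scale factor of 1e4 are illustrative assumptions and are not taken from the SCENT package.

```r
# Log-transform with an offset of 1.1 so that every value, including
# those of genes with zero counts, is strictly positive.
# counts: genes x cells matrix of raw counts; sf: global scale factor
# (the default of 1e4 is an arbitrary choice for illustration).
log_normalize <- function(counts, sf = 1e4) {
  tc <- colSums(counts)                        # total counts per cell
  log2(sweep(counts, 2, tc, "/") * sf + 1.1)   # log2(c_ij / tc_j * sf + 1.1)
}

# Toy count matrix with some zero entries.
counts <- matrix(c(0, 5, 10,
                   3, 0, 7),
                 nrow = 3,
                 dimnames = list(paste0("g", 1:3), c("cellA", "cellB")))

norm.m <- log_normalize(counts)
```

Genes with zero counts map to log2(1.1), so no gene has to be dropped to keep the expression ratios well defined.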

3.1.2 Integration with User-Defined Gene Functional Network

Integration is achieved with the DoIntegPPI function, and consists
of two steps:
1. The function takes as input two arguments, the normalized
scRNA-Seq data matrix exp.m with rows labeling genes and
columns labeling cells/samples, and a user-defined network
ppiA.m with rows and columns labeling genes. The same
gene identifier must be used for both expression and network
matrices. The function finds the overlap between the gene
identifiers specifying the network with those specifying the
scRNA-Seq matrix, and then extracts the maximally connected
subnetwork (see Note 3), specified by the adjMC output
argument.
2. The function also constructs the reduced scRNA-Seq matrix,
specified by the expMC output argument, and which is there-
fore matched to the rows (nodes) of the adjMC matrix.
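
To make the two steps concrete, the logic can be sketched in base R as follows. This is not the DoIntegPPI source code: the helper names largest_component and integrate_sketch, and the toy network, are hypothetical, and the breadth-first search is just one simple way of finding the maximally connected subnetwork.

```r
# Return the sub-matrix of adj.m spanned by its largest connected component,
# found by a breadth-first search over the nodes.
largest_component <- function(adj.m) {
  n <- nrow(adj.m)
  comp <- rep(NA_integer_, n)          # component label of each node
  current <- 0L
  for (start in seq_len(n)) {
    if (!is.na(comp[start])) next      # node already assigned
    current <- current + 1L
    comp[start] <- current
    queue <- start
    while (length(queue) > 0) {
      node <- queue[1]
      queue <- queue[-1]
      nbrs <- which(adj.m[node, ] != 0 & is.na(comp))
      comp[nbrs] <- current
      queue <- c(queue, nbrs)
    }
  }
  sizes <- table(comp)
  keep <- which(comp == as.integer(names(sizes)[which.max(sizes)]))
  adj.m[keep, keep, drop = FALSE]
}

# Step 1: overlap identifiers and extract the connected subnetwork.
# Step 2: reduce the expression matrix to the same genes, in the same order.
integrate_sketch <- function(exp.m, ppi.m) {
  common <- intersect(rownames(ppi.m), rownames(exp.m))
  adjMC <- largest_component(ppi.m[common, common, drop = FALSE])
  list(adjMC = adjMC, expMC = exp.m[rownames(adjMC), , drop = FALSE])
}

# Toy network: a-b-c and d-e-f; gene f is absent from the expression data.
genes <- c("a", "b", "c", "d", "e", "f")
ppi.m <- matrix(0, 6, 6, dimnames = list(genes, genes))
links <- rbind(c("a", "b"), c("b", "c"), c("d", "e"), c("e", "f"))
ppi.m[links] <- 1
ppi.m[links[, 2:1]] <- 1
exp.m <- matrix(1:10, nrow = 5,
                dimnames = list(c("a", "b", "c", "d", "e"),
                                c("cell1", "cell2")))

int.o <- integrate_sketch(exp.m, ppi.m)
```

In this toy case the retained subnetwork is a-b-c, because losing gene f leaves d-e as the smaller component; comparing nrow(int.o$adjMC) with nrow(ppi.m) gives the size check recommended in Note 3.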

3.1.3 Computation of the Signaling Entropy Rate

The output from DoIntegPPI is then used as input to the functions
that compute the signaling entropy rate, denoted as SR. For a given
fixed network, i.e., for the given adjMC matrix, there is a maximum
possible SR value (denoted as maxSR), which is obtained for a
particular edge-weight configuration [22]. It is thus very conve-
nient to report the SR value of any given cell, normalized against
this maximum possible value, which means that SR is then bounded
between 0 and 1. The maximum entropy rate value can be calcu-
lated using the CompMaxSR function. This function takes as input
the adjacency matrix, i.e., the adjMC output from the DoIntegPPI
function, and returns the maxSR value as output.
Having obtained the normalization factor maxSR, we can then
proceed to compute the SR value for any given cell/sample, using
the CompSRana function. This function takes four objects as input:
1. The expression profile vector of the cell/sample (exp.v), which
is a given column of the expMC output matrix from
DoIntegPPI,
2. The network adjacency matrix (adj.m), i.e., the adjMC output
from DoIntegPPI,
3. A logical parameter local to tell the function whether to report
back the local, i.e., gene-centric, signaling entropies, and,
4. maxSR, the maximum entropy rate calculated earlier. Specifying
maxSR = NULL will force the function to return
non-normalized SR values.
Two notes with the above procedure: (a) the local gene-based
entropies can be used in downstream analyses for ranking genes
according to differential entropy, but only if appropriately normal-
ized. For instance, they could be used to identify the main genes
driving changes in the global signaling entropy rate of the network.
However, if the user only wishes to estimate potency, specifying
local = FALSE is fine, which will save some RAM on the output
object, (b) by specifying the input object as an expression vector,
the user can easily use the parallel R-package to compute the SR
values for all cells/samples simultaneously. For this purpose, we
also provide on the github website another function CompSRanaPRL,
which takes in as input an index and the full scRNA-Seq
data matrix in place of the expression profile of one cell/sample.
One can then use the mclapply function in the parallel package to
loop over the index values, which specify the columns (i.e., cells/
samples) for which the SR values are desired. This will become
clearer in the example given further below. The output of the
CompSRana and CompSRanaPRL functions is a list consisting of
four objects:
1. sr: the SR value of the cell/sample.
2. inv: a vector specifying the invariant measure, or steady-state
probability, over the network. That is, the length of this vector will be equal
to the number of nodes in the adjMC matrix, and its entries
must add to 1.
3. s: a vector containing the unnormalized local signaling entro-
pies, and therefore a vector of length equal to the number of
nodes in the adjMC matrix.
4. ns: if local = TRUE, a vector containing normalized local
signaling entropies, and therefore a vector of length equal to
the number of nodes in the adjMC matrix.
The computation of signaling entropy is relatively fast. After
the integration of the scRNA-Seq data matrix with the network, the
resulting graph has approximately 8000 nodes, and estimation of
SR for one sample only takes 1–2 s on an Intel Xeon(R) CPU
E3-1575 M v5 @ 3.00GHz processor.
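
For readers who wish to see the quantities involved, the calculation can be sketched in base R. This is a simplified illustration and not the CompSRana or CompMaxSR source code. We assume here, following the description of Fig. 1a, that the stochastic matrix has entries $p_{ij}$ proportional to $A_{ij} x_j$ (with A the adjacency matrix and x the expression vector), and that the maximum entropy rate of a connected network equals the logarithm of the largest eigenvalue of A; both function names below are hypothetical.

```r
# Signaling entropy rate of one cell (sketch).
# exp.v: strictly positive expression vector, ordered as the rows of adj.m.
signaling_entropy_sketch <- function(exp.v, adj.m, maxSR = NULL) {
  w <- sweep(adj.m, 2, exp.v, "*")             # w_ij = A_ij * x_j
  p <- w / rowSums(w)                          # row-stochastic matrix P
  inv <- exp.v * rowSums(w)                    # proportional to x_i * (A x)_i
  inv <- inv / sum(inv)                        # invariant measure, sums to 1
  s <- -rowSums(ifelse(p > 0, p * log(p), 0))  # local signaling entropies
  sr <- sum(inv * s)                           # weighted average over nodes
  if (!is.null(maxSR)) sr <- sr / maxSR        # normalize to the (0, 1] range
  list(sr = sr, inv = inv, s = s)
}

# Assumed upper bound: log of the dominant eigenvalue of the adjacency matrix.
max_entropy_sketch <- function(adj.m) {
  log(max(eigen(adj.m, symmetric = TRUE, only.values = TRUE)$values))
}

# Toy connected network of four genes and one expression profile.
g <- paste0("g", 1:4)
adj.m <- matrix(c(0, 1, 1, 0,
                  1, 0, 1, 0,
                  1, 1, 0, 1,
                  0, 0, 1, 0), 4, 4, dimnames = list(g, g))
exp.v <- c(g1 = 3.2, g2 = 1.5, g3 = 4.0, g4 = 0.8)

maxSR <- max_entropy_sketch(adj.m)
res <- signaling_entropy_sketch(exp.v, adj.m, maxSR)
```

Because only ratios of expression values enter P, multiplying exp.v by a constant leaves res$sr unchanged, which is the insensitivity to scale mentioned in Note 1.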

3.2 Real Dataset Test

To illustrate the procedure above, we now run in detail through the
example given in Table 1. As mentioned, we assume that the count
matrix has been normalized, say using a specific package (e.g.,
scater). The normalized count matrix we use can be downloaded
from https://github.com/aet21/SCENT/ (file sceChu.Rd) and loaded into
your session using the load command:

load("sceChu.Rd");

This loads in sceChu.m, the log-normalized count matrix, and
phenoSCchu.v, an index vector specifying the cell-type (Table 1).
Likewise, we need to load in the gene-functional (PPI) network
specified earlier, using, e.g.,

load("hprdAsigH-13Jun12.Rd");

Thus, we would then run the following series of commands:

library(parallel) ### provides mclapply
int.o <- DoIntegPPI(sceChu.m,hprdAsigH.m)
adjMC.m <- int.o$adjMC
maxSR <- CompMaxSR(adjMC.m)
idx.l <- as.list(1:ncol(sceChu.m))
out.l <- mclapply(idx.l,CompSRanaPRL,int.o$expMC,adjMC.m,
maxSR,mc.cores=10);

Fig. 2 Boxplot of signaling entropy rate values (SR, y-axis) against cell-type
(pluripotent hESCs vs non-pluripotent (NotPl), x-axis). P-value is from a
one-tailed Wilcoxon rank sum test. Number of cells in each cell-type category
is given in group labels

The last line assumes we are running on a computer with at
least 10 processing cores. From the out.l object, we can then finally
store the SR values in a vector which we shall denote by SRpSC.v.
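
A minimal sketch of this last step is given below. Since each element of the returned list contains an sr entry (see above), the values can be collected with sapply; the mock out.l object and its numbers are invented here purely so that the snippet runs on its own.

```r
# Mock of the list returned by mclapply: one list per cell, each with an
# 'sr' entry. The values below are invented for illustration only.
out.l <- list(list(sr = 0.91), list(sr = 0.84), list(sr = 0.88))

# Collect the SR value of every cell into one numeric vector.
SRpSC.v <- sapply(out.l, function(res) res$sr)
```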
To check that the SR values do indeed correlate with potency,
we can run something like:

phenoChu.v <- phenoSCchu.v
phenoChu.v[phenoSCchu.v>=2] <- 2; ### to compare plurip. to non-plurip.
boxplot(SRpSC.v ~ phenoChu.v)

which should result in the display shown in Fig. 2.
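
The P-value reported in Fig. 2 comes from a one-tailed Wilcoxon rank sum test. A sketch of such a test with the base R function wilcox.test is shown below; the SR values are simulated (their means and spreads are arbitrary assumptions) so that the snippet is self-contained, and with real data one would use SRpSC.v and phenoChu.v instead.

```r
# Simulated SR values standing in for real output: group 1 plays the role
# of the pluripotent cells, group 2 of the non-pluripotent cells.
set.seed(1)
SR.v <- c(rnorm(30, mean = 0.90, sd = 0.02),
          rnorm(40, mean = 0.80, sd = 0.03))
pheno.v <- rep(c(1, 2), times = c(30, 40))

# One-tailed test of whether SR is higher in group 1 than in group 2.
wt <- wilcox.test(SR.v[pheno.v == 1], SR.v[pheno.v == 2],
                  alternative = "greater")
wt$p.value
```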

It might also be of interest to explore the heterogeneity in
potency of the cell population. One can approach this question
using the DoSCENT function, as shown below:

scent.o <- DoSCENT(sceChu.m,SRpSC.v,phenoSCchu.v)

The distribution of potency states in relation to cell-type can be
obtained from the $distPSPH entry:

> scent.o$distPSPH
ordpotS.v
celltype 1 2
1 355 19
2 90 83
3 8 61
4 0 159
5 0 105
6 10 128

Fig. 3 Normalized mRNA expression values for 3 genes that mark hESCs primed for differentiation into
mesoderm (PECAM1) or ectoderm (SOX2 & HES1) lineages, against the predicted potency group of hESCs.
PS1 = pluripotent/higher potency state, PS2 = non-pluripotent/lower potency state. P-values are from a
one-tailed Wilcoxon rank sum test

This indicates that the algorithm inferred 2 potency states, with
the first state being enriched for pluripotent hESCs (row indexed
by 1) and a substantial proportion of neural progenitors (row
indexed by 2), whereas the other potency state is dominated by
non-pluripotent cell-types (row index values 2–6). Of note,
approximately 5% of the hESCs were predicted to be of significantly
lower potency than the rest, and it is natural to assume that these
cells may already be primed to differentiate into any one of the
different germ layers (ectoderm, mesoderm, endoderm). We can
check that the predicted lower potency for these hESCs is in line
with this hypothesis, by comparing the expression value distribu-
tion of a known mesoderm stem-cell marker (PECAM1) and two
neural stem cell markers (HES1 and SOX2), all of which exhibited
higher expression in hESCs categorized into the lower potency
group (Fig. 3), as required.
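
The proportions quoted above can be recomputed directly from the printed table. The sketch below re-enters the counts shown in the $distPSPH output by hand (the object names are ours) and converts them to row-wise proportions with prop.table.

```r
# Counts copied from the scent.o$distPSPH output shown above:
# rows are cell-type indices (Table 1), columns are potency states.
distPSPH <- matrix(c(355,  19,
                      90,  83,
                       8,  61,
                       0, 159,
                       0, 105,
                      10, 128),
                   ncol = 2, byrow = TRUE,
                   dimnames = list(celltype = as.character(1:6),
                                   potency_state = as.character(1:2)))

# Fraction of each cell type assigned to each potency state.
prop.m <- prop.table(distPSPH, margin = 1)

# Fraction of hESCs (cell type 1) in the lower potency state (state 2).
prop.m["1", "2"]
```

For the hESCs this gives 19/374, i.e., the approximately 5% mentioned in the text.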

4 Notes

1. Signaling entropy has been validated as a measure of potency
across a large number of independent datasets, including time-
course and non-time-course scRNA-Seq sets [22], as well as
time-course and non-time-course bulk RNA-Seq and microar-
ray sets [24]. Its robustness stems from three properties: First,
it integrates the expression data with orthogonal systems-level
information, such as that provided by a PPI network. Although
these networks are themselves noisy, SR depends mainly on
the relative connectivity of the proteins in the network, which is
a robust feature. The connectivity is important since SR can be
approximated as a suitably transformed correlation between
the transcriptome and connectome, with cells of increased
potency exhibiting higher mRNA expression levels of the
more connected or promiscuous proteins [22]. Second, expres-
sion values only enter the SR calculation as expression ratios,
therefore rendering the SR value fairly insensitive to the abso-
lute scale and normalization procedure for the expression data
(although as mentioned earlier, a log-transformation is recom-
mended to regularize the effects of outlier expression). Third,
SR is a global measure computed over a large network of over
8000 nodes. In the case of scRNA-Seq data, this confers it
substantial robustness against technical dropouts. Moreover,
technical dropouts are more likely to affect lowly expressed
genes, which, theoretically, for the networks considered here,
are less influential in determining the SR value.
2. We emphasize that the use of a PPI network is key. As shown by
us previously, randomizing the expression values over the net-
work, which in effect randomly reassigns connectivity values to
each gene, results in a substantially reduced discrimination
[22]. Likewise, we find that reverse-engineering a network
from the data itself does not yield a robust potency measure.
As mentioned in the previous note, potency is encoded in part
via a subtle positive correlation between the transcriptome and
PPI connectome. Thus, we strongly recommend a PPI network
such as those derived from Pathway Commons (https://www.
pathwaycommons.org/).
3. It is important that the signaling entropy rate is estimated over
a connected network, since the entropy rate is ill-defined for
networks that are not. The function within the package that
integrates the scRNA-Seq data with the user-specified PPI
network will automatically identify the maximally connected
subnetwork, to ensure that the network passed on to the
subsequent functions that compute SR is connected. It is
important however to check the size of the maximally
connected subnetwork before running the final computation
of SR. Usually, the maximally connected subnetwork will be a
“prominent” subnetwork, i.e., one containing a large fraction
(say over 80% or 70%) of the original number of nodes in the
user-specified PPI network. In our experience, it is very
unlikely that, for instance, upon integration with the scRNA-
Seq data, the original PPI network would split up into 2 large
and roughly equally sized components.
4. It is important to point out again that the robustness of Signal-
ing Entropy derives in part from using a large (connected)
network. It is therefore important not to remove genes from
the scRNA-Seq data matrix before integration with the PPI
network, as removing a large number of such genes prior to
integration could result in a network that is too small, or may
not result in a sufficiently large maximally connected subnet-
work. By sufficiently large we mean a network of at least 5000
nodes or so. A common procedure in scRNA-Seq data analyses
is however to remove genes that are not expressed across all or a
great proportion of the cells, since these are naturally uninfor-
mative for popular tasks such as dimensional reduction, clus-
tering, or differential expression. However, the entropy
calculation for estimating potency is different in that it esti-
mates it for each cell independently of all others, and so remov-
ing genes that are not expressed could potentially remove
biological information by unnecessarily reducing the size of
the network, and thus compromising the accuracy of SR.
5. We stress that Signaling Entropy represents a model-driven
general approach to the estimation of differentiation potency.
It is estimated for each cell or sample individually without using
information from other cells or samples. As such, it is relatively
assumption-free, does not require training or feature selection,
thus avoids overfitting, and in principle allows ordering of cells
in terms of their potency within any given lineage. However,
we point out that the full resolution of the method, i.e., how small
a change in potency can be detected, is still under active
investigation.

5 Conclusions

Computation of signaling entropy can help users order cells/samples
according to their potency, without the need for any prior
biological knowledge, input, or feature selection. This may help
identify in an unbiased way novel stem-and-progenitor cell pheno-
types in large-scale scRNA-Seq data, especially non-time-course
data, and in both normal and cancer physiological settings. The
functions provided in the SCENT package and the example work-
flow provided here should help users learn how to compute signal-
ing entropy for their single-cell or bulk RNA-Seq data.

References

1. Waddington CH (1966) Principles of development and differentiation. Macmillan, London, pp 1905–1975
2. Moris N, Pina C, Arias AM (2016) Transition states and cell fate decisions in epigenetic landscapes. Nat Rev Genet 17:693–703. https://doi.org/10.1038/nrg.2016.98
3. Levsky JM (2002) Single-cell gene expression profiling. Science 297:836–840. https://doi.org/10.1126/science.1072241
4. Laurenti E, Göttgens B (2018) From haematopoietic stem cells to complex differentiation landscapes. Nature 553:418–426. https://doi.org/10.1038/nature25022
5. Lang AH, Li H, Collins JJ, Mehta P (2014) Epigenetic landscapes explain partially reprogrammed cells and identify key reprogramming genes. PLoS Comput Biol 10:e1003734. https://doi.org/10.1371/journal.pcbi.1003734
6. Tirosh I, Venteicher AS, Hebert C et al (2016) Single-cell RNA-seq supports a developmental hierarchy in human oligodendroglioma. Nature 539:309–313. https://doi.org/10.1038/nature20123
7. Tirosh I, Izar B, Prakadan SM et al (2016) Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science 352:189–196. https://doi.org/10.1126/science.aad0501
8. Wang Z, Gerstein M, Snyder M (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 10:57–63. https://doi.org/10.1038/nrg2484
9. Grün D, van Oudenaarden A (2015) Design and analysis of single-cell sequencing experiments. Cell 163:799–810. https://doi.org/10.1016/j.cell.2015.10.039
10. Trapnell C, Cacchiarelli D, Grimsby J et al (2014) The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat Biotechnol 32:381–386. https://doi.org/10.1038/nbt.2859
11. Marco E, Karp RL, Guo G et al (2014) Bifurcation analysis of single-cell gene expression data reveals epigenetic landscape. Proc Natl Acad Sci U S A 111:E5643–E5650. https://doi.org/10.1073/pnas.1408993111
12. Setty M, Tadmor MD, Reich-Zeliger S et al (2016) Wishbone identifies bifurcating developmental trajectories from single-cell data. Nat Biotechnol 34:637–645. https://doi.org/10.1038/nbt.3569
13. Bendall SC, Davis KL, E-AD A et al (2014) Single-cell trajectory detection uncovers progression and regulatory coordination in human B cell development. Cell 157:714–725. https://doi.org/10.1016/j.cell.2014.04.005
14. Chen J, Schlitzer A, Chakarov S et al (2016) Mpath maps multi-branching single-cell trajectories revealing progenitor cell progression during development. Nat Commun 7:11988. https://doi.org/10.1038/ncomms11988
15. Qiu X, Mao Q, Tang Y et al (2017) Reversed graph embedding resolves complex single-cell trajectories. Nat Methods 14:979–982. https://doi.org/10.1038/nmeth.4402
16. Rizvi AH, Camara PG, Kandror EK et al (2017) Single-cell topological RNA-seq analysis reveals insights into cellular differentiation and development. Nat Biotechnol 35:551–560. https://doi.org/10.1038/nbt.3854
17. Haghverdi L, Büttner M, Wolf FA et al (2016) Diffusion pseudotime robustly reconstructs lineage branching. Nat Methods 13:845–848. https://doi.org/10.1038/nmeth.3971
18. Angerer P, Haghverdi L, Büttner M et al (2016) Destiny: diffusion maps for large-scale single-cell data in R. Bioinformatics 32:1241–1243. https://doi.org/10.1093/bioinformatics/btv715
19. Chu L-F, Leng N, Zhang J et al (2016) Single-cell RNA-seq reveals novel regulators of human embryonic stem cell differentiation to definitive endoderm. Genome Biol 17:2315. https://doi.org/10.1186/s13059-016-1033-x
20. Grün D, Muraro MJ, Boisset J-C et al (2016) De novo prediction of stem cell identity using single-cell Transcriptome data. Cell Stem Cell 19:266–277. https://doi.org/10.1016/j.stem.2016.05.010
21. Guo M, Bao EL, Wagner M et al (2017) SLICE: determining cell differentiation and lineage based on single cell entropy. Nucleic Acids Res 45:e54. https://doi.org/10.1093/nar/gkw1278
22. Teschendorff AE, Enver T (2017) Single-cell entropy for accurate estimation of differentiation potency from a cell’s transcriptome. Nat Commun 8:15599. https://doi.org/10.1038/ncomms15599
23. Gómez-Gardeñes J, Latora V (2008) Entropy rate of diffusion processes on complex networks. Phys Rev E Stat Nonlinear Soft Matter Phys 78:114. https://doi.org/10.1103/PhysRevE.78.065102
24. Banerji CRS, Miranda-Saavedra D, Severini S et al (2013) Cellular network entropy as the energy potential in Waddington’s differentiation landscape. Sci Rep 3:1129. https://doi.org/10.1038/srep03039
25. Teschendorff AE, Sollich P, Kuehn R (2014) Signalling entropy: a novel network-theoretical framework for systems analysis and interpretation of functional omic data. Methods 67:282–293. https://doi.org/10.1016/j.ymeth.2014.03.013
26. Banerji CRS, Severini S, Caldas C, Teschendorff AE (2015) Intra-tumour Signalling entropy determines clinical outcome in breast and lung cancer. PLoS Comput Biol 11:e1004115. https://doi.org/10.1371/journal.pcbi.1004115
27. Lun ATL, McCarthy DJ, Marioni JC (2016) A step-by-step workflow for low-level analysis of single-cell RNA-seq data with bioconductor. F1000Res 5:2122. https://doi.org/10.12688/f1000research.9501.2
28. McCarthy DJ, Campbell KR, Lun ATL, Wills QF (2017) Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R. Bioinformatics 247:btw777. https://doi.org/10.1093/bioinformatics/btw777
29. Butler A, Hoffman P, Smibert P et al (2018) Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat Biotechnol 36:411–420. https://doi.org/10.1038/nbt.4096
Chapter 10

Inference of Gene Co-expression Networks from Single-Cell RNA-Sequencing Data
Alicia T. Lamere and Jun Li

Abstract
Single-cell RNA-Sequencing is a pioneering extension of bulk-based RNA-Sequencing technology. The
“guilt-by-association” heuristic has led to the use of gene co-expression networks to identify genes that are
believed to be associated with a common cellular function. Many methods that were developed for bulk-
based RNA-Sequencing data can continue to be applied to single-cell data, and several of the most widely
used methods are explored. Several methods for leveraging the novel time information contained in single-
cell data when constructing gene co-expression networks, which allows for the incorporation of directed
associations, are also discussed.

Key words Gene co-expression network, Gene regulatory network, Single-cell RNA-Seq, Correlation
coefficient, Count data, Directed network, Pseudotime

1 Introduction

Both gene function and regulatory relationships can be inferred and
identified through the use of gene co-expression networks (GCNs).
In GCNs, nodes represent genes and an edge between two nodes
represents co-expression between a pair of genes (see Fig. 1a below).
Computational inference of GCNs is based on a set of experiments,
each measuring the expression of a large set of genes. These experi-
ments use samples from different tissues or different conditions, so
genes that are co-expressed tend to have high/low expressions in
the same experiments simultaneously.
The “guilt-by-association” heuristic has led to the use of GCNs
to identify co-expressing genes that are believed to be associated
with a common cellular function. Well-constructed GCNs are used
to help understand molecular mechanisms underlying biological
processes and to predict gene functions that are not previously
known [1].
Guo-Cheng Yuan (ed.), Computational Methods for Single-Cell Data Analysis, Methods in Molecular Biology, vol. 1935,
https://doi.org/10.1007/978-1-4939-9057-3_10, © Springer Science+Business Media, LLC, part of Springer Nature 2019

[Fig. 1 graphic: panel (A) shows a network with nodes Gene A, Gene B, and Gene C; panel (B) plots the expression of Gene A and Gene B against pseudotime]

Fig. 1 (a) Example GCN. Here, because there exists an edge between the nodes representing Gene A and
Gene B, these two genes show evidence of co-expression. Meanwhile, no edge exists between Gene B and
Gene C, so there is no evidence of co-expression for this pair of genes. (b) Example of two gene expressions
that exhibit a regulatory relationship when ordered by pseudotime. If we simply considered their correlation,
the pair appear unrelated. However, if we correlate the lagged expression, looking at Gene A’s expression
from time 0 and Gene B’s expression from time l, then they exhibit a strong positive correlation

Many methods have been proposed for constructing GCNs
based on microarray and bulk-based RNA-Sequencing (RNA-Seq)
data [2–6] and can be applied to single-cell RNA-Sequencing
(scRNA-Seq) to construct GCNs as well. A significant portion of
this chapter will be dedicated to discussing these methods, with
particular focus on so-called “correlation-based” methods, which
remain the most straightforward and easy-to-interpret methods for
constructing GCNs. In these methods, an edge exists between two
genes if a strong correlation is found between the expression of that
pair of genes.
However, in using these GCNs with scRNA-Seq, the data are
treated no differently than bulk-based RNA-Seq. As a result, infor-
mation contained uniquely in scRNA-Seq is lost. Particularly,
correlation-based GCNs constructed based on microarray or
RNA-Seq data only capture simultaneous associations between
pairs of genes. Biologically, if a gene enhances/inhibits another
gene, then the latter gene will have delayed expression/silence
[7]. For such a pair, the co-expression of the two genes is strong
if the delay in time is taken into account, but can be weak if only
simultaneous association is explored.
Consider the hypothetical gene expressions in Fig. 1b,
arranged by time. If we consider the entire expressions of genes A
and B, then their correlation is close to 0. However, looking at their
expressions, it clearly appears that gene A is enhancing the expres-
sion of gene B, meaning that as gene A’s expression increases, we
see a delayed increase in gene B’s expression, beginning at time
t = l. We also observe a corresponding delayed decrease in gene B’s
expression following a decrease in gene A. This implies a strong
positive correlation, which we can capture if we take our time lag
into account.
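
The effect of the time lag can be illustrated with a short simulation in base R. The expression curves below are synthetic (a smooth pulse for gene A and the same pulse delayed by a lag of 50 time units for gene B); the pulse shape, its width, and the lag value are arbitrary assumptions chosen only to mimic Fig. 1b.

```r
n <- 200        # number of pseudotime points
lag <- 50       # assumed delay between gene A and gene B
pt <- seq_len(n)

# Gene A shows a smooth pulse; gene B shows the same pulse 'lag' units later.
geneA <- exp(-((pt - 60) / 15)^2)
geneB <- exp(-((pt - 60 - lag) / 15)^2)

# Simultaneous correlation: the two pulses barely overlap.
cor0 <- cor(geneA, geneB)

# Lagged correlation: gene A from time 1, gene B from time 1 + lag.
cor_lag <- cor(geneA[1:(n - lag)], geneB[(1 + lag):n])
```

Here cor0 is weak, whereas cor_lag is essentially 1, because after shifting by the lag the two profiles coincide.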
A pioneering extension of the bulk-based RNA-Seq technique,
scRNA-Seq, is able to capture this time information in an indirect
way. scRNA-Seq measures the gene expression profile of each indi-
vidual cell, with hundreds to thousands of cells in a single run.
These cells are at different time points of their cell cycles, and these
time points can be estimated based on the idea that expression
profiles should be similar in cells at similar time points. These
estimated time points are called “pseudotime,” and several algo-
rithms have been developed for their estimation [8–12]. We also
discuss these methods in more detail below.

2 Materials

1. R software for statistical computing

R Packages:
2. edgeR (available through Bioconductor).
3. WGCNA (available through CRAN).
4. DiPhiSeq (available through CRAN).
5. Monocle (available through Bioconductor).
6. LEAP (available through CRAN).

3 Methods

This section summarizes the basic steps involved when constructing
a GCN from single-cell data, while exploring several methods that
can be used for each step.

3.1 Normalization

Normalization is an important question to consider when working
with scRNA-Seq data. Beyond differences in sequencing depth that
must be taken into account between experiments, as with bulk-
based RNA-Seq, scRNA-Seq experiments also tend to have a large
amount of technical noise (see Note 1). While some GCN
construction methods do not require normalization (such as iCC,
described below), standard correlation-based methods such as
WGCNA do. Commonly used bulk-based normalization
methods such as FPKM [13], upper quartile [14], or trimmed
mean of M-values [15] can be applied to scRNA-Seq (see Note 2).
These normalizations can be easily performed through the use of
the edgeR package [16] using the following steps:
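1. Let data be a matrix of raw read counts, with genes in rows and
cells in columns.
2. Find the normalization factors for the method you would like
to use. calcNormFactors should be given a DGEList object, so
that the factors are stored together with the library sizes. For
example, if we wish to use trimmed mean of M-values:
>library(edgeR)
>norm_factors = calcNormFactors(DGEList(counts = data), method = "TMM")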
144 Alicia T. Lamere and Jun Li

3. Output a matrix of normalized counts by using the cpm function:
>norm_data = cpm(norm_factors, normalized.lib.sizes = TRUE)
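
To make the normalization concrete, the following base-R sketch shows the counts-per-million arithmetic on a made-up matrix: each count is divided by its cell's effective library size (library size times scaling factor) and multiplied by one million. The counts and the scaling factors of 1 are hypothetical; in practice the factors come from the chosen normalization method.

```r
# Toy count matrix: 4 genes (rows) x 3 cells (columns); values are made up.
counts <- matrix(c(10, 0, 5,  85,
                   20, 2, 8, 170,
                    5, 1, 0,  44),
                 nrow = 4,
                 dimnames = list(paste0("gene", 1:4), paste0("cell", 1:3)))

lib_size <- colSums(counts)        # sequencing depth of each cell

# Hypothetical scaling factors, set to 1 here for simplicity
scale_factor <- c(1, 1, 1)

# Divide each column by its effective library size, then scale to one million
cpm_manual <- t(t(counts) / (lib_size * scale_factor)) * 1e6

# Optional log(x + 1) transformation (see Note 2)
log_cpm <- log(cpm_manual + 1)

print(round(cpm_manual))
```

Although the three cells differ fourfold in depth, gene1 has the same normalized value in each, which is the purpose of the depth correction.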

3.2 Identifying Highly Expressed Genes

After normalization and before constructing any GCNs, it is usually
important to filter your dataset to only those genes showing
reasonably high average expression and range of expression. This
step is particularly important for scRNA-Seq data, which have a
large number of drop-out events. By keeping genes with not only
moderate/high expression but also a large range of expression, we
can be more confident that the edges identified in our network are
not simply the result of noise in the dataset. One method for
identifying these genes is to:
1. Find the average and interquartile range of the normalized
expression values for each gene. In R, this can easily be done
using base functions:
>iqr_vals = apply(norm_data, 1, IQR)
>mean_val = apply(norm_data, 1, mean)

2. Scale the interquartile range for each gene by dividing by its
average expression:
>iqr_scale = iqr_vals/mean_val

3. Select genes to keep for analysis with a sufficiently high average
expression and scaled interquartile range. What qualifies as
sufficient is up to the user's discretion; however, a scaled IQR
of at least 0.5 and a mean of at least 0.1 are usually desirable
(see Note 3). The filtered dataset can be obtained by using the
code:
>keep = (mean_val >= 0.1 & iqr_scale >= 0.5)
>filter_data = norm_data[keep,]
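
The filtering steps above can be sketched on a small made-up matrix to show which kinds of genes are removed. The three genes and their values are hypothetical; the thresholds are the ones suggested in the text.

```r
n_cells <- 50

# Made-up normalized expression: rows are genes, columns are cells
norm_data <- rbind(
  variable_gene = seq(0, 20, length.out = n_cells),  # high mean, wide range
  flat_gene     = rep(5, n_cells),                   # expressed but constant
  dropout_gene  = c(0.5, rep(0, n_cells - 1))        # almost always zero
)

iqr_vals  <- apply(norm_data, 1, IQR)
mean_val  <- apply(norm_data, 1, mean)
iqr_scale <- iqr_vals / mean_val

keep <- mean_val >= 0.1 & iqr_scale >= 0.5

# drop = FALSE keeps the result a matrix even if only one gene passes
filter_data <- norm_data[keep, , drop = FALSE]

print(data.frame(mean_val, iqr_scale, keep))
```

Only the gene with both a reasonable mean and a wide spread is kept; the constant gene fails the scaled IQR criterion and the drop-out gene fails the mean criterion.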

3.3 Methods for Constructing Traditional GCNs

When constructing traditional (that is, undirected) gene
co-expression networks, a variety of options already exist for use
with bulk-based RNA-Seq data. These methods fall into two key
categories: correlation-based and non-correlation-based.
Correlation-based methods are generally faster and simpler to
implement, and hence are often preferred; non-correlation-based
methods are more complicated and computationally intensive, and
will not be discussed in detail here (see Note 4). Conceptually, all
of these methods may still be applied to scRNA-Seq data.
Generally, correlation-based methods take the matrix of expression
data and calculate the correlation for each gene pair, usually
Pearson's correlation coefficient. They remain the most
straightforward and easily understood methods for constructing
GCNs from RNA-Seq data.
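
As a minimal sketch of the correlation-based approach, the base-R example below computes the gene-by-gene Pearson correlation matrix for three simulated genes and applies a hard threshold to obtain an adjacency matrix. The simulated data, the gene names, and the cutoff of 0.7 are assumptions made for illustration only.

```r
set.seed(7)
n_cells <- 100
base <- rnorm(n_cells)               # shared expression pattern

# Simulated expression: rows are genes, columns are cells
expr <- rbind(
  geneA = base + rnorm(n_cells, sd = 0.3),
  geneB = base + rnorm(n_cells, sd = 0.3),   # co-expressed with geneA
  geneC = rnorm(n_cells)                     # unrelated to both
)

# cor() correlates columns, so transpose to correlate genes
cor_mat <- cor(t(expr))

# Hard threshold: an edge exists when |correlation| >= 0.7
adjacency <- (abs(cor_mat) >= 0.7) * 1
diag(adjacency) <- 0                 # no self-edges

print(round(cor_mat, 2))
print(adjacency)
```

The resulting network is undirected: the adjacency matrix is symmetric because the Pearson correlation of (A, B) equals that of (B, A).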

3.3.1 WGCNA

Weighted Gene Co-expression Network Analysis (WGCNA) was
originally developed for the analysis of microarray data, which are
continuous [17]. Hence, to use WGCNA, the count data must first
be normalized using the methods described above. Then, Pearson's
correlation coefficient is calculated for each pair of genes,
resulting in an adjacency matrix to which a hard or soft threshold
can be applied to determine a particular GCN. This method has been
successfully applied to RNA-Seq data after proper normalization
[18–20]. One advantage of this method is its use of a soft
threshold through the incorporation of weighted edges. This allows
the user to rank edges for consideration based on the strength of
the connection between each pair of genes. To implement WGCNA,
the following code is provided through a tutorial written by the
package authors:
1. Load WGCNA and, to speed up calculations, enable the use of
multi-threading:
>library(WGCNA)
>enableWGCNAThreads()

2. Identify the soft threshold to be used. WGCNA expects samples
(cells) in rows and genes in columns, so the filtered matrix is
transposed first:
>expr = t(filter_data)
>powers = c(1:10, seq(from = 12, to = 20, by = 2))
>soft = pickSoftThreshold(expr, powerVector = powers, verbose = 5)
>R_sqr = -sign(soft$fitIndices[,3]) * soft$fitIndices[,2]
>plot(soft$fitIndices[,1], R_sqr)

3. Based on this plot, choose the lowest power for which the
signed R-squared curve flattens out upon reaching a high
value. If this value is 6, for example, we can then construct
our network with the code:
>network = blockwiseModules(expr, power = 6, TOMType = "unsigned",
minModuleSize = 30, reassignThreshold = 0,
mergeCutHeight = 0.25, numericLabels = TRUE,
pamRespectsDendro = FALSE, saveTOMs = TRUE,
saveTOMFileBase = "My_network", verbose = 3)

The network will be contained in the TOM file(s) saved with the
prefix "My_network".
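
The difference between a hard and a soft threshold can be seen in a few lines of base R. In an unsigned WGCNA network the edge weight is the absolute correlation raised to the chosen power; the three correlation values below are made up, and the hard cutoff of 0.7 is an assumption for comparison.

```r
# Made-up correlations between three gene pairs
cors  <- c(weak = 0.2, moderate = 0.5, strong = 0.9)
power <- 6                               # soft-thresholding power

# Soft threshold: continuous edge weights that keep the ranking of the pairs
soft_adj <- abs(cors)^power

# Hard threshold: edges are simply present (1) or absent (0)
hard_adj <- as.numeric(abs(cors) >= 0.7)
names(hard_adj) <- names(cors)

print(rbind(correlation = cors, soft = soft_adj, hard = hard_adj))
```

Raising to a power suppresses weak correlations much more than strong ones, so edges can be ranked by weight rather than dropped at an arbitrary cutoff.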

3.4 iCC

Distribution-inversed and Gaussian-transformed Correlation
Coefficient (iCC) is a GCN construction method developed directly
for use on RNA-Seq data [21]. This method does not require any kind
of normalization beforehand. Instead, it works with the count data
directly and incorporates the sequencing depth into the model,
thereby increasing the power. Though not provided as a package,
this method is relatively simple to implement by hand:
1. Let filter_data be data that have not been normalized, but
have been filtered to keep the most highly expressed genes.
Find the transformed sequencing depth for each experiment:

>d = colMeans(filter_data)
>depth = exp(log(d) - mean(log(d)))

2. Estimate the parameters for the distributions describing each

gene’s expression across each experimental condition. This can
be done with the DiPhiSeq R package [22] using the following
code for genes i and j:
>results_i = robnb(filter_data[i,], depth)
>results_j = robnb(filter_data[j,], depth)

3. Calculate the cumulative probability of each observed expression
count, indexed by experiment k, using the distribution parameters:
>pval_i = pval_j = numeric(length(depth))
>for (k in 1:length(depth)) {
pval_i[k] = pnbinom(filter_data[i,k], size=1/results_i$phi,
mu=exp(results_i$beta)*depth[k])
pval_j[k] = pnbinom(filter_data[j,k], size=1/results_j$phi,
mu=exp(results_j$beta)*depth[k])
}

4. Use these probabilities to find the corresponding standard
Normal distribution values.
>norm_i = qnorm(pval_i, 0, 1)
>norm_j = qnorm(pval_j, 0, 1)

5. For each gene pair, use the now Gaussian-distributed values to

estimate their correlation.
>corr_ij = cor(norm_i, norm_j)

These correlations define the adjacency matrix that describes

the GCN. Typically, a cutoff should be chosen for the correlations
to construct the network. It is generally recommended that abso-
lute values of 0.7 or more be used, while not going below 0.5 as
these edges have a greater chance of being the result of noise in
the data.
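
The iCC steps above can be sketched end to end in base R for a single gene pair. This is only an illustration under stated assumptions: the counts are simulated, and a simple moment-based estimate (the hypothetical helper fit_nb_moments) stands in for the robust robnb fit, so the numbers will differ from those of the actual method. The clipping of probabilities is also an added safeguard, not part of the published procedure.

```r
set.seed(11)
n <- 60                                               # number of experiments
depth_raw <- runif(n, 0.5, 2)                         # simulated sequencing depths
depth <- exp(log(depth_raw) - mean(log(depth_raw)))   # geometric mean scaled to 1

shared <- rnorm(n)                                    # signal driving both genes
y_i <- rnbinom(n, size = 5, mu = 50 * depth * exp(0.5 * shared))
y_j <- rnbinom(n, size = 5, mu = 80 * depth * exp(0.5 * shared))

# Moment-based stand-in for a robust negative binomial fit (assumption):
# returns the log mean (beta) and the dispersion (phi)
fit_nb_moments <- function(y, depth) {
  mu_hat  <- sum(y) / sum(depth)
  fitted  <- mu_hat * depth
  phi_hat <- max(mean(((y - fitted)^2 - fitted) / fitted^2), 0.01)
  list(beta = log(mu_hat), phi = phi_hat)
}

# Counts -> negative binomial CDF values -> standard Normal values
gaussianize <- function(y, depth) {
  fit <- fit_nb_moments(y, depth)
  p <- pnbinom(y, size = 1 / fit$phi, mu = exp(fit$beta) * depth)
  p <- pmin(pmax(p, 1e-6), 1 - 1e-6)   # keep qnorm() finite
  qnorm(p)
}

corr_ij <- cor(gaussianize(y_i, depth), gaussianize(y_j, depth))
print(corr_ij)
```

Repeating this for every gene pair fills the correlation matrix to which the cutoff is then applied.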

3.4.1 Directed GCNs Through Pseudotime for scRNA-Seq Data

The network construction methods explored above only describe
co-expression of genes. They were originally designed for use on
either microarray or bulk-based RNA-Seq data, and therefore do
not leverage any of the additional information available through
scRNA-Seq. As described in the introduction, we can leverage
cell-cycle time information through pseudotime algorithms to
identify directed relationships between genes and construct
directed networks. Here we discuss one method, LEAP, in
particular, but there are many other methods available for
constructing GCNs by leveraging the information contained uniquely
in scRNA-Seq data (see Note 5).

3.4.2 Estimating Pseudotime

Key to any time-based inference on gene expression data is the
method used to estimate the time information. For scRNA-Seq,
several algorithms have been developed for the estimation of these
time points, called “pseudotime.” In general, pseudotime analysis
of scRNA-Seq data involves dimension reduction to deal with high
dimensionality, since thousands of gene expression levels are
often measured in each sample. There are many pseudotime
algorithms available now (see Note 6). We will use Monocle because
of its ease of use and implementation as an R package. The authors
of Monocle [8] recognize that by projecting the data to a
lower-dimensional space, natural clustering of cells can occur and
this clustering can capture cells at different time points. Their
algorithm works by first representing each cell's expression as a
point in a Euclidean space with one dimension for each gene
included in the sample. Then this high-dimensional space is
reduced using independent component analysis (ICA), which, as its
name implies, projects the gene expression profiles into a
lower-dimensional space that best separates the statistically
independent components of the data. The algorithm then constructs
a minimum spanning tree on the cells in this lower-dimensional
space. This tree connects all cells using the set of edges with
the smallest total length and contains no cycles. Finally, cells'
positions in the minimum spanning tree are used to assign
“pseudotime” values (see Note 7). Monocle does not require the
scRNA-Seq data to be normalized beforehand. Instead, it handles
all normalization internally. As the first step in our network
construction, we will use Monocle to generate the pseudotime for
each cell in the scRNA-Seq dataset using genes that are known to
be associated with cell cycle.
1. Let filter_data be data that have not been normalized, but
have been filtered to keep the most highly expressed genes.
Create a data set object for Monocle to use:
>library(monocle)
>mon_data = newCellDataSet(as.matrix(filter_data))

2. Reduce the dimensions:
>red_data = reduceDimension(mon_data)

3. Calculate the pseudotime orderings and extract the pseudotime
value of each cell:
>ord_data = orderCells(red_data, num_paths = 2, reverse = FALSE)
>pseudotime = pData(ord_data)$Pseudotime
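
The idea behind this kind of pseudotime estimation can be sketched in base R: reduce the dimension, build a minimum spanning tree, and measure each cell's distance along the tree from a starting cell. This is a simplified illustration and not Monocle's implementation: PCA is used as a stand-in for ICA, the data are simulated, the starting cell is assumed known, and Monocle's handling of branches and of the longest path through the tree is omitted.

```r
set.seed(3)
n_cells <- 40
true_time <- sort(runif(n_cells))       # hidden ordering used to simulate data

# Simulated log-expression: 20 genes changing linearly along the hidden time
slopes <- rnorm(20)
expr <- outer(slopes, true_time) +
  matrix(rnorm(20 * n_cells, sd = 0.05), nrow = 20)

# Step 1: reduce to two dimensions (PCA as a stand-in for ICA)
coords <- prcomp(t(expr))$x[, 1:2]
d <- as.matrix(dist(coords))

# Step 2: minimum spanning tree by Prim's algorithm, grown from cell 1
in_tree   <- c(TRUE, rep(FALSE, n_cells - 1))
parent    <- rep(NA_integer_, n_cells)
best_dist <- d[1, ]                     # cheapest known link from each cell to the tree
best_from <- rep(1L, n_cells)
add_order <- integer(0)
for (step in seq_len(n_cells - 1)) {
  candidates <- which(!in_tree)
  nxt <- candidates[which.min(best_dist[candidates])]
  in_tree[nxt] <- TRUE
  parent[nxt]  <- best_from[nxt]
  add_order    <- c(add_order, nxt)
  closer <- !in_tree & d[nxt, ] < best_dist
  best_dist[closer] <- d[nxt, closer]
  best_from[closer] <- nxt
}

# Step 3: pseudotime = distance along the tree from the starting cell (cell 1)
pseudotime <- numeric(n_cells)
for (v in add_order) pseudotime[v] <- pseudotime[parent[v]] + d[v, parent[v]]

print(cor(pseudotime, true_time, method = "spearman"))
```

Because pseudotime is only an ordering, its agreement with the hidden time is assessed with a rank correlation rather than by comparing the values directly.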

3.4.3 LEAP Algorithm for Estimating Co-expression

LEAP is a method created for direct use on scRNA-Seq data to
construct directed GCNs. Borrowing from time-series analysis,
LEAP sorts cells according to the estimated cell-cycle-based
pseudotime, creating a “pseudotime-series,” and then computes the
maximum correlation over all possible time lags [23]. This maximum
correlation replaces the traditional Pearson's correlation
coefficient as the statistic for constructing the gene
co-expression network, and its statistical significance is
measured by the false discovery rate (FDR) calculated using
permutations. LEAP is implemented as an R package, and the
general steps for generating a GCN are:
1. Sort the cells in your scRNA-Seq dataset according to the
pseudotime you calculated using Monocle.
2. In order to apply LEAP, the scRNA-Seq expression counts
must be normalized using any of the methods described
above. Usually, TMM is recommended.
3. Apply the MAC_counter() function to your normalized dataset
to generate the correlation matrix that the GCN will be based
on. By maximizing over all possible time lags, the correlations
found by LEAP can often be larger than a traditional Pearson’s
correlation coefficient (see Note 8). The following is the R code
to estimate the directed GCN:

>MAC_results = MAC_counter(data=filter_data,
max_lag_prop=1/3, MAC_cutoff=0.2, lag_matrix=T)

It is important to notice that, in general, the gene pairs (i,j) and
(j,i) will not have the same maximum absolute correlation. This
is because for the pair (i,j), the maximum absolute correlation
measures the co-expression when gene j's expression is delayed
compared to the expression of gene i. As a result, LEAP is able
to capture directional relationships, and these directional
relationships likely imply regulatory relationships. Therefore,
LEAP is able to extend the information contained within
traditional GCNs by incorporating enhancing, inhibiting, and
co-expression relationships between genes into one single network.
4. Next we need to determine the direction of the correlations. The
sign of the maximum correlation and the lag at which it occurs
together determine the direction of the regulatory relationship
between a pair of genes. For example, a positive correlation
between a pair A and B occurring at a nonzero lag suggests that
gene A enhances gene B, meaning that an increase in gene A is
causing an increase in gene B's expression. Conversely, a
negative correlation between a pair A and B occurring at a
nonzero lag suggests that an increase in gene A is causing a
decrease in gene B's expression. Note that if these maximum
correlations occurred when the pair was considered as gene B and
gene A, then B would be up- or downregulating gene A. Finally, if
the maximum correlation occurs when the lag is 0, then a
co-expression relationship has been captured (as is identified
by traditional correlation-based GCNs), which suggests that gene
A and gene B are both regulated by a third gene. These lag
results are generated by the MAC_counter function and saved
as a “lag” file.
5. Finally, LEAP incorporates a method to determine the
appropriate cutoff to use to construct the network while
controlling false discoveries. To estimate the false discovery
rate (FDR), each gene i's normalized expression counts are
permuted, by default 100 times. For each permutation, the
maximum absolute correlations are recalculated. For each
correlation cutoff, an FDR is then estimated. These estimated
FDRs can then be used to determine the correlation cutoff that
controls the false discoveries at the desired rate.
>MAC_perm(data=filter_data, MACs_observ=MAC_results, num_perms=100,
max_lag_prop=1/3, FDR_cutoffs=101)

This cutoff can then be applied to the maximum absolute

correlation matrix to create a directed GCN for the scRNA-
Seq data.
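
The two ideas in the steps above, maximizing the correlation over lags and estimating an FDR by permutation, can be sketched in base R. This is an illustration of the principle only and not the LEAP package code: the helper names (max_abs_cor, mac_matrix), the fixed window size, the simulated data, the 20 permutations, and the cutoff of 0.6 are all assumptions. The columns are assumed to be already sorted by pseudotime.

```r
set.seed(5)
n <- 120; lag_true <- 8; n_genes <- 6
signal <- as.numeric(arima.sim(list(ar = 0.9), n = n + lag_true))

# Rows are genes, columns are cells ordered by pseudotime
expr <- rbind(
  signal[(lag_true + 1):(n + lag_true)] + rnorm(n, sd = 0.3),  # gene1
  signal[1:n] + rnorm(n, sd = 0.3),                            # gene2 follows gene1
  matrix(rnorm((n_genes - 2) * n), nrow = n_genes - 2)         # unrelated genes
)
rownames(expr) <- paste0("gene", seq_len(n_genes))

# Correlation of x with y delayed by 0..max_lag cells; returns the largest in
# absolute value and the lag at which it occurs
max_abs_cor <- function(x, y, max_lag) {
  window <- length(x) - max_lag            # same window size for every lag
  cors <- sapply(0:max_lag, function(l) cor(x[1:window], y[(1 + l):(window + l)]))
  best <- which.max(abs(cors))
  c(cor = cors[best], lag = best - 1)
}

# Matrix of maximum absolute correlations for all ordered pairs (i, j)
mac_matrix <- function(expr, max_lag) {
  g <- nrow(expr)
  out <- matrix(0, g, g, dimnames = list(rownames(expr), rownames(expr)))
  for (i in seq_len(g)) for (j in seq_len(g)) {
    if (i != j) out[i, j] <- abs(max_abs_cor(expr[i, ], expr[j, ], max_lag)["cor"])
  }
  out
}

max_lag  <- floor(n / 3)
best_12  <- max_abs_cor(expr["gene1", ], expr["gene2", ], max_lag)
observed <- mac_matrix(expr, max_lag)      # note: not symmetric

# Permutation FDR at one cutoff: shuffle each gene's cells independently
cutoff <- 0.6
null_counts <- replicate(20, {
  permuted <- t(apply(expr, 1, sample))
  sum(mac_matrix(permuted, max_lag) >= cutoff)
})
fdr <- mean(null_counts) / max(sum(observed >= cutoff), 1)

print(best_12)
print(fdr)
```

The entry for (gene1, gene2) exceeds the entry for (gene2, gene1), which is how the direction of the relationship is read from the matrix.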

4 Notes

1. ERCC Spike-Ins. It has been observed that normalization
methods that incorporate spike-in External RNA Control
Consortium (ERCC) levels tend to provide better removal of the
frequently high technical noise found in scRNA-Seq [24]. If
ERCCs have been incorporated in your scRNA-Seq data, then
methods such as remove unwanted variation (RUV) [25] and the
gamma regression model [24] should be used.

2. Log-transformation. It can often be helpful to apply an addi-

tional log(x + 1) transformation to expression data after nor-
malization to reduce the effects of particularly large expression
values.
3. Choosing cutoff for identifying highly expressed genes: The choice
of cutoffs for average expression and scaled interquartile range
should be determined based on each particular scRNA-Seq
dataset, and in practice must often be adjusted to identify the
most informative and practical set of genes. For most methods it
is also important to reduce the number of genes to a size that
is computationally feasible. In practice, one thousand genes or
fewer works well for most methods on a typical laptop computer.
4. Non-correlation-based GCN construction methods. There exist
many methods for constructing GCNs from RNA-Seq data
directly that do not involve estimating the correlation between
genes. Instead, they use more sophisticated approaches such as
Markov Random Fields, partial correlation, and mutual
information. A downside to these methods is that they tend to
require much more computational power and time to run on
the large datasets common to RNA-Seq data, often working
effectively on a set of only a few dozen to a few hundred genes.
Three of the most widely used non-correlation-based methods
are XMRF, GeneNet, and ARACNE.

XMRF is an R package that includes four Markov Random Field
(Markov Network)-based methods for constructing GCNs [26]. The
methods are: a regular Poisson graphical model, which is only
able to capture negative conditional dependencies between genes;
a truncated Poisson graphical model that reduces the effects of
large counts while allowing for positive and negative
conditional dependencies; a sublinear Poisson graphical model
that softens the reductions of the large counts; and finally, a
local Poisson graphical model that, through the use of
approximation, is in practice much faster [27]. Consequently,
the package authors generally recommend the use of the local
Poisson graphical model (LPGM) for RNA-Seq data.

GeneNet is an R package that constructs “causal” or directed
networks by converting a correlation network into a partial
correlation network [28]. By doing so, these networks measure
the relationships between genes after removing the effects of
other genes on their correlation. This method was originally
designed for use on microarray data, so once again RNA-Seq data
must be normalized using the methods described above before
GeneNet can be used to construct a GCN. One potential concern
for GeneNet is that to perform the transformation from
correlation matrix to partial-correlation matrix, the original
correlation matrix must be positive definite. Usually, the large
size of RNA-Seq datasets resolves this issue as long as the
number of experiments in the dataset is much larger than the
number of genes.

The third method, Algorithm for the Reconstruction of Accurate
Cellular Networks (ARACNE), is information-theory based [29]. It
has been shown that by using mutual information, a more accurate
depiction of network relationships can be found when working
with microarray data. ARACNE uses mutual information to
determine dependencies among genes, essentially measuring how
much knowing the expression of one gene reduces the uncertainty
of the expression of the other gene. A benefit of using mutual
information to construct GCNs is that there is no monotonic
relationship assumption, allowing them to capture nonlinear
dependencies missed by correlation-based networks. This method
was also originally designed for use on microarray data, so once
again RNA-Seq data must be normalized before applying ARACNE.
Another potential downside to the use of this algorithm is the
longer computational time.
5. Other directed construction methods. Ocone et al. [30] have
described a framework similar to LEAP that uses pseudotime but
relies on ordinary differential equation (ODE)-based methods to
construct GCNs directly from scRNA-Seq data. Although it is not
available as an R package, Matlab and C++ code implementing it
are available from the authors. The framework begins with
dimension reduction. Ocone et al. recommend using the nonlinear
method diffusion map [31] for dimension reduction prior to
estimating pseudotime. After dimension reduction, an ad hoc
clustering method is applied to separate branches associated
with different cellular processes. This clustering depends on
the user's choice of an initial cell, which can generally be
identified by examining the visual layout of the diffusion map.
Wanderlust is then applied to each branch identified through
clustering. It is important to note that Wanderlust should be
applied to the original expression data, not the
dimension-reduced data. As a guide for the ODE model selection,
the authors recommend first estimating an approximate network
structure by applying GENIE3 [32]. This is then used as a
starting place for the ODE model estimation, thereby reducing
the number of models that would need to be compared. Finally,
through MCMC iterations, the parameters for each ODE model are
estimated using the pseudotime ordering of the cells in each
branch. The best model is selected by comparing AIC, BIC, and
Bayes' factors.

An alternative method, PIDC, is based on mutual information
(MI) and uses partial information decomposition; it can be
combined with clustering methods or pseudotime orderings to
infer GCNs for scRNA-Seq data [33]. This method, through its use
of MI, is able to capture nonlinear relationships between pairs
and groups of genes. The first three steps of the ODE framework
described above can be used to identify a computationally
feasible subset of genes to work with. Although PIDC is not
available as an R package, Julia code implementing it is
available from the authors.

A third method is designed for experiments collected at specific
time points. These time points may also contain important
information for constructing GCNs. The algorithm SINCERITIES
[34] uses ridge regression and partial correlation analysis to
directly incorporate temporal changes in expression that are
observed through these time points. Though it is not implemented
as an R package, both R and Matlab code are available from the
authors.
6. Wanderlust. Another popular pseudotime estimation method
is Wanderlust. Instead of using ICA, Wanderlust takes the
high-dimensional data and transforms it into a nearest-neighbor
graph, meaning that cells with similar expression profiles
will be connected [12]. It then repeatedly identifies the
shortest path through the graph and takes the average, using a
cell's placement along this average path to determine its
pseudotime.
7. Estimating Pseudotime with Monocle: Monocle's demonstrated
effectiveness and ease of implementation make it a convenient
choice. However, when using Monocle it is important to
pay attention to the “states” into which cells are classified and
note that pseudotimes from one state do not correspond to the
times in other states. As a result, each state should be treated
separately. In practice, there usually is a state that captures
most cells, and hence analysis may be restricted to those cells.
8. Choosing window size for LEAP: Note that the default for this
function is a maximum lag of one-third of the number of cells
present in the dataset (max_lag_prop = 1/3), so that each
correlation is computed over a window of at least two-thirds of
the cells. Allowing substantially larger lags may result in
correlations that are artifacts of noise in the expression
profiles rather than true biological effects.
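
Note 4 describes partial correlation as measuring the relationship between two genes after removing the effects of the other genes. A minimal base-R sketch of that idea, using a simulated three-gene chain, is given below. The data and gene names are hypothetical, and the plain matrix inverse used here is a simplification of what a package such as GeneNet does; it also shows why the correlation matrix must be positive definite, since otherwise it cannot be inverted.

```r
set.seed(9)
n <- 500
gene_x <- rnorm(n)
gene_y <- gene_x + rnorm(n, sd = 0.5)   # X drives Y
gene_z <- gene_y + rnorm(n, sd = 0.5)   # Y drives Z; X affects Z only through Y

expr <- cbind(gene_x, gene_y, gene_z)   # samples in rows, genes in columns

cor_mat <- cor(expr)

# Partial correlations from the inverse correlation matrix:
# pcor_ij = -P_ij / sqrt(P_ii * P_jj), where P = solve(cor_mat)
precision <- solve(cor_mat)             # fails if cor_mat is not positive definite
pcor_mat  <- -cov2cor(precision)
diag(pcor_mat) <- 1

print(round(cor_mat, 2))
print(round(pcor_mat, 2))
```

The ordinary correlation between gene_x and gene_z is high, but their partial correlation is close to 0, because the link between them is fully explained by gene_y.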

References

1. Wolfe C, Kohane I, Butte A (2005) Systematic survey reveals general applicability of “guilt-by-association” within gene coexpression networks. BMC Bioinformatics 6(1):227
2. Stuart JM, Segal E, Koller D, Kim SK (2003) A gene-coexpression network for global discovery of conserved genetic modules. Science 302(5643):249–255
3. Schafer J, Strimmer K (2005) An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics 21(6):754–764
4. Lee HK et al (2004) Coexpression analysis of human genes across many microarray data sets. Genome Res 14(6):1085–1094
5. Persson H et al (2005) Identification of genes required for cellulose synthesis by regression analysis of public microarray data sets. Proc Natl Acad Sci U S A 102(24):8633–8638
6. Basso K et al (2005) Reverse engineering of regulatory networks in human B cells. Nat Genet 37(4):382–390
7. Munsky B, Neuert G, van Oudenaarden A (2012) Using gene expression noise to understand gene regulation. Science 336(6078):183–187
8. Trapnell C et al (2014) The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat Biotechnol 32(4):381–386
9. Campbell K, Yau C (2015) Bayesian Gaussian process latent variable models for pseudotime inference in single-cell RNA-seq data. bioRxiv. p 026872
10. Reid JE, Wernisch L (2016) Pseudotime estimation: deconfounding single cell time series. Bioinformatics 32(19):2973–2980
11. Campbell K, Ponting CP, Webber C (2015) Laplacian eigenmaps and principal curves for high resolution pseudotemporal ordering of single-cell RNA-seq profiles. bioRxiv. p 027219
12. Bendall SC et al (2014) Single-cell trajectory detection uncovers progression and regulatory coordination in human B cell development. Cell 157(3):714–725
13. Garber M et al (2011) Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods 8(6):469
14. Bullard JH et al (2010) Evaluation of statistical methods for normalization and differential expression in mRNA-seq experiments. BMC Bioinformatics 11(1):94
15. Robinson MD, Oshlack A (2010) A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol 11(3):R25
16. Robinson MD, McCarthy DJ, Smyth GK (2010) edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26:139–140
17. Langfelder P, Horvath S (2008) WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 9(1):559
18. Iancu D et al (2012) Utilizing RNA-seq data for de novo coexpression network inference. Bioinformatics 28(12):1592–1597
19. Kim H et al (2013) Peeling back the evolutionary layers of molecular mechanisms responsive to exercise-stress in the skeletal muscle of the racing horse. DNA Res 20(3):287–298
20. Xue Z et al (2013) Genetic programs in human and mouse early embryos revealed by single-cell RNA sequencing. Nature 500(7464):593
21. Specht AT, Li J (2015) Estimation of gene co-expression from RNA-seq count data. Stat Interface 8(4):507–515
22. Li J, Lamere AT (2018) DiPhiSeq: Robust comparison of expression levels on RNA-Seq data with large sample sizes. Paper presented at the Joint Statistical Meetings, Vancouver, CA, 28 July–2 Aug 2018
23. Specht AT, Li J (2016) LEAP: constructing gene co-expression networks for single-cell RNA-sequencing data using pseudotime ordering. Bioinformatics 33(5):764–766
24. Ding B, Zheng L, Wang W (2017) Assessment of single cell RNA-seq normalization methods. G3 (Bethesda) 7(7):2039–2045
25. Risso D et al (2014) Normalization of RNA-seq data using factor analysis of control genes or samples. Nat Biotechnol 32(9):896
26. Wan YW et al (2016) XMRF: an R package to fit Markov networks to high-throughput genetics data. BMC Syst Biol 10(3):69
27. Allen GI, Liu Z (2013) A local Poisson graphical model for inferring networks from sequencing data. IEEE Trans Nanobioscience 12(3):189–198
28. Opgen-Rhein R, Strimmer K (2007) From correlation to causation networks: a simple approximate learning algorithm and its application to high-dimensional plant gene expression data. BMC Syst Biol 1(1):37
29. Margolin AA et al (2006) ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics 7(1):S7
30. Ocone A et al (2015) Reconstructing gene regulatory dynamics from high-dimensional single-cell snapshot data. Bioinformatics 31(12):i89–i96
31. Coifman RR et al (2005) Geometric diffusions as a tool for harmonic analysis and structure definition of data: diffusion maps. Proc Natl Acad Sci U S A 102(21):7426–7431
32. Huynh-Thu VA et al (2010) Inferring gene regulatory networks from expression data using tree-based methods. PLoS One 5(9):e12776
33. Chan TE et al (2017) Gene regulatory network inference from single-cell data using multivariate information measures. Cell Syst 5(3):251–267
34. Papili Gao N et al (2017) SINCERITIES: inferring gene regulatory networks from time-stamped single cell transcriptional expression profiles. Bioinformatics 34(2):258–266

Chapter 11

Single-Cell Allele-Specific Gene Expression Analysis

Meichen Dong and Yuchao Jiang

Abstract
Allele-specific expression is traditionally studied by bulk RNA sequencing, which measures average
gene expression across cells. Single-cell RNA sequencing (scRNA-seq) allows the comparison of expression
distribution between the two alleles of a diploid organism, and characterization of allele-specific bursting.
Here we describe SCALE, a bioinformatic and statistical framework for allele-specific gene expression
analysis by scRNA-seq. SCALE estimates genome-wide bursting kinetics at the allelic level while accounting
for technical bias and other complicating factors such as cell size. SCALE detects genes with significantly
different bursting kinetics between the two alleles, as well as genes where the two alleles exhibit non-
independent bursting processes. Here, we illustrate SCALE on a mouse blastocyst single-cell dataset with
step-by-step demonstration from the upstream bioinformatic processing to the downstream biological
interpretation of SCALE’s output.

Key words Single-cell RNA sequencing, Allele-specific expression, Transcriptional bursting, Techni-
cal variability

1 Introduction

In diploid eukaryotic organisms, each autosomal gene has two cop-

ies/alleles: one inherited from the mother and one from the father
[1–4]. Allele-specific expression (ASE) refers to the phenomenon
that gene expression is unbalanced between the two alleles, and is
found, in its extreme, in genomic imprinting, where the allele from
one parent is uniformly silenced across cells, and in X-chromosome
inactivation, where one of the two X-chromosomes in females is
randomly silenced. During the past decade, ASE has been studied
by bulk RNA sequencing (RNA-seq), where mean expression differ-
ences between the two alleles have been found across cells, in the
form of allelic imbalance [5, 6]. Recent developments in single-cell
RNA-seq (scRNA-seq) have made possible the better characteriza-
tion of allelic differences in gene expression, which circumvent the
averaging artifacts associated with traditional bulk population data
[2, 4, 7, 8]. ASE analysis by scRNA-seq allows the comparison of
expression distribution between the two alleles and characterization

Guo-Cheng Yuan (ed.), Computational Methods for Single-Cell Data Analysis, Methods in Molecular Biology, vol. 1935,
https://doi.org/10.1007/978-1-4939-9057-3_11, © Springer Science+Business Media, LLC, part of Springer Nature 2019


of allele-specific transcriptional bursting, which is a fundamental

property of gene expression where transcription from DNA to
RNA occurs in bursts [7–10].
Here we describe SCALE (Single-Cell ALlelic Expression) [7],
a systematic bioinformatic and statistical framework to study ASE in
single cells by examining allele-specific transcriptional bursting
kinetics. SCALE is comprised of three main steps. First, based on
allele-specific read counts across cells, we adopt an empirical Bayes
method to classify genes into three categories: silent, monoalleli-
cally expressed, or biallelically expressed. Next, for genes classified
as biallelically expressed, we use a Poisson–Beta hierarchical model
to estimate allele-specific bursting parameters while adjusting for
dropout events, amplification and sequencing bias, and other com-
plicating factors such as cell size. Finally, we apply a nonparametric
bootstrap testing method to examine whether the two alleles of a
gene share the same bursting parameters and perform a Chi-square
test for independent bursting between the two alleles.
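
To give some intuition for the Poisson–Beta model mentioned above, the following base-R sketch simulates counts for two alleles with similar mean expression but different bursting behavior. It assumes the commonly used form of the model, in which a latent activity level is drawn from a Beta distribution and the count is Poisson given that level; the parameter names (kon, koff, s) and all values are illustrative assumptions, and none of SCALE's adjustments for technical variability are included.

```r
set.seed(21)
n_cells <- 1000

# Poisson-Beta simulation for one allele:
#   p ~ Beta(kon, koff)   latent activity level in [0, 1]
#   y ~ Poisson(s * p)    transcript count in each cell
simulate_allele <- function(n_cells, kon, koff, s) {
  p <- rbeta(n_cells, shape1 = kon, shape2 = koff)
  rpois(n_cells, lambda = s * p)
}

allele_a <- simulate_allele(n_cells, kon = 0.2, koff = 2, s = 100)  # rare, large bursts
allele_b <- simulate_allele(n_cells, kon = 2,   koff = 2, s = 20)   # frequent, small bursts

print(c(mean_a = mean(allele_a), mean_b = mean(allele_b)))
print(c(zero_frac_a = mean(allele_a == 0), zero_frac_b = mean(allele_b == 0)))
```

The two alleles have comparable average expression, which is all that bulk data would show, yet the first is silent in a much larger fraction of cells. This is the kind of distributional difference that single-cell data make visible.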

2 Materials

In this section, we outline the required materials for conducting

ASE analysis by SCALE, including data input, computational envi-
ronment, and software packages. In the Methods section that
follows, we demonstrate how to carry out ASE analysis by
scRNA-seq, from bioinformatic processing of the raw sequencing
files to the application and result interpretation of SCALE. The
bioinformatic pipeline and the R package for SCALE are available at
https://github.com/yuchaojiang/SCALE.

2.1 Data Input

The data input for SCALE includes raw sequencing files from
scRNA-seq studies (as fastq), as well as the corresponding genome
assembly (as fasta). Based on the gene body coverage of the
sequenced reads, scRNA-seq protocols can be classified into two
categories: full-transcript methods (e.g., Smart-seq [11] and
Smart-seq2 [12]) and tag methods (e.g., Drop-seq [13] and the 10X
Genomics Chromium Single Cell 3′ Solution [14]). To study ASE at
germline heterozygous loci, a full-transcript scRNA-seq protocol
such as Smart-seq2 is recommended due to its broad coverage. To
account for biases that are introduced during the library
preparation and sequencing steps, SCALE by default relies on
external spike-ins, whose known concentrations are used as ground
truth for adjustment [15]. When spike-ins are not readily
available, imputation-based methods such as SAVER [16] and
scImpute [17] can be adopted to recover the true underlying
expression distribution.

2.2 Computational Environment

The upstream bioinformatic processing of raw sequencing data
requires a Linux or Unix platform, typically on a high-performance
computing (HPC) system, and returns read count matrices as
input for SCALE (see Subheading 3.1 for more details). The
downstream ASE analysis can then be run locally on a Windows or
Macintosh machine where R is installed. For faster computation,
SCALE can be run in parallel on an HPC cluster.

2.3 Software Packages

For read alignment, BWA [18] and STAR [19] are required for
DNA and RNA sequencing, respectively. Picard Tools
(http://broadinstitute.github.io/picard) and SAMtools [20] are required for
deduplication and quality control. The Genome Analysis Toolkit
(GATK) [21] is adopted in our proposed pipeline by default to
identify germline heterozygous loci. The R package SCALE, as well as
its dependencies tsne (https://cran.r-project.org/package=tsne)
and rje (https://cran.r-project.org/package=rje), is required for
performing ASE analysis in R.

3 Methods

Figure 1 gives an overview of the analysis pipeline by SCALE. We
start with bioinformatic preprocessing, which returns allele-specific
read counts for the single cells. SCALE takes as input the allelic
coverage at germline heterozygous loci and carries out three major
steps: (1) gene classification using an empirical Bayes method;
(2) estimation of allele-specific bursting kinetics using a
Poisson–Beta hierarchical model with adjustment of technical variability;
and (3) hypothesis testing of the two alleles of a gene to determine
if they have different bursting kinetics and/or nonindependent
bursting.

3.1 Bioinformatics Pipeline

For endogenous RNAs, germline heterozygous loci are called from
bulk DNA-seq or RNA-seq of the same cells following the GATK best
practices [21]. If bulk sequencing from the same tissue is not
available, scRNA-seq data can be aggregated to generate a
pseudo-bulk RNA-seq sample. STAR [19] and BWA [18] are used for read
alignment for RNA-seq and DNA-seq, respectively, followed by a
deduplication and quality control procedure. We then force call
single-cell allele-specific read counts using the mpileup command
of SAMtools [20]. These allele-specific read counts are then
used as input to SCALE.
For external spike-ins, SCALE takes as input the true
concentrations and lengths of the spike-in molecules, as well as the depth
of coverage for each spike-in sequence across cells. The true number
of molecules of each spike-in is calculated from the
known concentration (denoted as μ attomoles/μL) and the
dilution factor (e.g., 40,000):
[Figure 1: flowchart. scRNA-seq and bulk RNA-seq fastq files are aligned
with STAR, and bulk DNA-seq fastq files with BWA, to produce bam files.
GATK returns germline heterozygous loci (vcf), and mpileup returns
allele-specific read counts. Empirical Bayes gene categorization labels
genes as monoallelic, biallelic (bursty or constitutive), or silent.
Allele-specific transcriptional kinetics are estimated for allele A and
allele B. Hypothesis testing covers differential allele-specific bursting
frequency and size, and nonindependent (coordinated or repulsed)
allele-specific bursting.]

Fig. 1 Overview of analysis pipeline of SCALE. SCALE takes as input allele-specific read counts at germline
heterozygous loci and carries out three major steps: gene classification, estimation of allele-specific bursting
kinetics, and hypothesis testing of differential and nonindependent allelic bursting

\[
\frac{\mu \times 10^{-18}\ \text{moles}/\mu\text{L} \times 6.02214 \times 10^{23}\ \text{mole}^{-1}\ (\text{Avogadro constant})}{40{,}000\ (\text{dilution factor})}.
\]

The observed number of reads for each spike-in is calculated by
adjusting for the library size factor, the read length, and the length
of the spike-in RNA.
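The conversion from stock concentration to molecule count given by the formula above can be sketched in R as follows. This is a minimal illustration, not part of the SCALE package: the function name `spikein_molecules` and the example stock concentrations are hypothetical, and only the Avogadro constant and the example dilution factor of 40,000 come from the text.

```r
# Convert a stock spike-in concentration (attomoles per microliter) into
# the number of molecules per microliter after dilution.
# `spikein_molecules` is a hypothetical helper, not a SCALE function.
spikein_molecules <- function(conc_attomoles_per_uL, dilution = 40000) {
  avogadro <- 6.02214e23                  # molecules per mole
  moles_per_uL <- conc_attomoles_per_uL * 1e-18
  moles_per_uL * avogadro / dilution
}

# Hypothetical stock concentrations for three spike-in species
stock <- c(spike_1 = 1, spike_2 = 100, spike_3 = 30000)
spikein_molecules(stock)
```

The resulting values are what would populate the first column of the spike-in input matrix described in Subheading 3.2, assuming the same dilution applies to every spike-in species.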
The bash scripts for bioinformatic preprocessing to generate
the input for SCALE are available at
https://github.com/yuchaojiang/SCALE/tree/master/bioinfo. Here, we outline the
bioinformatic pipeline in three parts: (1) profile germline
heterozygous loci by bulk DNA-seq; (2) profile allele-specific read counts at
germline heterozygous loci by scRNA-seq; and (3) generate input
for exogenous spike-ins for adjustment of technical variability.
#################################################################
# 1. Profile germline heterozygous loci
#################################################################
# 1.1. Index the genome template
bwa index ucsc.hg19.fasta
# 1.2. Align reads
bwa mem -M -t 16 ucsc.hg19.fasta bulk_R1.fastq bulk_R2.fastq > bulk.sam
# 1.3. Convert sam to bam and sort
samtools view -bS bulk.sam > bulk.bam
java -jar SortSam.jar INPUT=bulk.bam OUTPUT=bulk.sorted.bam \
    SORT_ORDER=coordinate
# 1.4. Add read group
java -jar AddOrReplaceReadGroups.jar INPUT=bulk.sorted.bam \
    OUTPUT=bulk.sorted.rg.bam RGID=LANE1 RGLB=LIB1 RGPL=ILLUMINA \
    RGPU=UNIT1 RGSM=bulk
samtools index bulk.sorted.rg.bam
# 1.5. Dedup
java -jar MarkDuplicates.jar INPUT=bulk.sorted.rg.bam \
    OUTPUT=bulk.sorted.rg.dedup.bam \
    METRICS_FILE=bulk.sorted.rg.dedup.metrics.txt \
    PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_VERSION=null \
    PROGRAM_GROUP_NAME=MarkDuplicates
java -jar BuildBamIndex.jar INPUT=bulk.sorted.rg.dedup.bam
# 1.6. Profile germline heterozygous loci by GATK HaplotypeCaller
java -jar GenomeAnalysisTK.jar -R ucsc.hg19.fasta -T HaplotypeCaller \
    -I bulk.sorted.rg.dedup.bam \
    -o bulk.sorted.rg.dedup.raw.snps.indels.g.vcf
#################################################################
# 2. Profile allele-specific read counts by scRNA-seq
#################################################################
# 2.1. Get splice junction database
wget http://labshare.cshl.edu/shares/gingeraslab/www-data/dobin/STAR/STARgenomes/GENCODE/Old/gencode.v14.annotation.gtf.sjdb
# 2.2. Generate the genome using STAR, 100bp paired-end sequencing
genomeDir=directory_to_genome
STAR --runMode genomeGenerate --genomeDir $genomeDir \
    --genomeFastaFiles hg19.fa \
    --sjdbFileChrStartEnd gencode.v14.annotation.gtf.sjdb \
    --sjdbOverhang 99 --runThreadN 4
# 2.3. Align reads
genomeDir=directory_to_genome
STAR --genomeDir $genomeDir --readFilesIn samp_1.fastq samp_2.fastq \
    --outFilterIntronMotifs RemoveNoncanonicalUnannotated \
    --outFileNamePrefix samp_ --runThreadN 4
# 2.4. Convert sam to bam, filter, and sort
samtools view -bS samp_Aligned.out.sam > samp_Aligned.out.bam
perl filter_sam_v2.pl samp_Aligned.out.bam samp_Aligned.out.filtered.sam
samtools view -bS samp_Aligned.out.filtered.sam > \
    samp_Aligned.out.filtered.bam
java -Xmx30G -jar SortSam.jar INPUT=samp_Aligned.out.filtered.bam \
    OUTPUT=samp_Aligned.out.filtered.sorted.bam SORT_ORDER=coordinate
# 2.5. Add read group and index
java -Xmx30G -jar AddOrReplaceReadGroups.jar \
    INPUT=samp_Aligned.out.filtered.sorted.bam \
    OUTPUT=samp_Aligned.out.filtered.sorted.rg.bam RGID=LANE2 \
    RGLB=LIB2 RGPL=ILLUMINA RGPU=UNIT2 RGSM=samp
samtools index samp_Aligned.out.filtered.sorted.rg.bam
# 2.6. Parse file: position.txt contains all the heterozygous
# loci (chr + coordinate) returned by GATK HaplotypeCaller using
# bulk DNA-seq
samtools mpileup -E -f hg19.fa -d 1000000 --positions position.txt \
    samp_Aligned.out.filtered.sorted.rg.bam > \
    samp_Aligned.out.filtered.sorted.rg.mpileup
perl pileup2base_no_strand.pl \
    samp_Aligned.out.filtered.sorted.rg.mpileup 30 \
    samp_Aligned.out.filtered.sorted.rg.parse30.txt
#################################################################
# 3. Generate input for exogenous spike-ins
#################################################################
# 3.1. Concatenate ERCC with genome (hg19) and index
cat ERCC92.fa hg19.fa > hg19_ERCC.fa
java -jar CreateSequenceDictionary.jar R=hg19_ERCC.fa O=hg19_ERCC.dict
samtools faidx hg19_ERCC.fa
# 3.2. Generate the genome using STAR, 50bp paired-end sequencing
# (--sjdbOverhang is ideally read length minus 1; adjust the value
# below to match your reads)
genomeDir=directory_to_ERCC_genome
STAR --runMode genomeGenerate --genomeDir $genomeDir \
    --genomeFastaFiles hg19_ERCC.fa \
    --sjdbFileChrStartEnd gencode.v14.annotation.gtf.sjdb \
    --sjdbOverhang 99 --runThreadN 4
# 3.3. Align reads
genomeDir=directory_to_ERCC_genome
STAR --genomeDir $genomeDir --readFilesIn samp_1.fastq samp_2.fastq \
    --outFilterIntronMotifs RemoveNoncanonicalUnannotated \
    --outFileNamePrefix samp_ --runThreadN 4
# 3.4. Convert sam to bam, filter, and sort
samtools view -bS samp_Aligned.out.sam > samp_Aligned.out.bam
perl filter_sam_v2.pl samp_Aligned.out.bam samp_Aligned.out.filtered.sam
samtools view -bS samp_Aligned.out.filtered.sam > \
    samp_Aligned.out.filtered.bam
java -Xmx30G -jar SortSam.jar INPUT=samp_Aligned.out.filtered.bam \
    OUTPUT=samp_Aligned.out.filtered.sorted.bam SORT_ORDER=coordinate
# 3.5. Add read group and index
java -Xmx30G -jar AddOrReplaceReadGroups.jar \
    INPUT=samp_Aligned.out.filtered.sorted.bam \
    OUTPUT=samp_Aligned.out.filtered.sorted.rg.bam RGID=LANE2 \
    RGLB=LIB2 RGPL=ILLUMINA RGPU=UNIT2 RGSM=samp
samtools index samp_Aligned.out.filtered.sorted.rg.bam
# 3.6. Get total read counts as well as read counts for ERCC
samtools view -c samp_Aligned.out.filtered.sorted.rg.bam
# ercc.id lists one spike-in sequence name per line;
# rg.bam.list lists one bam file per line
while read ercc; do
    echo "$ercc"
    while read bam; do
        samtools view -c "$bam" "$ercc"
    done < rg.bam.list > "$ercc.txt"
done < ercc.id

3.2 Installation and Data Input

The R package for SCALE can be installed directly from GitHub.
The analysis by SCALE requires scRNA-seq data of cells from a
homogeneous cell population (i.e., the same cell type from the
same tissue). Each allele at heterozygous loci should have an
expression matrix with rows as genes and columns as cells. In addition to
the read count matrices for the endogenous RNAs, an input matrix
for the spike-ins is needed for capturing technical variability. This
matrix should have rows as spike-ins, the first column as the true
number of molecules, the second column as the lengths of the
molecules, and the third column onward as the observed read
counts across cells. Note that spike-ins are not required for each
individual cell (see Note 1). Here, we demonstrate the analysis
framework of SCALE on an scRNA-seq data set of 122 mouse
blastocyst cells from Deng et al. [2].
# SCALE installation
install.packages(c("rje", "tsne", "devtools"))
devtools::install_github("yuchaojiang/SCALE/package")
# Data input
library(SCALE)
data(mouse.blastocyst) # scRNA-seq dataset of mouse blastocyst
alleleA <- mouse.blastocyst$alleleA # Read counts for allele A
alleleB <- mouse.blastocyst$alleleB # Read counts for allele B
spikein_input <- mouse.blastocyst$spikein_input # Spike-in input matrix
genename <- rownames(alleleA) # Rows correspond to genes
sampname <- colnames(alleleA) # Columns correspond to samples
# Input matrix for spike-ins
spikein_input[1:4,1:4]
            spikein_mol spikein_length GSM1112664 GSM1112665
RNA_SPIKE_1 12165.65657            755     462995     180148    1897542
RNA_SPIKE_2 12165.65657            755     336378     112809    1324062
RNA_SPIKE_3   912.42424           1003      18706       6062      88117
RNA_SPIKE_4   912.42424           1000      13689       6733      85575

3.3 Quality Control

Quality control (QC) procedures are recommended to filter out
both poor-quality cells and extreme genes before applying SCALE.
Cell QC metrics may include the library size factor, which is
defined as

\[
\eta_c = \underset{g}{\mathrm{median}}\;
\frac{Q^A_{cg} + Q^B_{cg}}
{\left[\prod_{c^*=1}^{C} \left(Q^A_{c^* g} + Q^B_{c^* g}\right)\right]^{1/C}},
\]

where $\eta_c$ is the library size factor for cell c; $Q^A_{cg}$ and
$Q^B_{cg}$ denote the observed expression levels of gene g in cell c
for allele A and allele B, respectively; and C is the total number of
cells. Cells with extreme ratios of reads that map to spike-ins versus
endogenous genes should be excluded, which is equivalent to removing
cells with extreme cell sizes.
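The library size factor defined above is a median-of-ratios statistic, and a minimal R sketch of it is given below. The function name `library_size_factor` is hypothetical and is not part of SCALE. Restricting the calculation to genes with nonzero total counts in every cell is an assumption made here so that the geometric mean in the denominator is nonzero; the chapter does not state how zeros are handled.

```r
# Median-of-ratios library size factor, one value per cell.
# `library_size_factor` is a hypothetical helper, not a SCALE function.
library_size_factor <- function(alleleA, alleleB) {
  total <- as.matrix(alleleA) + as.matrix(alleleB)   # Q^A + Q^B, genes x cells
  # Assumption: keep only genes observed in every cell, so that the
  # geometric mean across cells is strictly positive.
  keep <- rowSums(total == 0) == 0
  if (!any(keep)) stop("no gene is expressed in all cells")
  log_total <- log(total[keep, , drop = FALSE])
  log_geo_mean <- rowMeans(log_total)                # log geometric mean per gene
  # Median over genes of the per-gene ratio to the geometric mean
  exp(apply(log_total - log_geo_mean, 2, median))
}

# Toy example: the second cell is sequenced twice as deeply as the first
A <- matrix(c(4, 8, 12, 8, 16, 24), nrow = 3)
B <- matrix(c(6, 12, 18, 12, 24, 36), nrow = 3)
library_size_factor(A, B)
```

Cells whose factor lies far from the bulk of the distribution are candidates for removal during QC.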
SCALE needs to be applied to a homogeneous cell population,
where the same bursting kinetics are shared across cells. Possible
heterogeneity due to, for example, cell subgroups, developmental
trajectories, or donor effects can bias the downstream
analysis. Therefore, it is strongly recommended to first apply
dimensionality reduction and clustering methods (e.g., PCA,
t-SNE [22], ZIFA [23], SC3 [24], SIMLR [25], and GiniClust2
[26]) to the gene expression matrix. After clustering, remove
outlier cells and apply SCALE to a homogeneous cell cluster, under
the assumption that bursting kinetics are shared between its cells (see
Note 2 for more details). The input for SCALE is a cell-type-specific
read count matrix for each of the two alleles, with rows being genes and
columns being cells.

3.4 Technical Variability

To account for the substantial technical variability
observed in scRNA-seq data, a hierarchical model based on
the Toolkit for Analysis of Single Cell data (TASC) [15] is fit to
the spike-in data. Specifically, let $Q_{cg}$ and $Y_{cg}$ be the observed and
true expression levels of gene g in cell c, respectively. The
hierarchical model used to model dropout, amplification, and sequencing
bias is:

\[
Q_{cg} \sim Z_{cg}\,\mathrm{Poisson}\!\left(\alpha_c \left(Y_{cg}\right)^{\beta_c}\right),
\]
\[
Z_{cg} \sim \mathrm{Bernoulli}\!\left(\pi_{cg}\right),
\]
\[
\pi_{cg} = \mathrm{expit}\!\left(\kappa_c + \tau_c \log Y_{cg}\right),
\]

where $Z_{cg}$ is a Bernoulli random variable indicating that gene g is
detected in cell c. $\pi_{cg} = P(Z_{cg} = 1)$ is the success probability, which
depends on $\log(Y_{cg})$, the logarithm of the true underlying
expression. $\alpha_c$ models the capture and sequencing efficiency; $\beta_c$ models
the amplification bias; $\kappa_c$ and $\tau_c$ characterize whether a transcript is
successfully captured in the library.
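To make the model above concrete, the short R sketch below evaluates the detection probability and the implied probability of observing zero reads for a given true expression level. Under the model, a zero arises either from dropout or from Poisson sampling, so P(Q = 0) = (1 − π) + π exp(−αY^β). The function names are hypothetical; the parameter values are the estimates displayed in Fig. 2, and the use of the natural logarithm is an assumption.

```r
# Detection probability pi_cg = expit(kappa + tau * log(Y)).
# plogis() is base R's logistic (expit) function.
detection_prob <- function(Y, kappa, tau) {
  plogis(kappa + tau * log(Y))
}

# Probability of observing zero reads: dropout, or detected but
# zero reads drawn from Poisson(alpha * Y^beta).
zero_prob <- function(Y, alpha, beta, kappa, tau) {
  pi_cg <- detection_prob(Y, kappa, tau)
  (1 - pi_cg) + pi_cg * exp(-alpha * Y^beta)
}

# Parameter values as displayed in Fig. 2
Y <- c(10, 100, 1000, 10000)   # hypothetical true molecule counts
round(detection_prob(Y, kappa = -6.304, tau = 1.278), 3)
round(zero_prob(Y, alpha = exp(-3.089), beta = 1.045,
                kappa = -6.304, tau = 1.278), 3)
```

The zero probability falls as the true expression increases, which is the shape of the dropout curve that the spike-ins are used to estimate.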
{α, β, κ, τ} are estimated from the exogenous spike-ins and are
assumed to be shared across cells from the same sequencing batch.
$\alpha_c$ and $\beta_c$ are estimated by fitting a log-linear regression model.
The Nelder–Mead simplex algorithm is then applied to jointly optimize
$\kappa_c$ and $\tau_c$, which model the probability of dropout events (see Note
3). The tech_bias function returns the estimated parameters
and by default generates two plots, one for amplification bias and
the other for dropout, as shown in Fig. 2.
[Figure 2: two panels. (A) Log(observed number of reads) versus
log(true number of molecules) for the spike-ins, with fitted
log(α) = −3.089 and β = 1.045. (B) Dropout: percentage of zero reads
versus log(true expression), with fitted κ = −6.304 and τ = 1.278.]

Fig. 2 Modeling of technical variability and parameter estimation. Amplification and sequencing bias are
modeled and captured by parameters α and β. Estimation is carried out by log-linear regression. Probability of
dropout is modeled by κ and τ and depends on the logarithm of the true expression. Estimation is carried out
by the Nelder–Mead simplex algorithm

# Estimate parameters for technical variability

abkt <- tech_bias(spikein_input = spikein_input,
alleleA = alleleA, alleleB = alleleB,
readlength = 50, pdf = FALSE)
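The log-linear regression step can be illustrated independently of SCALE. Taking logarithms of the Poisson mean gives log(Q) ≈ log(α) + β log(Y), so the intercept and slope of a linear fit on the log scale estimate log(α) and β. The sketch below is an assumption-laden illustration, not the internals of tech_bias: the function name `fit_alpha_beta` is hypothetical, the data are simulated, and spike-ins with zero observed reads are simply excluded from the fit.

```r
# Estimate log(alpha) and beta by regressing log observed reads on
# log true molecule counts. Hypothetical helper, not a SCALE function.
fit_alpha_beta <- function(true_mol, obs_reads) {
  keep <- obs_reads > 0 & true_mol > 0     # log() requires positive values
  fit <- lm(log(obs_reads[keep]) ~ log(true_mol[keep]))
  est <- unname(coef(fit))
  c(log_alpha = est[1], beta = est[2])
}

# Simulated spike-ins with hypothetical true values log(alpha) = -3, beta = 1.05
set.seed(1)
true_mol <- c(10, 50, 100, 500, 1000, 5000, 10000)
obs_reads <- rpois(length(true_mol), lambda = exp(-3) * true_mol^1.05)
fit_alpha_beta(true_mol, obs_reads)
```

Excluding zeros keeps the example short; in the full model the zeros are handled by the dropout component governed by κ and τ.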

3.5 Gene Classification

SCALE adopts an empirical Bayes method that categorizes each
gene as silent, monoallelically expressed, or biallelically
expressed based on its ASE across cells. An expectation-maximization
algorithm is implemented for fast estimation of the
corresponding parameters. The gene_classify function returns
a list of four elements: gene category, proportion of cells
expressing allele A, proportion of cells expressing allele B, and
posterior assignment of cells for each gene. For the posterior
assignment for each gene in each cell, “A” corresponds to cells
expressing A allele only, “B” corresponds to cells expressing B allele
only, “AB” corresponds to cells expressing both alleles, and “Off”
corresponds to cells that are silent for the gene of interest.

# Gene classification
gene.class.obj <- gene_classify(alleleA = as.matrix(alleleA),
alleleB = as.matrix(alleleB))
gene.category <- gene.class.obj$gene.category
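The posterior assignments described above can be summarized per gene into cell counts and allele-specific expression proportions. The sketch below is a hypothetical helper (`summarize_assignment` is not a SCALE function) that takes a vector of per-cell labels for one gene. The proportion of cells expressing an allele is computed here as the cells expressing that allele alone plus the cells expressing both, divided by the number of assigned cells; this definition is an assumption, chosen because it is consistent with the counts and proportions reported in Table 1.

```r
# Summarize per-cell posterior labels ("A", "B", "AB", "Off") for one gene.
# Hypothetical helper, not a SCALE function.
summarize_assignment <- function(labels) {
  lv <- c("A", "B", "AB", "Off")
  counts <- vapply(lv, function(l) sum(labels == l), integer(1))
  n <- sum(counts)
  list(counts = counts,
       A_prop = (counts[["A"]] + counts[["AB"]]) / n,
       B_prop = (counts[["B"]] + counts[["AB"]]) / n)
}

# Cell counts reported for gene Phf7 in Table 1
phf7 <- rep(c("A", "B", "AB", "Off"), times = c(18, 8, 1, 95))
res <- summarize_assignment(phf7)
res$counts
round(c(res$A_prop, res$B_prop), 2)
```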

3.6 Allele-Specific Bursting Kinetics

When studying ASE in single cells, it is critical to consider
transcriptional bursting due to its pervasiveness in various organisms
[27–30]. A two-state kinetic model has been proposed for gene
transcription, where genes switch between ON and OFF states with
activation rate $k_{on}$ and deactivation rate $k_{off}$. When a gene is in the
ON state, DNA is transcribed to RNA at rate s, while RNA decays at
rate d. A Poisson–Beta stochastic model for transcriptional bursting
was proposed by Kepler and Elston [31]:

\[
Y \sim \mathrm{Poisson}(sp), \qquad p \sim \mathrm{Beta}(k_{on}, k_{off}),
\]

where Y is the number of RNA molecules and p is the fraction of
time that the gene spends in the active state. Note that the decay
rate d is set to 1 since only the stationary distribution is observed
[32]. This Poisson–Beta model is straightforward to fit, and its
parameters correspond to biologically meaningful quantities:
burst size as $s/k_{off}$ and burst frequency as $k_{on}$.
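The Poisson–Beta model above is easy to simulate, which helps build intuition for how burst frequency and burst size shape the count distribution. The following R sketch draws counts from the model and compares the sample mean with the theoretical mean s·kon/(kon + koff). The function name `rpoisbeta` and the parameter values are hypothetical choices for illustration.

```r
# Draw n counts from the Poisson-Beta model:
#   p ~ Beta(kon, koff),  Y | p ~ Poisson(s * p)
# Hypothetical helper, not a SCALE function.
rpoisbeta <- function(n, kon, koff, s) {
  p <- rbeta(n, shape1 = kon, shape2 = koff)   # fraction of time in ON state
  rpois(n, lambda = s * p)
}

set.seed(1)
kon <- 0.5; koff <- 2; s <- 100          # hypothetical kinetic parameters
y <- rpoisbeta(1e5, kon, koff, s)

mean(y)                                   # sample mean
s * kon / (kon + koff)                    # theoretical mean
mean(y == 0)                              # fraction of cells with no transcripts
c(burst_frequency = kon, burst_size = s / koff)
```

With a small kon, most cells carry few or no transcripts while a minority carry many, which is the bursty pattern the model is meant to capture.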
After gene classification, SCALE proceeds to infer allele-specific
bursting parameters for biallelic bursty genes (see Note 4) using a
hierarchical model:

\[
Y^A_{cg} \sim \mathrm{Poisson}\!\left(\phi_c s^A_g p^A_{cg}\right), \qquad
Y^B_{cg} \sim \mathrm{Poisson}\!\left(\phi_c s^B_g p^B_{cg}\right),
\]
\[
p^A_{cg} \sim \mathrm{Beta}\!\left(k^A_{on,g}, k^A_{off,g}\right), \qquad
p^B_{cg} \sim \mathrm{Beta}\!\left(k^B_{on,g}, k^B_{off,g}\right),
\]

where $Y^A_{cg}$ and $Y^B_{cg}$ are the true ASE values for gene g in cell c. Note
that the two Poisson–Beta distributions have gene- and
allele-specific bursting parameters and share the same cell size factor $\phi_c$,
since cell size has been shown to affect burst size [33]. When spike-ins
are available, cell size can be estimated by the ratio of the total
number of endogenous RNA reads over the total number of
spike-in reads [34]. Moreover, users can input the cell size factors $\phi_c$ if
they are experimentally measured (see Note 5 for details).
Since $Y^A_{cg}$ and $Y^B_{cg}$ are not directly observable while the observed
ASE levels $Q^A_{cg}$ and $Q^B_{cg}$ are confounded with technical bias, we use a
novel “histogram-repiling” method to derive the distribution of $Y_{cg}$
from the observed distribution of $Q_{cg}$ for each gene. The allele-specific
parameters $\{s^A, s^B, k^A_{on}, k^B_{on}, k^A_{off}, k^B_{off}\}$ are then estimated
using the moment estimator methods.
For real dataset analysis, SCALE’s function allelic_kinetics
returns an object allelic.kinetics.obj, which contains the estimated
bursting parameters. A pdf plot is generated by default and is shown
in Fig. 3, where each dot corresponds to a gene and genes off the
diagonal show differential bursting kinetics between the two
alleles.
[Figure 3: two scatter plots of allele B versus allele A estimates, one
dot per gene. (A) Burst frequency: log(konA) on the x-axis and
log(konB) on the y-axis, r = 0.864. (B) Burst size: log(sA/koffA) on
the x-axis and log(sB/koffB) on the y-axis, r = 0.801.]

Fig. 3 Allele-specific transcriptional kinetics of 7486 bursty genes from
122 blastocyst cells. (a) Burst frequency of the two alleles has a correlation of
0.864, where 485 genes show significant difference between the two alleles
after FDR control. (b) Burst size of the two alleles has a correlation of 0.801,
where 49 genes show significant difference between the two alleles

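# Estimate bursting kinetics parameters
# Cell size factors: set to one when measured sizes are unavailable (see Note 5)
cellsize <- rep(1, ncol(alleleA))
allelic.kinetics.obj <- allelic_kinetics(alleleA = alleleA, alleleB = alleleB,
    abkt = abkt, gene.category = gene.category,
    cellsize = cellsize, pdf = TRUE)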
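A moment estimator for the plain Poisson–Beta model can be sketched as follows. It uses the first three factorial moments of the counts, which for Y | p ~ Poisson(sp) equal s^k E[p^k]. This is a simplified illustration under stated assumptions and not the estimator implemented in allelic_kinetics: it ignores the cell size factor and the adjustment for technical bias, and the function names are hypothetical.

```r
# Moment estimator for Y ~ Poisson(s * p), p ~ Beta(kon, koff), from the
# factorial moments m1 = E[Y], m2 = E[Y(Y-1)], m3 = E[Y(Y-1)(Y-2)].
# Hypothetical helpers, not SCALE functions.
poisbeta_from_moments <- function(m1, m2, m3) {
  r1 <- m1
  r2 <- m2 / m1
  r3 <- m3 / m2
  s    <- (2 * r1 * r3 - r1 * r2 - r2 * r3) / (r1 - 2 * r2 + r3)
  kon  <- 2 * r1 * (r3 - r2) / (r1 * r2 - 2 * r1 * r3 + r2 * r3)
  koff <- kon * (s / r1 - 1)             # from E[Y] = s * kon / (kon + koff)
  c(kon = kon, koff = koff, s = s, burst_size = s / koff)
}

poisbeta_moment_fit <- function(y) {
  poisbeta_from_moments(mean(y),
                        mean(y * (y - 1)),
                        mean(y * (y - 1) * (y - 2)))
}

# Simulated counts with hypothetical parameters kon = 0.5, koff = 2, s = 100
set.seed(1)
y <- rpois(1e5, lambda = 100 * rbeta(1e5, 0.5, 2))
poisbeta_moment_fit(y)
```

Moment estimates can be unstable or even negative for genes with few expressing cells, which is one reason to restrict the analysis to genes classified as biallelic bursty.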

3.7 Hypothesis Testing

Nonparametric hypothesis testing is carried out with the null
hypothesis that the two alleles share the same burst frequency and
burst size, that is, $k^A_{on} = k^B_{on}$ and $s^A/k^A_{off} = s^B/k^B_{off}$.
SCALE’s function diff_allelic_bursting has two modes: the “raw” mode,
where the bootstrap samples are drawn from the raw observed read
counts, and the “corrected” mode, where they are drawn from the adjusted
allelic read counts. Two vectors of p-values are returned, one for the
test of burst frequency and one for the test of burst size.

# Bootstrap based testing for bursting kinetics between two alleles

diff.allelic.obj <- diff_allelic_bursting(alleleA = alleleA, alleleB = alleleB,

cellsize = cellsize, gene.category =
gene.category, abkt = abkt,
allelic.kinetics.obj = allelic.kinetics.obj,
mode = 'corrected')
pval.kon <- diff.allelic.obj$pval.kon
pval.size <- diff.allelic.obj$pval.size
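Because one test is run per gene, the resulting p-values are usually adjusted for multiple testing before genes are declared significant (the caption of Fig. 3 refers to FDR control). The sketch below uses base R's p.adjust with the Benjamini–Hochberg method, which is an assumption here since the chapter does not name the FDR procedure. The example values are the rounded pval.kon entries shown for the bursty genes in Table 1, stored under a separate variable name so as not to overwrite the objects above.

```r
# Rounded burst-frequency p-values for the six bursty genes in Table 1
pval_demo <- c(Phf7 = 0.010, Zcchc4 = 0.384, Slc16a5 = 0.215,
               Gprc5a = 0.131, Atp6ap2 = 0.000, Nfx1 = 0.672)

# Benjamini-Hochberg adjustment (assumed FDR method)
qval_demo <- p.adjust(pval_demo, method = "BH")
round(qval_demo, 3)

# Genes with differential burst frequency at an FDR threshold of 0.05
names(qval_demo)[qval_demo < 0.05]
```

In a real analysis the adjustment would be applied to the full vectors pval.kon and pval.size across all tested genes, not to a handful of selected genes.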
A chi-square test is carried out to test whether the two alleles burst
independently. For genes where the proportion of cells expressing
both alleles is significantly higher than expected, we define their
bursting as coordinated; for genes where the proportion of cells
expressing only one allele is significantly higher than expected, we
define their bursting as repulsed. SCALE’s function non_ind_bursting
returns a list of two vectors, with one vector being the p-values
and the other being the nonindependent bursting type (“C” for
coordinated bursting and “R” for repulsed bursting).

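# Chi-square test for nonindependent bursting of the two alleles of a gene
# results.list is not defined earlier in this section; it is assumed here
# to be the posterior cell assignment returned by gene_classify
results.list <- gene.class.obj$results.list
non.ind.obj <- non_ind_bursting(alleleA = alleleA,
    alleleB = alleleB, gene.category = gene.category,
    results.list = results.list)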
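The idea behind the independence test can be illustrated with base R's chisq.test on a 2 × 2 table of cells cross-classified by whether each allele is expressed. This is a sketch of the general technique, not the code inside non_ind_bursting, so its p-value need not match SCALE's output. The function name `independence_test` is hypothetical, and the cell counts are those shown for gene Hvcn1 in Fig. 4.

```r
# Chi-square test of independent bursting from per-gene cell counts.
# Hypothetical helper, not a SCALE function.
independence_test <- function(nA, nB, nAB, nOff) {
  tab <- matrix(c(nAB, nA,      # allele A on:  B on, B off
                  nB,  nOff),   # allele A off: B on, B off
                nrow = 2, byrow = TRUE,
                dimnames = list(alleleA = c("on", "off"),
                                alleleB = c("on", "off")))
  test <- chisq.test(tab)
  # More double-positive cells than expected under independence means
  # coordinated bursting ("C"); fewer means repulsed bursting ("R").
  type <- if (nAB > test$expected["on", "on"]) "C" else "R"
  list(table = tab, p.value = test$p.value, type = type)
}

# Cell counts for gene Hvcn1 as shown in Fig. 4
hvcn1 <- independence_test(nA = 15, nB = 19, nAB = 13, nOff = 74)
hvcn1$table
hvcn1$type
hvcn1$p.value
```

For a 2 × 2 table with fixed margins, an excess of cells expressing both alleles is equivalent to a deficit of cells expressing only one allele, so a single comparison of observed and expected double-positive cells determines the direction.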

3.8 Plot and Output

For each gene, a pdf plot can be generated with the estimated
parameters and testing results. As an example, Fig. 4 shows the
output by SCALE for gene Hvcn1, whose two alleles share similar
burst size and frequency but burst in a coordinated fashion with a
nominal p-value less than 0.05.
# Generate a pdf output for a selected gene
i = which(genename == 'Hvcn1')
allelic_plot(alleleA = alleleA, alleleB = alleleB,
gene.class.obj = gene.class.obj,
allelic.kinetics.obj = allelic.kinetics.obj,
diff.allelic.obj = diff.allelic.obj,
non.ind.obj = non.ind.obj, i = i)

SCALE generates a final output as a tab-delimited txt file. Table 1
includes output from a selected set of genes within each gene
category as rows. The columns include: genename (gene name),
gene.category (gene category), konA (burst frequency for allele A), konB
(burst frequency for allele B), pval.kon (p-value from testing of
shared burst frequency), sizeA (burst size for allele A), sizeB (burst
size for allele B), pval.size (p-value from testing of shared burst size),
A_cell, B_cell, AB_cell, Off_cell (number of cells with posterior
assignment of A, B, AB, and Off), A_prop (proportion of cells
expressing A allele), B_prop (proportion of cells expressing B allele),
p.ind (p-value of burst independence), and non.ind.type (direction
of nonindependent bursting as coordinated or repulsed).

# Output all results from SCALE to a txt file

SCALE.output <- output_table(alleleA = alleleA, alleleB = alleleB,
gene.class.obj = gene.class.obj,
allelic.kinetics.obj = allelic.kinetics.obj,
diff.allelic.obj = diff.allelic.obj,
non.ind.obj = non.ind.obj)
At the whole-genome level, we identified a significant number
of genes whose allele-specific bursting differs in burst
frequency but not in burst size. Our findings provide evidence that
allelic differences in the expression of bursty genes are achieved
through differential modulation of burst frequency rather than burst size.
Previous studies have shown that the kinetic parameter that varies the
most (along the cell cycle [33], between different genes [35], between
different growth conditions [36], or under regulation by a
transcription factor [37]) is the probabilistic rate of switching to the
active state, kon, while the rates of gene inactivation koff and of
transcription s vary much less.

[Figure 4: SCALE output for gene Hvcn1. Bar plot of adjusted reads per
cell for the A allele and the B allele. Gene category: Biallelic.bursty.
Number of cells: A 15; AB 13; B 19; Off 74. konA = 0.08, sizeA = 232.911;
konB = 0.091, sizeB = 263.37. Test of shared burst frequency:
pval = 0.6983. Test of shared burst size: pval = 0.7402. Test of
independent bursting: pval = 0.0062.]

Fig. 4 SCALE output for a bursty gene Hvcn1. Bar plot shows the adjusted allelic
coverage across cells. Number of cells expressing A allele, B allele, both alleles,
and neither allele is reported, together with the inferred allelic bursting kinetics.
P-values from testing of shared bursting kinetics and nonindependent bursting
are also returned

4 Notes

1. SCALE uses external spike-ins to estimate the parameters
associated with technical noise but does not require spike-ins in
every cell in the experiment. As long as some cells from the same
batch have spike-ins, they can be used to model and capture the
dropout events, as well as the amplification and sequencing bias
during the library preparation and sequencing steps. When
spike-ins are not readily available, imputation-based methods
such as SAVER [16] and scImpute [17] can be adopted to
recover the true underlying expression distribution.
Table 1
Selected gene output from SCALE

Genename  Gene.category        konA    konB   pval.kon  sizeA   sizeB    pval.size  A_cell  B_cell  AB_cell  Off_cell  A.prop  B.prop  pval.ind  non.ind.type
Phf7      Biallelic.bursty     0.0629  0.012  0.010     64.39   156.36   0.149      18      8       1        95        0.16    0.07    0.701     R
Zcchc4    Biallelic.bursty     0.0459  0.066  0.384     136.78  78.31    0.308      14      16      1        91        0.12    0.14    0.385     R
Slc16a5   Biallelic.bursty     0.0138  0.030  0.215     405.56  361.02   0.861      9       13      0        100       0.07    0.11    0.281     R
Gprc5a    Biallelic.bursty     0.5515  0.810  0.131     578.29  644.77   0.783      12      21      80       5         0.78    0.86    0.428     C
Atp6ap2   Biallelic.bursty     1.0221  0.031  0.000     141.82  1135.68  0.000      90      2       13       17        0.84    0.12    0.798     C
Nfx1      Biallelic.bursty     0.3342  0.377  0.672     121.29  233.68   0.236      13      31      48       24        0.53    0.68    0.010     C
Rpl4      Biallelic.nonbursty  –       –      –         –       –        –          0       0       122      0         1.00    1.00    –         –
Exosc8    Biallelic.nonbursty  –       –      –         –       –        –          2       4       116      0         0.97    0.98    –         –
Arhgap1   Biallelic.nonbursty  –       –      –         –       –        –          9       4       108      0         0.97    0.93    –         –
Gla       MonoA                –       –      –         –       –        –          13      0       0        109       0.11    0.00    –         –
Agbl2     MonoA                –       –      –         –       –        –          2       0       0        120       0.02    0.00    –         –
Espn      MonoA                –       –      –         –       –        –          1       0       0        121       0.01    0.00    –         –
Rasd2     MonoB                –       –      –         –       –        –          0       7       0        115       0.00    0.06    –         –
Spata2L   MonoB                –       –      –         –       –        –          0       4       0        118       0.00    0.03    –         –
Kcnip2    MonoB                –       –      –         –       –        –          0       2       0        120       0.00    0.02    –         –
Gm101     Silent               –       –      –         –       –        –          0       0       0        122       0.00    0.00    –         –
Polr2l    Silent               –       –      –         –       –        –          0       0       0        122       0.00    0.00    –         –
Actb      Silent               –       –      –         –       –        –          0       0       0        122       0.00    0.00    –         –

The final output of SCALE is a tab-delimited text file. The columns include: gene name, gene category, burst frequency A, burst frequency B, p-value of shared burst frequency, burst
size A, burst size B, p-value of shared burst size, number of cells with posterior assignment of A, B, AB, and null, proportion of cells expressing A allele, proportion of cells expressing B
allele, p-value of burst independence, and direction of nonindependent bursting (C for coordinated bursting and R for repulsed bursting)


2. SCALE needs to be applied to a homogeneous cell population,
where the same bursting kinetics are shared across all cells.
Possible heterogeneity due to, for example, cell subgroups,
lineages, or donor effects can bias the downstream
analysis. We find that an excessive number of significant genes
showing coordinated bursting between the two alleles can be
indicative of heterogeneity within the cell population, which
should then be further stratified. Therefore, it is recommended
that users adopt dimensionality reduction and clustering
methods (e.g., PCA, t-SNE [22], ZIFA [23], SC3 [24], SIMLR
[25], and GiniClust2 [26]) before applying SCALE. The input
data for SCALE should be cluster-specific allelic read counts
across all genes in all cells of the cluster.
3. Sometimes there can be a “poor” fit of κ and τ due to low
sequencing depth (small α) as well as low amplification
efficiency (small β) in the library preparation and sequencing
procedure. This results in $\alpha Y^{\beta}$ being much smaller than
Y and sometimes much smaller than 1 for spike-ins with
low concentrations and endogenous genes with low
expression. In this case, whether there is dropout or not (modeled
by κ and τ), the observed expression will be zero from Poisson
sampling when sequencing is carried out. As such, κ and τ are
not statistically identifiable. Empirical evidence shows that
this issue does not affect the downstream analysis of bursting
kinetics, since the lowly expressed transcripts resulting in zero
read counts will be modeled and captured, whether the zeros are due to
Poisson sampling or dropout.
4. Through the empirical Bayes framework, a gene is categorized
by SCALE as monoallelically expressed if only one allele is
expressed in a nonzero proportion of cells. While SCALE is
focused on detecting differential bursting kinetics between the
two alleles, monoallelic expression is an extreme case where one
allele is completely off, which corresponds to that allele having
a burst frequency of zero. We do not infer bursting
kinetics for these monoallelically expressed genes since it is
neither possible nor necessary to do so. Nevertheless, the gene
categorization results return genes with monoallelic expression (see
Table 1 for an example), which can be used for downstream
analysis.
5. It is nontrivial to reliably estimate the cell size factor in silico.
Cell size can be inferred from the expression levels of
housekeeping genes such as Gapdh [33] or from the ratio of
the total number of reads mapped to the endogenous RNA to
the total number of reads mapped to the spike-ins [34]. These
two measurements are not on the same scale and should not be
combined. In real dataset analysis when experimentally
measured cell sizes are not available, we recommend first
estimating the bursting kinetics with all cell size factors set to one.
After this, one can try again using the in silico estimated cell
size factors. The correlation between the bursting kinetics of
the two alleles can serve as a sanity check for good data quality
and accurate cell size estimation: on the genome-wide scale,
they should be correlated with a decent correlation coefficient.
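The spike-in-based cell size estimate described in Note 5 can be sketched in R as follows. The function name `estimate_cellsize` is hypothetical and not part of SCALE. Rescaling the ratios to have mean one is an assumption made here so that the result is directly comparable with the default of all cell size factors set to one; the toy matrices are illustrative.

```r
# In silico cell size factor: total endogenous reads divided by total
# spike-in reads, per cell. Hypothetical helper, not a SCALE function.
estimate_cellsize <- function(endo_reads, spikein_reads) {
  ratio <- colSums(endo_reads) / colSums(spikein_reads)
  ratio / mean(ratio)   # assumption: rescale so the factors average to one
}

# Toy example with two genes, two spike-ins, and two cells
endo  <- matrix(c(100, 200, 300, 600), nrow = 2)   # genes x cells
spike <- matrix(c(50, 50, 50, 50), nrow = 2)       # spike-ins x cells
estimate_cellsize(endo, spike)

# With the objects from Subheading 3.2, the analogous call would be:
# cellsize <- estimate_cellsize(alleleA + alleleB,
#                               spikein_input[, -(1:2)])
```

Comparing the allele A versus allele B correlation of the estimated kinetics with and without these factors is the sanity check suggested in Note 5.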

Acknowledgment

This work was supported by NIH grant CA142538 and a
developmental award from the UNC Lineberger Comprehensive Cancer
Center 2017T109 (to YJ). We thank Dr. Nancy R Zhang and
Dr. Mingyao Li for helpful comments and suggestions.

References
1. Buckland PR (2004) Allele-specific gene expression differences in humans. Hum Mol Genet 13(2):R255–R260. https://doi.org/10.1093/hmg/ddh227
2. Deng Q, Ramskold D, Reinius B, Sandberg R (2014) Single-cell RNA-seq reveals dynamic, random monoallelic gene expression in mammalian cells. Science 343:193–196. https://doi.org/10.1126/science.1245316
3. Reinius B, Sandberg R (2015) Random monoallelic expression of autosomal genes: stochastic transcription and allele-level regulation. Nat Rev Genet 16:653–664. https://doi.org/10.1038/nrg3888
4. Reinius B, Mold JE, Ramskold D, Deng Q, Johnsson P, Michaelsson J, Frisen J, Sandberg R (2016) Analysis of allelic expression patterns in clonal somatic cells by single-cell RNA-seq. Nat Genet 48:1430–1435. https://doi.org/10.1038/ng.3678
5. Skelly DA, Johansson M, Madeoy J, Wakefield J, Akey JM (2011) A powerful and flexible statistical framework for testing hypotheses of allele-specific gene expression from RNA-seq data. Genome Res 21:1728–1737. https://doi.org/10.1101/gr.119784.110
6. Leon-Novelo LG, McIntyre LM, Fear JM, Graze RM (2014) A flexible Bayesian method for detecting allelic imbalance in RNA-seq data. BMC Genomics 15:920. https://doi.org/10.1186/1471-2164-15-920
7. Jiang Y, Zhang NR, Li M (2017) SCALE: modeling allele-specific gene expression by single-cell RNA sequencing. Genome Biol 18(1):74. https://doi.org/10.1186/s13059-017-1200-8
8. Benitez JA, Cheng S, Deng Q (2017) Revealing allele-specific gene expression by single-cell transcriptomics. Int J Biochem Cell Biol 90:155–160. https://doi.org/10.1016/j.biocel.2017.05.029
9. Kim JK, Marioni JC (2013) Inferring the kinetics of stochastic gene expression from single-cell RNA-sequencing data. Genome Biol 14:R7. https://doi.org/10.1186/gb-2013-14-1-r7
10. Levesque MJ, Ginart P, Wei Y, Raj A (2013) Visualizing SNVs to quantify allele-specific expression in single cells. Nat Methods 10:865–867. https://doi.org/10.1038/nmeth.2589
11. Goetz JJ, Trimarchi JM (2012) Transcriptome sequencing of single cells with smart-Seq. Nat Biotechnol 30(8):763–765. https://doi.org/10.1038/nbt.2325
12. Picelli S, Faridani OR, Bjorklund AK, Winberg G, Sagasser S, Sandberg R (2014) Full-length RNA-seq from single cells using smart-seq2. Nat Protoc 9(1):171–181. https://doi.org/10.1038/nprot.2014.006
13. Macosko EZ, Basu A, Satija R, Nemesh J, Shekhar K, Goldman M, Tirosh I, Bialas AR, Kamitaki N, Martersteck EM, Trombetta JJ, Weitz DA, Sanes JR, Shalek AK, Regev A, McCarroll SA (2015) Highly parallel genome-wide expression profiling of individual cells using Nanoliter droplets. Cell 161(5):1202–1214. https://doi.org/10.1016/j.cell.2015.05.002
14. Zheng GX, Terry JM, Belgrader P, Ryvkin P, Bent ZW, Wilson R, Ziraldo SB, Wheeler TD, McDermott GP, Zhu J, Gregory MT, Shuga J, Montesclaros L, Underwood JG, Masquelier DA, Nishimura SY, Schnall-Levin M, Wyatt PW, Hindson CM, Bharadwaj R, Wong A, Ness KD, Beppu LW, Deeg HJ, McFarland C, Loeb KR, Valente WJ, Ericson NG, Stevens

EA, Radich JP, Mikkelsen TS, Hindson BJ, Reik W, Barahona M, Green AR, Hemberg M
Bielas JH (2017) Massively parallel digital tran- (2017) SC3: consensus clustering of single-cell
scriptional profiling of single cells. Nat Com- RNA-seq data. Nat Methods 14(5):483–486.
mun 8:14049. https://doi.org/10.1038/ https://doi.org/10.1038/nmeth.4236
ncomms14049 25. Wang B, Zhu J, Pierson E, Ramazzotti D, Bat-
15. Jia C, Hu Y, Kelly D, Kim J, Li M, Zhang NR zoglou S (2017) Visualization and analysis of
(2017) Accounting for technical noise in dif- single-cell RNA-seq data by kernel-based simi-
ferential expression analysis of single-cell RNA larity learning. Nat Methods 14(4):414–416.
sequencing data. Nucleic Acids Res 45 https://doi.org/10.1038/nmeth.4207
(19):10978–10988. https://doi.org/10. 26. Tsoucas D, Yuan GC (2018) GiniClust2: a
1093/nar/gkx754 cluster-aware, weighted ensemble clustering
16. Huang M, Wang J, Torre E, Dueck H, method for cell-type detection. Genome Biol
Shaffer S, Bonasio R, Murray JI, Raj A, Li M, 19(1):58. https://doi.org/10.1186/s13059-
Zhang NR (2018) SAVER: gene expression 018-1431-3
recovery for single-cell RNA sequencing. Nat 27. Chong S, Chen C, Ge H, Xie XS (2014) Mech-
Methods 15(7):539–542. https://doi.org/10. anism of transcriptional bursting in bacteria.
1038/s41592-018-0033-z Cell 158:314–326. https://doi.org/10.
17. Li WV, Li JJ (2018) An accurate and robust 1016/j.cell.2014.05.038
imputation method scImpute for single-cell 28. Blake WJ, Balazsi G, Kohanski MA, Isaacs FJ,
RNA-seq data. Nat Commun 9(1):997. Murphy KF, Kuang Y, Cantor CR, Walt DR,
https://doi.org/10.1038/s41467-018- Collins JJ (2006) Phenotypic consequences of
03405-7 promoter-mediated transcriptional noise. Mol
18. Li H, Durbin R (2009) Fast and accurate short Cell 24:853–865. https://doi.org/10.1016/j.
read alignment with burrows-wheeler trans- molcel.2006.11.003
form. Bioinformatics 25(14):1754–1760. 29. Fukaya T, Lim B, Levine M (2016) Enhancer
https://doi.org/10.1093/bioinformatics/ control of transcriptional bursting. Cell
btp324 166:358–368. https://doi.org/10.1016/j.
19. Dobin A, Davis CA, Schlesinger F, Drenkow J, cell.2016.05.025
Zaleski C, Jha S, Batut P, Chaisson M, Gingeras 30. Suter DM, Molina N, Gatfield D, Schneider K,
TR (2013) STAR: ultrafast universal RNA-seq Schibler U, Naef F (2011) Mammalian genes
aligner. Bioinformatics 29(1):15–21. https:// are transcribed with widely different bursting
doi.org/10.1093/bioinformatics/bts635 kinetics. Science 332:472–474. https://doi.
20. Li H, Handsaker B, Wysoker A, Fennell T, org/10.1126/science.1198817
Ruan J, Homer N, Marth G, Abecasis G, 31. Kepler TB, Elston TC (2001) Stochasticity in
Durbin R, Genome Project Data Processing transcriptional regulation: origins, conse-
Subgroup (2009) The sequence alignment/ quences, and mathematical representations.
map format and SAMtools. Bioinformatics 25 Biophys J 81:3116–3136. https://doi.org/
(16):2078–2079. https://doi.org/10.1093/ 10.1016/s0006-3495(01)75949-8
bioinformatics/btp352 32. Stegle O, Teichmann SA, Marioni JC (2015)
21. DePristo MA, Banks E, Poplin R, Garimella Computational and analytical challenges in
KV, Maguire JR, Hartl C, Philippakis AA, del single-cell transcriptomics. Nat Rev Genet
Angel G, Rivas MA, Hanna M, McKenna A, 16:133–145. https://doi.org/10.1038/
Fennell TJ, Kernytsky AM, Sivachenko AY, nrg3833
Cibulskis K, Gabriel SB, Altshuler D, Daly MJ 33. Padovan-Merhar O, Nair GP, Biaesch AG,
(2011) A framework for variation discovery Mayer A, Scarfone S, Foley SW, Wu AR,
and genotyping using next-generation DNA Churchman LS, Singh A, Raj A (2015) Single
sequencing data. Nat Genet 43(5):491–498. mammalian cells compensate for differences in
https://doi.org/10.1038/ng.806 cellular volume and DNA copy number
22. van der Maaten L, Hinton G (2008) Visualiz- through independent global transcriptional
ing data using t-SNE. J Mach Learn Res mechanisms. Mol Cell 58:339–352. https://
9:2579–2605 doi.org/10.1016/j.molcel.2015.03.005
23. Pierson E, Yau C (2015) ZIFA: dimensionality 34. Vallejos CA, Marioni JC, Richardson S (2015)
reduction for zero-inflated single-cell gene BASiCS: Bayesian analysis of single-cell
expression analysis. Genome Biol 16:241. sequencing data. PLoS Comput Biol 11:
https://doi.org/10.1186/s13059-015-0805-z e1004333. https://doi.org/10.1371/journal.
24. Kiselev VY, Kirschner K, Schaub MT, pcbi.1004333
Andrews T, Yiu A, Chandra T, Natarajan KN,
174 Meichen Dong and Yuchao Jiang

35. Skinner SO, Xu H, Nagarkar-Jaiswal S, Freire embryonic stem cells. Sci Rep 4:7125.
PR, Zwaka TP, Golding I (2016) Single-cell https://doi.org/10.1038/srep07125
analysis of transcription kinetics across the cell 37. Xu H, Sepulveda LA, Figard L, Sokac AM,
cycle. Elife 5:e12175. https://doi.org/10. Golding I (2015) Combining protein and
7554/eLife.12175 mRNA quantification to decipher transcrip-
36. Ochiai H, Sugawara T, Sakuma T, Yamamoto T tional regulation. Nat Methods 12:739–742.
(2014) Stochastic promoter activation affects https://doi.org/10.1038/nmeth.3446
Nanog expression variability in mouse
Chapter 12

Using BRIE to Detect and Analyze Splicing Isoforms in scRNA-Seq Data
Yuanhua Huang and Guido Sanguinetti

Abstract
Single-cell RNA-seq (scRNA-seq) provides a comprehensive measurement of stochasticity in transcription,
but the limitations of the technology have prevented its application to dissect variability in RNA processing
events such as splicing. In this chapter, we review the challenges in splicing isoform quantification in
scRNA-seq data and discuss BRIE (Bayesian regression for isoform estimation), a recently proposed
Bayesian hierarchical model which resolves these problems by learning an informative prior distribution
from sequence features. We illustrate the usage of BRIE with a case study on 130 mouse cells during
gastrulation.

Key words Alternative splicing, Isoform quantification, Single-cell RNA-seq, Bayesian model

1 Introduction

Next generation sequencing (NGS) technologies have had a huge impact on
our understanding of biology, shedding unprecedented light on the role
of genomic, epigenomic, and transcriptomic processes within cellular
function. Recently, efficient RNA amplification techniques have been
coupled with NGS to yield transcriptome sequencing protocols that can
measure the abundance of transcripts within single cells, known as
single-cell RNA-seq (scRNA-seq) [1]. scRNA-seq has provided
unprecedented opportunities to investigate the stochasticity of
transcription and its importance in cellular diversity. Groundbreaking
applications of scRNA-seq include discovering novel cell types, e.g.,
rare intestinal cell types [2] and distinct immune cell subtypes in
health and disease [3, 4]; reconstructing the developmental trajectory
of embryonic cells [5, 6] or the fate of immune cells [7, 8]; and
dissecting the heterogeneity of tumor cells [9] or the complexity of the
tumor ecosystem [10]. This promising technology is being applied in a
growing number of studies across a wide range of problems in cell
biology.

Guo-Cheng Yuan (ed.), Computational Methods for Single-Cell Data Analysis, Methods in Molecular Biology, vol. 1935,
https://doi.org/10.1007/978-1-4939-9057-3_12, © Springer Science+Business Media, LLC, part of Springer Nature 2019


The vast majority of studies employing scRNA-seq probe individual cells
at the gene level, aggregating all reads mapping to an annotated genic
region into a quantification of expression for that gene. However, the
biologically relevant output of transcription is the transcript, not the
gene; multiple levels of regulation and processing, including RNA
splicing, capping, and polyadenylation, are inevitably missed by a
gene-level quantification. In this chapter we focus on splicing, the
processing step where intronic regions are removed and exonic regions
are joined to form the mature mRNA. Splicing is a key regulatory step in
gene expression in most eukaryotes. In humans, over 90% of genes have
multiple splicing isoforms [11], and the right isoform often needs to be
expressed in a specific condition. Many studies have shown that
alternative splicing plays an important role in cell differentiation,
tissue identity, and organ development [12]. For example, over 300 genes
express different splicing isoforms in the mouse embryonic and adult
brain, including some major regulatory genes of the nervous system [13].
Mis-splicing also often leads to serious diseases, which may be caused
by genetic mutations of splicing regulatory sequences, or by the
dysregulation of the spliceosome or of specific splicing factors [14].
Despite its importance, splicing quantification has seldom been
attempted in scRNA-seq studies. Technical challenges, including
the minute amounts of starting material, low cDNA conversion
efficiency, and uneven transcript coverage, mean that scRNA-seq
data sets can exhibit substantial technical variation [15].
Additionally, the high level of dropout and the generally lower coverage
pose significant problems for standard splicing quantification methods
developed for bulk RNA-seq.
Nevertheless, the last few years have seen a few studies attempting to
stretch single-cell sequencing technologies to gain insights into RNA
splicing. Faigenbloom et al. [16] studied the variability of cassette
exon inclusion across single cells, and highlighted potential regulatory
signals: the evolutionary conservation of flanking introns and a
correlation with expression level. This study, however, did not attempt
to quantify alternative splicing proportions within individual cells,
merely highlighting the evidence from the data for different isoform
production between different cells. Song and colleagues [17] used an
exceptionally high (for single cells) sequencing depth to show that
splicing can be an independent feature of cell identity in neuron cell
differentiation, and also provided a software suite for analyzing
splicing in single cells. More recent papers proposed the use of RNA
splicing dynamics to predict the future state of individual cells [18],
and to investigate the role of epigenetic variability in splicing
control [19].
In this chapter, we describe BRIE, the first and one of the very few
methods for splicing quantification from scRNA-seq [20]. BRIE is a
Bayesian method which integrates sequence features with scRNA-seq data
to obtain more confident quantifications of splicing ratios at low
coverage. The method can be used both for splicing quantification and to
detect differential splicing between different cells or groups of cells.
Additionally, the integrative aspects of BRIE can be used to quantify
the effects of covariates (such as sequence or epigenetic status) on
splicing, as done in [19]. BRIE is implemented as a stand-alone Python
package which is freely available at https://pypi.org/project/brie.
The rest of this chapter is organized as follows. In the Materials
section, we briefly describe the software package and the prepro-
cessed splicing annotation data. Then in the Methods section, we
provide a brief and self-contained introduction to the methodolog-
ical foundations of BRIE, followed by a tutorial introduction to its
usage on a case study on 130 mouse cells during gastrulation [5]. A
few notes (see Notes 1–4) on using BRIE are also provided.

2 Materials

The main software we need for this chapter is BRIE, which is implemented
as open source software in Python, compatible with both Python2 and
Python3. BRIE was mainly designed for exon-skipping events: we provide
preprocessed splicing event annotations for human and mouse, together
with their corresponding sequence features. In addition, BRIE has a
separate toolkit package, BRIE-kit, to annotate and filter exon-skipping
events and extract sequence features for any species. Both the software
and the preprocessed annotation data sets are freely available at the
following links:
1. BRIE package: https://pypi.org/project/brie.
2. BRIE-kit for preprocessing: https://pypi.org/project/briekit.
3. Preprocessed annotation datasets: https://sourceforge.net/projects/brie-rna/files/annotation/.
In the next section, we will go through a case study on 130 mouse cells
during gastrulation, taken from a published scRNA-seq dataset produced
using the SMART-seq2 protocol [5]. The original dataset consists of 1205
cells; in this tutorial we use a subset of cells for convenience, so
that all the analyses described in this chapter can be carried out on a
single small server within a few hours. For bigger datasets with
thousands of cells, a cluster with multiple nodes can be very useful to
reduce the computational time. As the input files are specified in each
subsection, this example pipeline is also applicable to any other
customized dataset. All command lines are included as bash files in
BRIE's GitHub repository:
https://github.com/huangyh09/brie/tree/master/example/gastrulation.

3 Methods

3.1 High-Level Model Description

Low coverage is a common problem for scRNA-seq data, and it poses a
particular statistical challenge for splicing quantification: short
sequenced reads can often be aligned to multiple isoforms and are
therefore not immediately informative on splicing status, so that
accurate estimates normally require high coverage. Recent work has shown
that improved predictions at lower coverage can be achieved with
Bayesian methods by incorporating informative prior distributions within
the probabilistic splicing quantification algorithms, leveraging either
aspects of the experimental design, such as time series [21], or
auxiliary data sets such as measurements of PolII localization [22]. In
addition, it has also been demonstrated that splicing (in bulk cells)
can be accurately predicted from sequence-derived features [23]. This
suggests that overall patterns of read distribution may be associated
with specific sequence words, so that one may be able to construct
sequence-based informative prior distributions that may be learned
directly from data. This is the idea at the core of BRIE (Bayesian
Regression for Isoform Estimation), a statistical model that achieves
extremely high sensitivity at low coverage by the use of informative
priors learned directly from data via a (latent) regression model. The
regression model couples the task of splicing quantification across
different genes, allowing a statistical transfer of information from
well-covered genes to poorly covered genes, and achieving considerable
robustness to noise at low coverage.
Figure 1 presents a schematic illustration of BRIE (see Meth-
ods in the original paper for precise definitions and details of the
estimation procedure). The bottom part of the figure represents the
standard mixture model approach to isoform estimation intro-
duced in MISO [24] and Cufflinks [25] and also used in many
recent methods, e.g., Kallisto [26], where reads are associated with
a latent, multinomially distributed isoform identity variable. This
module takes as input the scRNA-seq data (aligned reads) and
forms the likelihood of our Bayesian model. The multinomial
identity variables are then assigned an informative prior in the
form of a regression model (top half of Fig. 1), where the prior
probability of inclusion ratios is regressed against sequence-derived
features. Crucially, the regression parameters are shared across all
genes and can be learned across multiple single cells, thus regular-
izing the task and enabling robust predictions in the face of very
low coverage. While the class of regression models we employ is
different from the neural networks of [23], they still provide a
highly accurate supervised learning predictor of splicing on bulk
RNA-seq data sets.
[Figure 1: schematic of the BRIE model. Top: 735 sequence features of an
exon-skipping event with exons labeled C1, A, C2 (716 k-mers, 8 length
features, 7 conservation features, 4 splice-site motif features) enter a
Bayesian regression that defines the prior P(ψ|W, X). Bottom: RNA-seq
reads enter the mixture model that defines the likelihood P(R|ψ). Prior
and likelihood combine into the posterior P(ψ|R, W, X).]

Fig. 1 A cartoon of the BRIE method for isoform estimation. BRIE combines a likelihood computed from
RNA-seq data (bottom part) and an informative prior distribution learned from 735 sequence-derived features
(top)

This architecture effectively enables BRIE to simultaneously trade off
two tasks: in the absence of data (drop-out genes), the informative
prior provides a way of imputing missing data, while for highly covered
genes the likelihood term dominates, returning a mixture-model
quantification. For intermediate levels of coverage, BRIE uses Bayes's
theorem to trade off imputation and quantification. A full description
of the statistical model, as well as empirical results on both real and
simulated data sets, is available in the original paper [20]. In the
rest of this section, we focus on giving a thorough introduction to the
usage of the associated software package.
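
The trade-off between prior and likelihood can be made concrete with a
toy calculation. The following Python sketch is not BRIE's
implementation: it assumes a simplified binomial likelihood over reads
that unambiguously support inclusion or exclusion, a logit-normal prior
whose mean stands in for the prediction of the sequence-feature
regression, and a grid approximation in place of MCMC. The function name
and the prior standard deviation of 1.5 are illustrative assumptions.

```python
import math

def posterior_mean_psi(n_incl, n_excl, prior_mean_logit, prior_sd=1.5, grid_size=999):
    """Posterior mean of the inclusion ratio psi on a regular grid.

    Simplified likelihood (an assumption): only reads that unambiguously
    support inclusion (n_incl) or exclusion (n_excl) are counted.
    Prior: logit(psi) ~ Normal(prior_mean_logit, prior_sd**2).
    """
    grid, log_post = [], []
    for i in range(1, grid_size + 1):
        psi = i / (grid_size + 1.0)
        logit = math.log(psi / (1.0 - psi))
        # Logit-normal prior density, including the Jacobian 1 / (psi * (1 - psi)).
        log_prior = (-0.5 * ((logit - prior_mean_logit) / prior_sd) ** 2
                     - math.log(psi * (1.0 - psi)))
        # Binomial log-likelihood of the informative reads.
        log_lik = n_incl * math.log(psi) + n_excl * math.log(1.0 - psi)
        grid.append(psi)
        log_post.append(log_prior + log_lik)
    top = max(log_post)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in log_post]
    return sum(p * w for p, w in zip(grid, weights)) / sum(weights)

# No reads (pure imputation), a few reads, and many reads, all with a
# prior that favors exclusion (logit mean of -2).
for n_incl, n_excl in [(0, 0), (3, 0), (270, 30)]:
    print(n_incl, n_excl, round(posterior_mean_psi(n_incl, n_excl, -2.0), 3))
```

With no reads the estimate is the prior mean, with a few reads it moves
part of the way towards the observed fraction, and with many reads it is
dominated by the data.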

3.2 Data Preparation with BRIE-Kit

In order to quantify exon-skipping splicing with BRIE, we need a set of
good quality annotated splicing events, together with their
corresponding sequence features. Besides using our preprocessed
annotations for human and mouse, BRIE-kit, a separate Python package
developed under the Python2 environment, can prepare the annotation for
more species through three functions: (1) briekit-event for extracting
the exon-skipping events from a full gene annotation, (2)
briekit-event-filter for filtering out poor quality splicing events, and
(3) briekit-factor for defining and extracting the sequence features. To
perform these preprocessing steps for the mouse gastrulation data
(stored in the DATA_DIR directory), the following command lines would be
used:
1. Generate exon-skipping events from gene annotation.

$ briekit-event -a gencode.vM17.annotation.gtf.gz -o $DATA_DIR/AS_events

2. Filter splicing events.

$ briekit-event-filter -a AS_events/SE.gff3.gz \
    --anno_ref gencode.vM17.annotation.gtf.gz \
    -r GRCm38.p6.genome.fa

3. Extract the sequence features.

$ briekit-factor -a AS_events/SE.filtered.gff3.gz -r GRCm38.p6.genome.fa \
    -c mm10.60way.phastCons.bw -o mouse_features.csv -p 10 \
    --bigWigSummary ./bigWigSummary

This retrieves 12,115 initial exon-skipping events, of which 5763 pass
the filtering conditions listed in the original BRIE paper [20]. The
filtered annotation is written to SE.filtered.gff3.gz, the default
output file. The same set of 735 sequence features used in the original
paper [20] for learning informative priors is written to the output file
mouse_features.csv.gz. These two files will be used for splicing
quantification in BRIE.
More detailed instructions, including downloading genome references and
dependent software, can be found in the BRIE-kit GitHub wiki page
(https://github.com/huangyh09/briekit/wiki) and in the bash file for
this example
(https://github.com/huangyh09/briekit/blob/master/example/anno_mouse.sh).

3.3 Splicing Quantification Using BRIE

Once we have obtained a set of exon-skipping events and their
corresponding sequence features, we can use BRIE to quantify their
inclusion probabilities from scRNA-seq data. First, we need to download
the raw scRNA-seq reads in fastq format and align them to the genome
with HISAT [27], STAR [28], or another splice-aware aligner. For each
cell, this yields a sorted and indexed alignment file in bam/sam format,
e.g., cell_n.sorted.bam.
Now, we can run BRIE to quantify the annotated exon-skipping events on
the aligned reads file with the following command line:

$ brie -a SE.filtered.gff3.gz -s cell_n.sorted.bam \
    -f mouse_features.csv.gz -o $OUT_DIR -p 10

BRIE then generates three output files in OUT_DIR: fractions.tsv for
exon inclusion fractions, samples.csv.gz for all MCMC samples, and
weights.tsv for the learned weights of the sequence predictors.
All the above procedures work on an individual cell. With these
settings, it takes around 4 min for BRIE to quantify one cell on a
single server. Accordingly, given 10 nodes in parallel, 1300 cells can
be processed in around 8 h. Naturally, the running time will depend on
the specifications of the server, as well as on the number of splicing
events quantified.
In addition, BRIE provides a simple way to pool all reads from a group
of cells as an ensemble sample, which can be useful either for
estimating a single homogeneous splicing value in a group of cells or
for learning common regression weights for the input features. The
homogeneous splicing value in a group can be used for detecting
differential splicing between groups, and the regression weights can be
used as shared weights to (re-)quantify at single-cell resolution. To
input multiple cells, list all cells separated by commas in the above
command line; to set shared weights for an individual cell, pass the
weights file with -w, as follows:

$ brie -a SE.filtered.gtf -f mouse_features.csv.gz \
    -s cell1.sorted.bam,cell2.sorted.bam,cell3.sorted.bam \
    -o $OUT_DIR/cell_1to4 -p 10
$ brie -a SE.filtered.gtf -f mouse_features.csv.gz \
    -s cell1.sorted.bam -w $OUT_DIR/cell_1to4/weights.tsv \
    -o $OUT_DIR/cell1_fixedW -p 10
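
For many cells, the pooled run and the per-cell runs with shared weights
can be scripted. The following Python sketch only assembles the command
lines; it uses the flags that appear in this section (-a, -s, -f, -o,
-p, -w) and nothing else. The function names, the file names, and the
output directory layout (a "pooled" folder and one "<cell>_fixedW"
folder per cell) are hypothetical choices made for illustration.

```python
import shlex

def brie_command(bams, anno, features, out_dir, nproc=10, weights=None):
    """Build a `brie` argument list using only the flags shown in this section."""
    cmd = ["brie", "-a", anno, "-f", features,
           "-s", ",".join(bams),      # several BAM files are pooled with commas
           "-o", out_dir, "-p", str(nproc)]
    if weights is not None:
        cmd += ["-w", weights]        # reuse regression weights learned elsewhere
    return cmd

def two_pass_plan(bams, anno, features, out_root, nproc=10):
    """Pooled run first, then one run per cell with the pooled weights."""
    pooled_dir = out_root + "/pooled"  # hypothetical directory layout
    plan = [brie_command(bams, anno, features, pooled_dir, nproc)]
    for bam in bams:
        # Derive a cell label from the file name, e.g. cell1.sorted.bam -> cell1.
        cell = bam.rsplit("/", 1)[-1].split(".")[0]
        plan.append(brie_command([bam], anno, features,
                                 out_root + "/" + cell + "_fixedW", nproc,
                                 weights=pooled_dir + "/weights.tsv"))
    return plan

plan = two_pass_plan(["cell1.sorted.bam", "cell2.sorted.bam", "cell3.sorted.bam"],
                     "SE.filtered.gtf", "mouse_features.csv.gz", "brie_out")
for cmd in plan:
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Each argument list can be passed to subprocess.run(cmd, check=True) or
written to a job script. The pooled run has to finish before the
per-cell runs start, because they read its weights.tsv.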

3.4 Differential Splicing Analysis with BRIE

Besides splicing quantification, a major focus of many studies is to
detect differential splicing between individual cells or between cell
clusters. BRIE uses a Bayes factor to detect differential splicing
between any pair of cells (or cell clusters), and the command line to
run it is as follows:

# a few cells in command line
$ fileList=cell1/samples.csv.gz,cell2/samples.csv.gz,cell3/samples.csv.gz,cell4/samples.csv.gz
$ brie-diff -i $fileList -o cell1_cell4.diff.tsv

brie-diff then outputs two tsv files. The first file, named in the
format xxx.diff.tsv, contains all pairs of cells (or cell groups) that
passed the Bayes factor threshold. The second, named in the format
xxx.diff.rank.tsv, ranks the splicing events by the number of
differentially spliced cell pairs, which can be used to select splicing
markers for cell type identity. The use of brie-diff to detect
differential splicing events between two groups of cells is discussed in
Note 1.
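
To give some intuition for a Bayes factor computed from posterior
samples, the following Python sketch uses the Savage-Dickey density
ratio, a standard way of comparing the prior and posterior density of
the difference in inclusion ratio at zero. It is a sketch under stated
assumptions and does not claim to reproduce the internals of brie-diff:
the samples are synthetic, the prior is taken to be uniform rather than
BRIE's learned prior, the densities are estimated with a simple window
around zero, and all function names are hypothetical.

```python
import random

def density_at_zero(deltas, half_width=0.05):
    """Window estimate of the density of `deltas` at 0 (0.5 pseudo-count avoids zeros)."""
    hits = sum(1 for d in deltas if abs(d) < half_width)
    return (hits + 0.5) / (len(deltas) * 2.0 * half_width)

def savage_dickey_bf(post_a, post_b, prior_a, prior_b, half_width=0.05):
    """Bayes factor in favor of a difference: prior over posterior density at delta = 0."""
    post_delta = [a - b for a, b in zip(post_a, post_b)]
    prior_delta = [a - b for a, b in zip(prior_a, prior_b)]
    return density_at_zero(prior_delta, half_width) / density_at_zero(post_delta, half_width)

rng = random.Random(0)
n = 20000
# Synthetic prior samples of psi for two cells (uniform prior is an assumption).
prior_a = [rng.random() for _ in range(n)]
prior_b = [rng.random() for _ in range(n)]
# Two cells with similar posteriors, and two cells with clearly different ones.
same_a = [rng.betavariate(50, 50) for _ in range(n)]
same_b = [rng.betavariate(50, 50) for _ in range(n)]
diff_a = [rng.betavariate(90, 10) for _ in range(n)]
diff_b = [rng.betavariate(10, 90) for _ in range(n)]

bf_same = savage_dickey_bf(same_a, same_b, prior_a, prior_b)
bf_diff = savage_dickey_bf(diff_a, diff_b, prior_a, prior_b)
print("similar cells:", bf_same)
print("different cells:", bf_diff)
```

A Bayes factor below 1 favors equal splicing in the two cells, while a
large value favors differential splicing.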

3.5 Plotting Results and Extracting Statistics from the BRIE Output

Once a set of highly variable splicing events has been detected, it is
very useful to visualize the raw reads and the quantification results.
Sashimi plots [29], originally developed by Yarden Katz and colleagues,
are a visually effective way to display read densities and junction
reads. In BRIE, we adapted the sashimi plot to visualize the results,
including the read density and the prior and posterior distributions of
the splicing fraction. Sashimi_plot is included in the BRIE-kit GitHub
repository (https://github.com/huangyh09/briekit/tree/master/sashimi_plot)
as a self-contained folder, and can be executed as follows:

SASHIMI=~/MyGit/briekit/sashimi_plot/sashimi_plot.py
python $SASHIMI --plot-event ENSMUSG00000027478.AS2 $GFF_DIR sashimi_setting.txt \
    --output-dir $PLOT_DIR --plot-label DNMT3B-exon2.pdf \
    --plot-title DNMT3B-exon2

Figure 2 shows an example of the type of plot that will be produced.
Note that sashimi_plot is not part of the standard BRIE-kit package, but
the above folder includes all scripts and demos needed to generate the
sashimi plot. A few dependent Python packages are needed, e.g., misopy
and matplotlib, and the script is only compatible with Python2, not
Python3.

Fig. 2 Visualization of splicing quantification with sashimi plot and histogram. An example exon-skipping event
in DNMT3B in 3 mouse cells at 6.5 days and 3 cells at 7.75 days. The left panel is a sashimi plot of the read
density and the number of junction reads. The right panel shows the prior distribution as a blue curve and a
histogram of the posterior distribution in black, both learned by BRIE. For the histogram, the red line is the
mean and the dashed lines mark the 95% confidence interval

4 Notes

BRIE provides an effective solution to the problem of splicing
quantification in single cells, by learning a sequence-based informative
prior that makes results robust to low coverage. In this chapter, we
provided a tutorial introduction to the usage of the BRIE Python
package, aimed at the practicing bioinformatician who may want to
replicate splicing analyses on scRNA-seq data. While we believe BRIE is
a worthwhile addition to the armory of scRNA-seq statistical methods,
some considerations as to its usage and possible extension are in order.
1. For detecting differential splicing between two cell groups in
Subheading 3.4, there are two options. The first option is to pool reads
from all cells within a group and quantify splicing as one ensemble
sample, which can be done simply by listing all cells with comma
separation in brie. brie-diff can then detect the differential splicing
events between any pair of ensemble samples. This option assumes a
homogeneous splicing level across the group, and it requires an
additional splicing quantification besides the routine cell-level
quantifications. Alternatively, one may use BRIE to quantify splicing in
individual cells in each group, and then use a standard hypothesis test,
e.g., Wilcoxon, to detect significant changes in the point estimates
between two cell populations. This procedure, however, requires a
sufficient number of cells in each population to achieve statistical
power, and it ignores the uncertainty in the estimates. The original
paper [20] used the first option; we are currently developing a
hierarchical model to perform differential splicing analyses while
considering the uncertainty and heterogeneity within a cluster.
2. As opposed to gene-level expression quantification, splicing
quantification involves substantial uncertainty when very few reads are
observed, as is often the case in scRNA-seq data. Therefore, it is not
sensible to directly use the events-by-cell matrix of the mean estimates
from BRIE in downstream analyses. A large fraction of splicing events
has no or very few reads, so the quantification is mainly based on
imputation, whose distribution is usually broad; a single mean value may
then not be representative of the estimated distribution. There are two
options for using the estimates. The first option is to filter out some
events in some cells and treat them as missing values. The filtering
condition could be based on the posterior distribution itself (e.g., a
95% confidence interval narrower than 0.25) to select only
high-confidence events, or on coverage thresholds (e.g., more than 5
reads or 3 junction reads). Filters based on confidence intervals in
particular can be very useful, as the imputations in some events are
highly informative and should be used even when there are fewer than 3
reads. The second option is to use both the mean and the variance of the
estimate. Though this option accounts for the uncertainty in the
estimate, it is not compatible with standard gene-level analysis, and
requires some skills in statistical modeling.
3. Multiplexed scRNA-seq experiments are increasingly popular for
Drop-seq, SMART-seq, and other protocols, thanks to their lower cost,
lower batch effects, and ability to detect doublets. However, BRIE is
not able to demultiplex these samples; it relies on demultiplexed cells,
namely each cell having its own aligned reads. One option for
demultiplexing is cardelino (https://github.com/PMBio/cardelino), which
includes a pipeline for SMART-seq/SMART-seq2 data.
4. Many software packages introduce new features or even new
input/output formats when upgraded. Therefore, it is often important to
specify the environment. Within the Python platform, many tools
developed in Python2 may fail to run in a Python3 environment. The BRIE
developers try their best to keep it compatible with both Python2 and
Python3. Should BRIE fail in one environment, e.g., Python3, the
simplest strategy is to try the other environment, which can be set up
very easily with conda
(https://conda.io/docs/user-guide/tasks/manage-environments.html).
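
The filtering described in Note 2 and the group comparison described in
Note 1 can be sketched in a few lines of Python. This is a minimal
illustration under stated assumptions, not part of BRIE: the estimates
are given as hypothetical (mean, lower, upper) tuples rather than parsed
from BRIE's output files, whose column layout is not described here, and
the rank-sum test is a plain implementation of the Wilcoxon rank-sum
(Mann-Whitney U) test using the normal approximation with tie correction
and no continuity correction.

```python
import math

def mask_uncertain(estimates, max_ci_width=0.25):
    """Keep psi means whose 95% interval is narrower than max_ci_width, else None."""
    return [mean if (high - low) < max_ci_width else None
            for mean, low, high in estimates]

def rank_sum_test(x, y):
    """Two-sided rank-sum test; returns (U statistic of x, approximate p-value)."""
    pooled = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    n1, n2, n = len(x), len(y), len(x) + len(y)
    rank_sum_x, tie_term, i = 0.0, 0.0, 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1                      # [i, j) is a block of tied values
        midrank = (i + 1 + j) / 2.0     # average of the ranks i+1 .. j
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        tie_term += (j - i) ** 3 - (j - i)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    var = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1.0)))
    z = (u - n1 * n2 / 2.0) / math.sqrt(var)   # undefined if all values are tied
    return u, math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical (mean, lower, upper) estimates of one cell for three events.
print(mask_uncertain([(0.52, 0.10, 0.93), (0.91, 0.84, 0.97), (0.08, 0.02, 0.21)]))

# Hypothetical high-confidence psi means of one event in two cell groups.
group_a = [0.10, 0.15, 0.20, 0.12, 0.18, 0.11, 0.16, 0.14]
group_b = [0.80, 0.85, 0.90, 0.82, 0.88, 0.81, 0.86, 0.84]
u_stat, p_value = rank_sum_test(group_a, group_b)
print(u_stat, p_value)
```

Masked entries (None) should be dropped before testing, which is one
reason why this route needs enough cells per group.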

References
1. Grün D, van Oudenaarden A (2015) Design and analysis of single-cell sequencing experiments. Cell 163:799–810
2. Grün D, Lyubimova A, Kester L et al (2015) Single-cell messenger RNA sequencing reveals rare intestinal cell types. Nature 525:251–255
3. Gaublomme JT, Yosef N, Lee Y et al (2015) Single-cell genomics unveils critical regulators of Th17 cell pathogenicity. Cell 163:1400–1412
4. Papalexi E, Satija R (2018) Single-cell RNA sequencing to explore immune cell heterogeneity. Nat Rev Immunol 18:35
5. Scialdone A, Tanaka Y, Jawaid W et al (2016) Resolving early mesoderm diversification through single-cell expression profiling. Nature 535:289–293. https://doi.org/10.1038/nature18633
6. Wagner DE, Weinreb C, Collins ZM et al (2018) Single-cell mapping of gene expression landscapes and lineage in the zebrafish embryo. Science 80:eaar4362
7. Stubbington MJT, Lönnberg T, Proserpio V et al (2016) T cell fate and clonality inference from single-cell transcriptomes. Nat Methods 13:329
8. Lönnberg T, Svensson V, James KR et al (2017) Single-cell RNA-seq and computational analysis using temporal mixture modelling resolves Th1/Tfh fate bifurcation in malaria. Sci Immunol 2(9):eaal2192
9. Patel AP, Tirosh I, Trombetta JJ et al (2014) Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science 344:1396–1401
10. Tirosh I, Izar B, Prakadan SM et al (2016) Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science 352:189–196. https://doi.org/10.1126/science.aad0501
11. Wang ET, Sandberg R, Luo S et al (2008) Alternative isoform regulation in human tissue transcriptomes. Nature 456:470–476
12. Baralle FE, Giudice J (2017) Alternative splicing as a regulator of development and tissue identity. Nat Rev Mol Cell Biol 18:437
13. Dillman AA, Hauser DN, Gibbs JR et al (2013) mRNA expression, splicing and editing in the embryonic and adult mouse cerebral cortex. Nat Neurosci 16:499
14. Scotti MM, Swanson MS (2016) RNA mis-splicing in disease. Nat Rev Genet 17:19
15. Ziegenhain C, Vieth B, Parekh S et al (2017) Comparative analysis of single-cell RNA sequencing methods. Mol Cell 65:631–643
16. Faigenbloom L, Rubinstein ND, Kloog Y et al (2015) Regulation of alternative splicing at the single-cell level. Mol Syst Biol 11:845
17. Song Y, Botvinnik OB, Lovci MT et al (2017) Single-cell alternative splicing analysis with expedition reveals splicing dynamics during neuron differentiation. Mol Cell 67:148–161
18. La Manno G, Soldatov R, Hochgerner H et al (2018) RNA velocity of single cells. Nature 560(7719):494
19. Linker SM, Urban L, Clark S et al (2018) Combined single cell profiling of expression and DNA methylation reveals splicing regulation and heterogeneity. bioRxiv:328138
20. Huang Y, Sanguinetti G (2017) BRIE: transcriptome-wide splicing quantification in single cells. Genome Biol 18:123. https://doi.org/10.1101/098517
21. Huang Y, Sanguinetti G (2016) Statistical modeling of isoform splicing dynamics from RNA-seq time series data. Bioinformatics 32:2965–2972
22. Liu P, Sanalkumar R, Bresnick EH et al (2016) Integrative analysis with ChIP-seq advances the limits of transcript quantification from RNA-seq. Genome Res 26:1124–1133
23. Xiong HY, Alipanahi B, Lee LJ et al (2015) The human splicing code reveals new insights into the genetic determinants of disease. Science 347:1254806
24. Katz Y, Wang ET, Airoldi EM, Burge CB (2010) Analysis and design of RNA sequencing experiments for identifying isoform regulation. Nat Methods 7:1009–1015
25. Trapnell C, Williams BA, Pertea G et al (2010) Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol 28:511–515
26. Bray NL, Pimentel H, Melsted P, Pachter L (2016) Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol 34:525
27. Kim D, Langmead B, Salzberg SL (2015) HISAT: a fast spliced aligner with low memory requirements. Nat Methods 12:357
28. Dobin A, Davis CA, Schlesinger F et al (2013) STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29:15–21
29. Katz Y, Wang ET, Silterra J et al (2015) Quantitative visualization of alternative exon expression from RNA-seq data. Bioinformatics 31:2400–2402
Chapter 13

Preprocessing and Computational Analysis of Single-Cell Epigenomic Datasets
Caleb Lareau, Divy Kangeyan, and Martin J. Aryee

Abstract
Recent technological developments have enabled the characterization of the epigenetic landscape of single
cells across a range of tissues in normal and diseased states and under various biological and chemical
perturbations. While analysis of these profiles resembles methods from single-cell transcriptomic studies,
unique challenges are associated with bioinformatics processing of single-cell epigenetic data, including a
much larger (10–1,000×) feature set and significantly greater sparsity, requiring customized solutions.
Here, we discuss the essentials of the computational methodology required for analyzing common single-
cell epigenomic measurements for DNA methylation using bisulfite sequencing and open chromatin using
ATAC-seq.

Key words Epigenetics, Bioinformatics, Single-cell, DNA methylation, Bisulfite sequencing, ATAC-seq

1 Introduction

Single-cell epigenetic data analysis pipelines are designed to resolve
differences between single cells in heterogeneous populations,
including variability between and within classically defined cell
types in DNA methylation using single-cell DNA methylation
(scDNAm) profiles (Fig. 1) and chromatin accessibility using
single-cell ATAC-seq (scATAC-seq) profiles (Fig. 2). Axes of variation
driving cell-to-cell epigenomic differences include cell type,
cell state, and cell cycle, similar to transcriptomic differences.
Figure 3 outlines the most common technology platforms
associated with generating single-cell epigenomic data. While
downstream analyses for each platform are largely the same,
technology-specific bioinformatics solutions at the front-end of
analyses are often required. For example, demultiplexing individual
cells from a pool often requires custom scripts to efficiently parse
out barcode sequences.

Guo-Cheng Yuan (ed.), Computational Methods for Single-Cell Data Analysis, Methods in Molecular Biology, vol. 1935,
https://doi.org/10.1007/978-1-4939-9057-3_13, © Springer Science+Business Media, LLC, part of Springer Nature 2019


[Figure 1: schematic of CpGs across an enhancer, promoter, and gene, contrasting heterogeneous tissue with cell-type-variable DNA methylation]

Fig. 1 Overview of single-cell DNA methylation data. For a heterogeneous group of cells (middle), variability in
CpG methylation can occur both between classically defined cell types (left) and within cell types (right)

[Figure 2: genome browser view of chr19:36.10-36.19 Mb showing bulk ATAC-seq and summed scATAC-seq tracks above the gene track, per-cell fragment counts (scale 0 to 2) for hundreds of single cells, and chromatin accessibility peaks]

Fig. 2 Overview of single-cell ATAC data. The sum of single cells' chromatin accessibility profiles resembles
that of a bulk experiment (top), though each cell has a varied open chromatin epigenome. For a diploid
organism, the number of accessible chromatin counts does not generally exceed 2

[Figure 3: three panels labeled Plate or array-based, Split/Pool-based, and Droplet-based]

Fig. 3 Modes of single-cell epigenomic assays. For scATAC and DNA methylation, popular approaches include
array-based and split-pool-based methods, though methods built on advances in microfluidic technologies will be increasingly used

[Figure 4: workflow (left) from raw reads (.fastq) through 1. trim adapters and 2. align reads to aligned reads (.bam), then 3. filter mitochondria, 4. mark PCR duplicates, and 5. overlap with peaks to a counts matrix (.txt); pie chart (right) dividing one cell's reads among poorly mapped and trimmed reads, reads mapped to mitochondrial DNA, PCR duplicates, other unique nuclear reads, and reads in peaks; slice sizes shown: 10%, 5%, 20%, 30%, 35%]

Fig. 4 Overview of scATAC-seq data processing. Steps associated with data processing are shown on the left
while a representative data allocation for a given cell is shown on the right. Only about 10% of the raw
data for a given scATAC-seq cell is used in downstream analyses, such as cell clustering

[Figure 5: workflow from raw reads (.fastq) through 1. trim reads and 2. align reads to aligned reads (.bam), then 3. deduplication (if necessary) and 4. per-base methylation estimation to methylation values (.hdf5)]

Fig. 5 Overview of single-cell DNA methylation data processing. Steps associated with data processing are shown on the left

Here, we describe the essentials of processing single-cell epigenomic
assays, starting with raw data preprocessing, which typically
involves demultiplexing pooled library FASTQ read files, the
per-cell quantification of epigenomic state, and the collation of
data from multiple cells into a single data structure (Figs. 4 and 5).
Due to the large fraction of unobserved loci in single-cell epigenomic
data, further downstream analysis is typically preceded by a
step that aggregates data across multiple sites to reduce sparsity.
This might involve biologically motivated approaches such as averaging
DNA methylation values across predefined sets of promoters
or enhancers, or computing transcription factor accessibility
metrics by combining scATAC-seq data from all sites for a given
factor (Fig. 6). Notably, only a fraction of the raw data is
used in downstream applications, such as ~10% of reads for a
given cell from scATAC-seq library preparation (Fig. 4).
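
To get a feel for how much of a cell's data reaches downstream analysis, one can compute the fraction of its fragments that overlap called peaks. The following is a minimal sketch and not part of the published protocol: the function name frac_in_peaks is hypothetical, both inputs are assumed to be tab-delimited BED files, and the linear scan over peaks is only practical for small examples (a tool such as bedtools would be the usual choice at scale). It reports the fraction of the fragments supplied, whereas the ~10% figure in Fig. 4 is relative to all raw reads.

```shell
# frac_in_peaks FRAGMENTS.bed PEAKS.bed
# Prints the fraction of fragments overlapping at least one peak.
# Assumes 0-based, half-open BED coordinates in columns 1-3.
frac_in_peaks() {
  awk -F'\t' '
    NR == FNR {                       # first file read: the peaks
      n++; pc[n] = $1; ps[n] = $2 + 0; pe[n] = $3 + 0
      next
    }
    {                                 # second file read: the fragments
      total++
      for (i = 1; i <= n; i++)
        if ($1 == pc[i] && ($2 + 0) < pe[i] && ($3 + 0) > ps[i]) { hit++; break }
    }
    END { if (total > 0) printf "%.3f\n", hit / total; else print "NA" }
  ' "$2" "$1"
}

# Example call (file names are placeholders):
# frac_in_peaks cellA_filt_frags.bed peaks.bed
```

A low value for a cell may simply reflect shallow sequencing, so this number is best read alongside the total fragment count.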

[Figure 6: summed single-cell accessibility track over Peak 1, Peak 2, and Peak 3, followed by the matrices below]

Predicted binding per peak (X = predicted binding, O = none):

        Peak 1  Peak 2  Peak 3
TF 1    O       X       O
TF 2    X       O       X
TF 3    O       O       X
TF 4    X       O       X

Counts matrix X (peaks by cells):

        Cell 1  Cell 2  Cell 3  Cell 4  ...
Peak 1  0       1       0       1
Peak 2  1       2       0       0
Peak 3  0       1       1       1
...

Annotation matrix M (peaks by TFs):

        TF 1    TF 2    TF 3    TF 4    ...
Peak 1  0       1       0       1
Peak 2  1       0       0       0
Peak 3  0       1       1       1
...

Deviation matrix Z (cells by TFs):

        TF 1    TF 2    TF 3    TF 4    ...
Cell 1  0.6     1       -0.2    0
Cell 2  1       -0.2    1.1     3.2
Cell 3  0.1     0       -1.4    1.1
...

X' * M = Z

Fig. 6 Biologically motivated scATAC-seq dimensionality reduction. A binary matrix of motifs by peaks is
multiplied by an integer-value matrix of peaks by cells to yield a reduced set of features (real-valued motif
deviation scores) per cell. These can be utilized for downstream analyses, including visualization with t-SNE
and cell type identification

2 Materials

The protocols described here rely on free, open-source bioinformatics tools. All tools can be installed on a Mac or Linux computer.
The single-cell ATAC-seq protocol uses:
1. chromVAR (https://bioconductor.org/packages/release/bioc/html/chromVAR.html) [1].
2. Picard tools (https://broadinstitute.github.io/picard/).
The single-cell DNA methylation protocol uses:
3. Bismark (https://www.bioinformatics.babraham.ac.uk/projects/bismark/) [2].
Shared tools useful for both protocols include the following
software:
4. bcl2fastq (https://support.illumina.com/sequencing/sequencing_software/bcl2fastq-conversion-software.html).
5. Bowtie2 (http://bowtie-bio.sourceforge.net/bowtie2/) [3].
6. TrimGalore (https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/).

7. R (https://cran.r-project.org).
8. Bioconductor (https://www.bioconductor.org) [4].
9. bedtools (http://bedtools.readthedocs.io/en/latest/index.html) [5].
10. bedGraphToBigWig (http://hgdownload.soe.ucsc.edu/admin/exe/).

3 Methods

The protocols described below share the first step (Subheading 3.1)
and are then described separately for single-cell ATAC-seq
(Subheading 3.2) and single-cell DNA methylation (Subheading
3.3). Graphical overviews of these steps for each method are shown
in Figs. 4 and 5.

3.1 Demultiplexing Single-Cell Libraries

1. Single-cell experimental protocols typically involve combining individually barcoded single-cell libraries into pools for high-throughput sequencing. The first step in preprocessing is thus to demultiplex reads into per-cell FASTQ files. The pooling is often done in a two-step process (see Note 1). First, a set of single cells is individually tagged with barcodes and batched together into a Level 1 pool. This step often involves custom barcodes. Next, such a pool can be tagged with an additional barcode and pooled with others, creating a Level 2 pool. This second level of pooling typically uses standard Illumina sample barcodes. The two-step pooling procedure thus requires two sequential demultiplexing steps.

2. The Illumina barcoded Level 2 pools can be demultiplexed using the standard Illumina bcl2fastq tool to generate individual fastq files for each Level 1 pool, e.g.:

bcl2fastq --runfolder-dir NextSeq_Data --output-dir fastq_directory --sample-sheet Samplesheet.csv

3. Next, we demultiplex the Level 1 pool fastqs to generate a fastq (pair) for each single cell. This step typically requires protocol-specific tools, since the exact single-cell barcode structure varies from protocol to protocol. For data generated from the Fluidigm C1 array, these indices are typically encoded in the P5 and P7 adaptor sequences. Otherwise, barcode sequences have to be parsed out in-line with custom scripts.
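
As an illustration of the kind of custom script that step 3 refers to, the sketch below splits a Level 1 pool FASTQ into per-cell files. It rests on assumptions that are not from any specific protocol: the cell barcode is taken to be the first 8 bases of each read, a whitelist file lists one accepted barcode per line, the input is uncompressed, and the function name demux_inline is hypothetical. Real barcode layouts differ, so the offsets would need to be adapted.

```shell
# demux_inline POOL.fastq WHITELIST.txt OUTDIR [BARCODE_LENGTH]
# Writes OUTDIR/<barcode>.fastq for every whitelisted barcode that is seen,
# with the barcode (and its quality values) removed from each read.
demux_inline() {
  mkdir -p "$3" || return 1
  awk -v len="${4:-8}" -v out="$3" '
    NR == FNR { ok[$1] = 1; next }             # first file: accepted barcodes
    FNR % 4 == 1 { header = $0; next }         # read name line
    FNR % 4 == 2 { bc = substr($0, 1, len)     # in-line barcode
                   seq = substr($0, len + 1); next }
    FNR % 4 == 3 { plus = $0; next }           # separator line
    {                                          # quality line: emit the record
      if (bc in ok) {
        f = out "/" bc ".fastq"
        printf "%s\n%s\n%s\n%s\n", header, seq, plus, substr($0, len + 1) >> f
        close(f)                               # avoid holding many files open
      }
    }
  ' "$2" "$1"
}

# Example call (file names are placeholders):
# demux_inline level1_pool.fastq barcodes.txt per_cell_fastq 8
```

Because records are appended, the output directory should be empty before a run. Reads whose barcode is not on the whitelist are dropped; a production script would usually also tolerate one mismatch and handle the second read of a pair.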

3.2 scATAC-Seq

1. Trim adapter sequences from fastq files using tools such as TrimGalore (see Note 2).

trim_galore --paired read1.fastq.gz read2.fastq.gz


2. Align trimmed reads with bowtie2 and pipe to samtools to create a sorted .bam file (see Note 3).

bowtie2 -X 2000 -1 cellA_1.trim.fastq.gz -2 cellA_2.trim.fastq.gz -x /path/to/bowtie2/reference | samtools view -bS - | samtools sort - -o cellA.st.bam
samtools index cellA.st.bam

3. Filter reads for unique alignments using Picard tools.

java -jar MarkDuplicates.jar I=cellA.st.bam O=cellA.dedup.bam M=cellA.marked_report.txt VALIDATION_STRINGENCY=SILENT REMOVE_DUPLICATES=true

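4. Shift aligned reads based on Tn5 activity (see Note 4). The -bedpe mode of bedtools bamtobed expects mates to be adjacent, so the input (reads.bam here, the deduplicated reads for cell A) should be sorted by read name.

bedtools bamtobed -i reads.bam -bedpe | awk -v OFS="\t" '{if($9=="+"){print $1,$2+4,$6+4}else if($9=="-"){print $1,$2-5,$6-5}}' > cellA_fragments.bed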

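5. Remove reads overlapping known blacklist regions (see Note 5). The -A flag discards a fragment entirely if any part of it overlaps a blacklisted region, rather than trimming the overlapping portion.

bedtools subtract -A -a cellA_fragments.bed -b blacklist.bed > cellA_filt_frags.bed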

6. Call accessibility peaks using MACS2 [6] on shifted reads over the union of cell fragments (see Notes 6 and 7).

macs2 callpeak -t allCells_fragments.bed --nomodel --shift -100 --extsize 200 --keep-dup all --call-summits -n combinedCells

7. Assemble the peak-by-cell counts matrix using an interactive R session with chromVAR (see Note 8).

library(chromVAR)
library(motifmatchr)
library(BSgenome.Hsapiens.UCSC.hg19) # change based on reference genome
library(SummarizedExperiment)
library(chromVARmotifs)

peakfile <- "data/peaks.bed"
peaks <- getPeaks(peakfile)

# Import fragments from per-cell BED files
bedfiles <- list.files("data/cell_fragments", full.names = TRUE)
raw_counts <- getCounts(bedfiles, peaks, paired = TRUE, by_rg = FALSE,
                        format = "bed",
                        colData = DataFrame(source = bedfiles))

8. Perform scATAC-seq-specific dimensionality reduction (see Notes 9 and 10).

# Initialize parallel processing
library(BiocParallel)
register(MulticoreParam(2)) # adjust according to your machine

# Filter low quality samples and peaks (same thresholds as in Note 11)
counts_filtered <- filterSamples(raw_counts, min_depth = 500,
                                 min_in_peaks = 0.15, shiny = FALSE)
counts <- filterPeaks(counts_filtered)

# Get GC content/peak; get motifs from chromVARmotifs package
counts <- addGCBias(counts, genome = BSgenome.Hsapiens.UCSC.hg19)
data("human_pwms_v2") # also mouse_pwms_v2
motif_ix <- matchMotifs(human_pwms_v2, counts,
                        genome = BSgenome.Hsapiens.UCSC.hg19)
dim(motif_ix)

# Compute deviations; typically most time consuming step
dev <- computeDeviations(object = counts, annotations = motif_ix)

9. Visualize scATAC-seq cells in a two-dimensional space by supplying deviation scores to the t-distributed Stochastic Neighbor Embedding (t-SNE) algorithm (see Note 11).
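
The quantity behind the deviation scores of step 8 (see Note 9) can be traced by hand on the toy matrices of Fig. 6. The sketch below is an illustration only: it computes the raw deviation (observed minus expected, divided by expected) and omits the background-peak correction that chromVAR applies, so its output is not what computeDeviations returns. The function name raw_deviations and the headerless, tab-delimited input layout are assumptions made for this example.

```shell
# raw_deviations X.tsv M.tsv
# X.tsv: fragment counts, one row per peak and one column per cell.
# M.tsv: binary annotations, one row per peak and one column per motif.
# Prints a cells-by-motifs matrix of raw deviations.
raw_deviations() {
  awk -F'\t' '
    NR == FNR {                                  # first file: counts X
      ncell = NF; npeak = FNR
      for (j = 1; j <= NF; j++) {
        x[FNR, j] = $j
        cell[j] += $j; peak[FNR] += $j; total += $j
      }
      next
    }
    {                                            # second file: annotations M
      nmotif = NF
      for (k = 1; k <= NF; k++) m[FNR, k] = $k
    }
    END {
      for (j = 1; j <= ncell; j++) {
        line = ""
        for (k = 1; k <= nmotif; k++) {
          obs = 0; expd = 0
          for (i = 1; i <= npeak; i++) {
            e = peak[i] / total * cell[j]        # expected fragments for peak i, cell j
            obs  += x[i, j] * m[i, k]
            expd += e * m[i, k]
          }
          if (expd > 0) s = sprintf("%.3f", (obs - expd) / expd); else s = "NA"
          line = line (k > 1 ? "\t" : "") s
        }
        print line
      }
    }
  ' "$1" "$2"
}
```

For the count and annotation matrices of Fig. 6, row 2, column 2 of the output (Cell 2, TF 2) is -0.200: two fragments are observed in the two TF 2 peaks against 2.5 expected.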

3.3 scMethylation

1. Trim sequencing adapter sequences from fastq files. It is also advisable to trim off poor-quality bases at the ends of reads, which can lead to alignment errors and/or incorrect methylation calls. For example, to perform both quality and adapter trimming in a single step one can use TrimGalore (see Note 12 for RRBS libraries):

trim_galore --paired read1.fastq.gz read2.fastq.gz

2. Align trimmed reads with a bisulfite-aware aligner (see Note 13).

bismark --genome Bisulfite_Genomes/grch38 -1 fastq1.trim.fastq.gz -2 fastq2.trim.fastq.gz

3. Remove PCR duplicates.

This step should be performed only for whole-genome
bisulfite sequencing (WGBS) libraries. It is not advised for
reduced representation (RRBS) or targeted capture libraries
(see Note 14).
deduplicate_bismark --bam cell_1.bam

4. Quantify methylation for each cell at each CpG position (see Note 15).
Per-CpG methylation can be estimated as the fraction of reads representing methylated cytosines over the total number of reads at the position. We report the number of methylated reads (M), unmethylated reads (U) and the ratio M/(M+U).

bismark_methylation_extractor --gzip --bedGraph --genome_folder bismark_index cell_1.bam

By default, this command will omit CpGs with no coverage. Forcing the inclusion of all CpGs (even those with zero coverage) by adding the --cytosine_report option can be helpful for simplifying and speeding up the downstream step of combining data across cells (see Note 17).
5. (Optional) Generate a bigwig track for visualization in a
genome browser.
Individual cell methylation values can be visualized in IGV
or another browser using a bigwig track file (see Note 16).
gunzip cell_1.bedGraph.gz
bedtools sort -i cell_1.bedGraph > cell_1.sorted.bedGraph
bedGraphToBigWig cell_1.sorted.bedGraph grch38_chrom_sizes.txt cell_1.bw

6. Collate data from individual cells.
Workflow steps 1–5 are performed on a per-cell basis and can therefore be parallelized. For downstream analysis, it is often convenient to have data from all cells represented in a single data structure, such as a matrix of methylation estimates with one row per CpG (or genomic region), and one column per cell. This collation can be performed in R/Bioconductor using the bsseq (https://bioconductor.org/packages/release/bioc/html/bsseq.html) package (see Notes 18 and 19 for a basic example script). As an alternative collation method for users who prefer not to use R, one can also simply paste together the methylation estimate columns (column 4) from the individual cell *.cov.gz files from step 4 (e.g., using the Unix paste command). For this to work, all files must have exactly the same CpG order (see Note 17).
7. Aggregate data across CpGs.
Unlike transcriptomic assays, single-cell DNA-based assays like scDNAm do not benefit from the amplification inherent in transcription. The number of template molecules can therefore be as low as one, leading to a very large fraction (often >90%) of missing data. As a result, analysis of individual sites (e.g., specific gene promoters) is challenging, as a given gene would only be observed in a small fraction of cells. A common approach to dealing with this sparsity involves averaging methylation values across all CpGs in a predefined feature set. These feature sets consist of genomic regions relevant to the downstream questions of interest and may include, for example, gene promoters/enhancers for each Gene Ontology category, or various classes of repetitive elements. This process transforms the highly sparse matrix from step 6 (one row per CpG, one column per cell) into a smaller, denser matrix with one row per feature set and one column per cell. This dense feature set matrix is a starting point for analyses and is amenable to standard approaches such as PCA, clustering, or identification of differentially methylated feature sets across cells.
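
The averaging described in step 7 can be sketched for a single cell as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the input is an uncompressed coverage table with the column layout read in Note 18 (chromosome, start, end, methylation percentage, methylated count, unmethylated count, with 1-based positions), the feature sets are given as a hypothetical four-column BED file (chromosome, start, end, set name; 0-based, half-open), and the function name feature_set_means is made up for this example. The linear scan over features is only suitable for small inputs.

```shell
# feature_set_means CELL.cov FEATURES.bed
# Prints: set name, mean per-CpG methylation (M/(M+U)), number of CpGs used.
feature_set_means() {
  awk -F'\t' '
    NR == FNR {                                  # first file read: feature regions
      n++; fc[n] = $1; fs[n] = $2 + 0; fe[n] = $3 + 0; fn[n] = $4
      next
    }
    ($5 + $6) > 0 {                              # covered CpGs only
      pos = $2 + 0
      ratio = $5 / ($5 + $6)
      for (i = 1; i <= n; i++)
        # count a CpG once per set even if several regions of the set contain it
        if ($1 == fc[i] && pos > fs[i] && pos <= fe[i] && last[fn[i]] != FNR) {
          last[fn[i]] = FNR
          sum[fn[i]] += ratio
          cnt[fn[i]]++
        }
    }
    END { for (s in cnt) printf "%s\t%.3f\t%d\n", s, sum[s] / cnt[s], cnt[s] }
  ' "$2" "$1" | sort
}

# Example call (file names are placeholders):
# feature_set_means cell_1.cov feature_sets.bed
```

Running this once per cell and joining the outputs by set name gives the feature-set-by-cell matrix. Feature sets with no covered CpG in a cell are absent from that cell's output and should be treated as missing rather than as zero.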

4 Notes

1. The number of cells per pool is limited either by the technology platform (the Fluidigm C1 array, for example, processes cells in batches of 96 cells), or by the need to generate sufficient sequencing depth per cell.
Most single-cell ATAC-seq libraries are processed from paired-end, dual-index sequencing. Previous studies using the Fluidigm C1 array [7] and split-pool [8] indexed cells using unique combinations of the indices from the sequencing technology.
2. The two common Nextera adapters used in ATAC-seq and bisulfite sequencing are TCGTCGGCAGCGTC and TGGTAGAGAGGGTG. These will be the most typical sequences that must be trimmed before alignment.
3. The -X 2000 flag is used to increase alignment of reads spanning 2–5 nucleosomes (the width of one nucleosome is ~147 bp), which have been reliably detected from ATAC-seq data. The default option in bowtie2 is 500. Additionally, these steps can benefit from multiple available cores through parallel execution using the -p # flag for bowtie2 and the -@ # flag in samtools, where # is an integer (2 for standard laptop computers or as high as 32 in a high-performance computing environment).
4. After processing reads for PCR duplicates and proper pairs, ATAC-seq and scATAC-seq aligned read coordinates are typically offset by +4 bp for reads aligning to the plus strand and by −5 bp for reads aligning to the minus strand, to represent the center of the transposon binding event. This bias was first reported in the original ATAC-seq paper [9]. Depending on the downstream applications of the scATAC data, applying this shift later may be desirable.
5. The standard ENCODE blacklist for hg19 can be found here: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/. An additional custom blacklist for ATAC-seq derived from mitochondrial reads mapped to the nuclear genome can be found here: https://github.com/buenrostrolab/mitoblacklist.

6. The --shift -100 --extsize 200 option centers a 200 bp window where the Tn5 binds, yielding more accurate peak accessibility coordinates. The --nomodel flag bypasses the read shifting model that was designed for ChIP-seq data. Some users also invoke the --nolambda flag.
7. A strategy for integrating peak calls across multiple cell types or batches for analysis is discussed here: http://bioconductor.org/packages/devel/bioc/vignettes/DiffBind/inst/doc/DiffBind.pdf. In short, using called accessibility summits and then padding coordinates by fixed amounts can lead to fixed-width peaks and an inclusive feature set, both of which have many desirable properties.
8. When importing data into chromVAR, it is often more convenient to have all the single-cell reads contained in a single .bam file rather than split into one file per cell, especially if the cell count exceeds 100. To achieve this, you can add a read group (RG) ID at the bowtie2 alignment step, so that cells can later be separated by read group (by_rg = TRUE in getCounts):

bowtie2 -1 cellA_1.trim.fastq.gz -2 cellA_2.trim.fastq.gz --rg-id cellA ...

9. Intuitively, this approach computes bias-corrected deviation scores for each of K motifs across a set of S samples (scATAC-seq profiles) with P peaks. A positive deviation score indicates that the cell has more accessible chromatin at the given genomic feature than expected by chance. The procedure in chromVAR requires a binarized matrix $M$ (dimension P by K), where $m_{i,k}$ is 1 if annotation k (such as a predicted transcription factor binding site) is present in peak i and 0 otherwise.

Using the matrix of fragment counts in peaks $X$ (dimension P by S), where $x_{i,j}$ represents the number of fragments from peak i in sample j, the matrix multiplication $X^T \cdot M$ yields the total number of fragments weighted by the presence of a predicted TF binding site. To compute a raw weighted accessibility deviation, we compute the expected number of fragments per peak per sample in $E$, where $e_{i,j}$ is the proportion of all fragments across all samples mapping to the specific peak multiplied by the total number of fragments in peaks for that sample:

$$e_{i,j} = \frac{\sum_{j} x_{i,j}}{\sum_{j} \sum_{i} x_{i,j}} \sum_{i} x_{i,j}$$

Analogously, $E^T \cdot M$ yields the expected number of fragments weighted by the presence of a predicted TF binding site for S samples (rows) and K annotations (columns). Using the $M$, $X$, and $E$ matrices, we then compute the raw weighted accessibility deviation matrix $Y$ for each sample j and annotation k ($y_{j,k}$) as follows:

$$y_{j,k} = \frac{\sum_{i=1}^{P} x_{i,j}\, m_{i,k} - \sum_{i=1}^{P} e_{i,j}\, m_{i,k}}{\sum_{i=1}^{P} e_{i,j}\, m_{i,k}}$$

To correct for technical confounders present in assays (differential PCR amplification or variable Tn5 tagmentation conditions), chromVAR generates a background set of peaks intrinsic to the set of epigenetic data examined. The background peak sampling procedure has been described in depth elsewhere [1, 7]. Ultimately, this procedure yields a matrix $B^{(b)}$ that encodes the background peak mapping, where $b^{(b)}_{i,j}$ is 1 if peak i has peak j as its background peak in background set b (b ∈ {1, 2, ..., 50}) and 0 otherwise. The matrices $B^{(b)} \cdot X$ and $B^{(b)} \cdot E$ thus give intermediates for the observed and expected counts, also of dimension P by S. For each background set b, sample j, and annotation k, the elements $y^{(b)}_{j,k}$ of the background weighted accessibility deviations matrix $Y^{(b)}$ are computed as follows:

$$y^{(b)}_{j,k} = \frac{\sum_{i=1}^{P} (B^{(b)} \cdot X)_{i,j}\, m_{i,k} - \sum_{i=1}^{P} (B^{(b)} \cdot E)_{i,j}\, m_{i,k}}{\sum_{i=1}^{P} (B^{(b)} \cdot E)_{i,j}\, m_{i,k}}$$

After the background deviations are computed over the 50 sets, the bias-corrected matrix $Z$ for sample j and annotation k ($z_{j,k}$) can be computed as follows:

$$z_{j,k} = \frac{y_{j,k} - \operatorname{mean}\left(y^{(b)}_{j,k}\right)}{\operatorname{sd}\left(y^{(b)}_{j,k}\right)}$$

where the mean and standard deviation of $y^{(b)}_{j,k}$ are taken over all values of b (b ∈ {1, 2, ..., 50}). This implementation uses computationally efficient matrix operations for each step and can compute pairwise annotation-by-sample enrichments in ~1 min on a standard laptop computer.
10. An extension of the chromVAR methodology (called g-chromVAR) that handles weighted or uncertain annotations, designed specifically for variants nominated by genome-wide association studies, has recently been described [10].
11. A sample Rscript used to perform steps 7–9 from .bam files (the most common workflow) is shown here:

library(chromVAR)
library(motifmatchr)
library(BSgenome.Hsapiens.UCSC.hg19) # change based on reference genome
library(SummarizedExperiment)
library(chromVARmotifs)

# Initialize parallel processing
library(BiocParallel)
register(MulticoreParam(2)) # adjust according to your machine

# Import/filter data (replace with appropriate file paths)
peakfile <- "data/peaks.bed"
peaks <- getPeaks(peakfile)
bamfiles <- list.files("data/bams", full.names = TRUE)
raw_counts <- getCounts(bamfiles, peaks, paired = TRUE, by_rg = FALSE,
                        format = "bam",
                        colData = DataFrame(source = bamfiles))

# Filter low quality samples and peaks
counts_filtered <- filterSamples(raw_counts, min_depth = 500,
                                 min_in_peaks = 0.15, shiny = FALSE)
counts <- filterPeaks(counts_filtered)

# Get GC content/peak; get motifs from chromVARmotifs package; find kmers
counts <- addGCBias(counts, genome = BSgenome.Hsapiens.UCSC.hg19)
data("human_pwms_v2") # also mouse_pwms_v2
motif_ix <- matchMotifs(human_pwms_v2, counts,
                        genome = BSgenome.Hsapiens.UCSC.hg19)
dim(motif_ix)

# Compute deviations; typically most time consuming step
dev <- computeDeviations(object = counts, annotations = motif_ix)

# Find variable motifs
variabilityAll <- computeVariability(dev)
plotVariability(variabilityAll, use_plotly = FALSE)

# Visualize single cells with a tSNE
tsne_results <- deviationsTsne(dev, threshold = 1.5, perplexity = 10,
                               shiny = FALSE)
tsne_plots <- plotDeviationsTsne(dev, tsne_results, annotation = "CTCF",
                                 sample_column = "source",
                                 shiny = FALSE)

12. The MspI-digestion Reduced Representation Bisulfite Sequencing (RRBS) protocol involves an end-repair step that introduces an unmethylated cytosine at the end of fragments. This cytosine should not be used for methylation calling and can be trimmed by removing 2 bases at the end of reads using the --rrbs option of TrimGalore:

trim_galore --rrbs --paired read1.fastq.gz read2.fastq.gz

13. The alignment step requires a pre-indexed bisulfite-converted genome. This can be generated using the bismark_genome_preparation tool. The single required argument points to a directory that contains the reference genome FASTA file(s).

bismark_genome_preparation /path/to/genomes/grch38/

14. Deduplication is typically performed by retaining only one read (or read pair) for each unique mapping start position. This is done under the assumption that the number of potential molecules to sample from is large (on the order of the number of base pairs in the genome) compared to the number of sequenced reads, and thus reads with the same start position are more likely to represent PCR duplicates than to have arisen from different pre-amplification DNA fragments. However, since reduced representation bisulfite sequencing (RRBS) libraries and capture/hybrid selection bisulfite libraries only represent a small portion of the genome, this assumption does not hold.
15. Adding the --CX_context option to bismark_methylation_extractor will quantify methylation at all cytosines (on both strands), as opposed to only those in the CpG context. Note that this will generate an output file with over a billion lines for a human genome.
16. A similar set of commands can be run for visualizing scATAC data in the .bigwig format. Similar to the single-cell methylation data, these accessibility profiles should most likely be pooled across cells. Note that bedGraphToBigWig takes an input file that specifies the length of each chromosome. This can be generated using samtools as follows:

samtools faidx grch38.fa
cut -f1,2 grch38.fa.fai > grch38_chrom_sizes.txt

17. Current single-cell methylation assays are able to capture only a small fraction of the ~28 million CpGs in the genome. As a result, the CpG coverage and methylation output of bismark_methylation_extractor will contain a different set of cytosines for each cell. Excluding the uninformative cytosines with zero coverage saves disk space, but complicates the step of collating data from multiple cells into a single table containing the union of all observed CpGs across cells. The bismark_methylation_extractor --cytosine_report option will force it to output all CpGs, thus creating a common set of rows for all cells. This allows collation to be done simply by pasting together the methylation estimate columns from each cell's coverage file.
18. The R/Bioconductor script below is a basic template for collating the per-cell cytosine coverage outputs from bismark_methylation_extractor:

# Replace this with the location/names of the output from
# bismark_methylation_extractor
covgz_files <- c("cell_1.bismark.cov.gz",
                 "cell_2.bismark.cov.gz",
                 "cell_3.bismark.cov.gz")

library(Biostrings)
library(bsseq)
library(BSgenome.Hsapiens.NCBI.GRCh38)
library(readr)
library(HDF5Array)

# Read one coverage file and map its counts onto the genome-wide CpG set
getMethCov <- function(covgz_file, gr) {
  tab <- read_tsv(covgz_file,
                  col_types = "ciidii",
                  col_names = c("chr", "pos", "pos2", "meth_percent",
                                "m_count", "u_count"))
  tab_gr <- GRanges(tab$chr, IRanges(tab$pos, tab$pos))
  m <- rep(0, length(gr))
  cov <- rep(0, length(gr))
  ovl <- suppressWarnings(findOverlaps(tab_gr, gr))
  m[subjectHits(ovl)] <- tab$m_count[queryHits(ovl)]
  cov[subjectHits(ovl)] <- tab$m_count[queryHits(ovl)] +
    tab$u_count[queryHits(ovl)]
  return(list(m = m, cov = as.integer(cov)))
}

# Set up genome-wide CpG GRanges
# On the plus strand we keep the left-most position of the match
# On the minus strand we keep the right-most position of the match
cpg_gr <- DNAString("CG")
cpg_gr <- vmatchPattern(cpg_gr, BSgenome.Hsapiens.NCBI.GRCh38)
cpg_gr <- keepStandardChromosomes(cpg_gr, pruning.mode = "coarse")
s <- start(cpg_gr)
e <- end(cpg_gr)
plus_idx <- as.logical(strand(cpg_gr) == "+")
minus_idx <- as.logical(strand(cpg_gr) == "-")
e[plus_idx] <- s[plus_idx] # Plus strand
s[minus_idx] <- e[minus_idx] # Minus strand
start(cpg_gr) <- s
end(cpg_gr) <- e

# Get methylation, coverage matrices and store on disk as HDF5Arrays
hdf5_m <- list()
hdf5_cov <- list()
for (covgz_file in covgz_files) {
  tmp <- getMethCov(covgz_file, cpg_gr)
  m <- tmp$m
  cov <- tmp$cov
  # Strip the file suffix so that the sample is named, e.g., "cell_1"
  samplename <- basename(covgz_file)
  samplename <- gsub(samplename, pattern = ".bismark.cov.gz",
                     replacement = "", fixed = TRUE)
  hdf5_file <- paste0(samplename, ".hdf5")
  if (file.exists(hdf5_file)) file.remove(hdf5_file)
  hdf5_m[[samplename]] <- writeHDF5Array(matrix(m),
                                         name = "m", file = hdf5_file)
  hdf5_cov[[samplename]] <- writeHDF5Array(matrix(cov),
                                           name = "cov", file = hdf5_file)
}
M <- do.call("cbind", hdf5_m)
Cov <- do.call("cbind", hdf5_cov)

# Create a bsseq object ready for downstream analysis
bs <- BSseq(gr = cpg_gr, M = M, Cov = Cov)

19. We provide sample code for running specific downstream parts of these described methods here: https://github.com/aryeelab/mmb-scEpigenomics
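
The paste-based collation described in Note 17 (and in step 6 of Subheading 3.3) could look like the sketch below. It is an illustration under stated assumptions rather than a tested pipeline component: the per-cell tables are assumed to be decompressed, tab-delimited, and to carry the chromosome, position, and methylation estimate in columns 1, 2, and 4; the function name collate_cov is hypothetical. The function refuses to paste when a cell's CpG rows differ from those of the first cell, since pasting misaligned rows would silently corrupt the matrix.

```shell
# collate_cov cell_1.cov cell_2.cov ...
# Prints chromosome, position, then one methylation-estimate column per cell.
collate_cov() (
  tmp=$(mktemp -d) || exit 1
  cut -f1,2 "$1" > "$tmp/keys"                  # CpG rows of the first cell
  i=0
  for f in "$@"; do
    # every cell must list the same CpGs in the same order
    if ! cut -f1,2 "$f" | cmp -s - "$tmp/keys"; then
      echo "CpG rows in $f differ from $1" >&2
      rm -rf "$tmp"
      exit 1
    fi
    i=$((i + 1))
    # zero-padded names keep the columns in input order when globbed
    cut -f4 "$f" > "$tmp/$(printf 'col.%06d' "$i")"
  done
  paste "$tmp/keys" "$tmp"/col.*
  status=$?
  rm -rf "$tmp"
  exit "$status"
)

# Example call (file names are placeholders):
# collate_cov cell_1.cov cell_2.cov cell_3.cov > all_cells.tsv
```

Writing the column headers (cell names) separately keeps the table purely numeric after the first two columns, which makes it easy to load into other tools.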

Acknowledgments

We are grateful to Jason Buenrostro for useful feedback in the discussion of the scATAC-seq computational analyses.

References

1. Schep AN, Wu B, Buenrostro JD, Greenleaf WJ (2017) chromVAR: inferring transcription-factor-associated accessibility from single-cell epigenomic data. Nat Methods 14(10):975–978. https://doi.org/10.1038/nmeth.4401
2. Krueger F, Andrews SR (2011) Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications. Bioinformatics 27(11):1571–1572. https://doi.org/10.1093/bioinformatics/btr167
3. Langmead B, Salzberg SL (2012) Fast gapped-read alignment with Bowtie 2. Nat Methods 9(4):357–359. https://doi.org/10.1038/nmeth.1923
4. Huber W, Carey VJ, Gentleman R, Anders S, Carlson M, Carvalho BS, Bravo HC, Davis S, Gatto L, Girke T, Gottardo R, Hahne F, Hansen KD, Irizarry RA, Lawrence M, Love MI, MacDonald J, Obenchain V, Oles AK, Pages H, Reyes A, Shannon P, Smyth GK, Tenenbaum D, Waldron L, Morgan M (2015) Orchestrating high-throughput genomic analysis with bioconductor. Nat Methods 12(2):115–121. https://doi.org/10.1038/nmeth.3252
5. Quinlan AR, Hall IM (2010) BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26(6):841–842. https://doi.org/10.1093/bioinformatics/btq033
6. Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nusbaum C, Myers RM, Brown M, Li W, Liu XS (2008) Model-based analysis of ChIP-Seq (MACS). Genome Biol 9(9):R137. https://doi.org/10.1186/gb-2008-9-9-r137
7. Buenrostro JD, Wu B, Litzenburger UM, Ruff D, Gonzales ML, Snyder MP, Chang HY, Greenleaf WJ (2015) Single-cell chromatin accessibility reveals principles of regulatory variation. Nature 523(7561):486–490. https://doi.org/10.1038/nature14590
8. Cusanovich DA, Daza R, Adey A, Pliner HA, Christiansen L, Gunderson KL, Steemers FJ, Trapnell C, Shendure J (2015) Multiplex single cell profiling of chromatin accessibility by combinatorial cellular indexing. Science 348(6237):910–914. https://doi.org/10.1126/science.aab1601
9. Buenrostro JD, Giresi PG, Zaba LC, Chang HY, Greenleaf WJ (2013) Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat Methods 10(12):1213–1218. https://doi.org/10.1038/nmeth.2688
10. Lareau CA, Ulirsch JC, Bao EL, Ludwig LS, Guo MH, Benner C, Satpathy AT, Salem R, Hirschhorn JN, Finucane HK, Aryee MJ, Buenrostro JD, Sankaran VG (2018) Interrogation of human hematopoiesis at single-cell and single-variant resolution. bioRxiv. https://doi.org/10.1101/255224
Chapter 14

Experimental and Computational Approaches for Single-Cell Enhancer Perturbation Assay

Shiqi Xie and Gary C. Hon

Abstract
Transcriptional enhancers drive cell-type-specific gene expression patterns and thus play key roles in
development and disease. Large-scale consortia have cataloged more than one million putative enhancers
encoded in the human genome, but few of them have been tested for function in their endogenous context.
For almost all enhancers, it remains unknown which genes they target and how much they contribute to
target gene expression. We previously developed a method called Mosaic-seq, which enables high-throughput
interrogation of enhancer activity by performing pooled CRISPRi-based epigenetic suppression of enhancers
with a single-cell transcriptomic readout. Here, we describe an optimized version of this method,
Mosaic-seq2. We have made several key improvements that significantly simplify the library preparation
process and increase the overall sensitivity and throughput of the method.

Key words Single-cell RNA-seq, Enhancer, CRISPRi, Single-cell perturbation

1 Introduction

Changes in cell state are often driven by changes in gene expression.
However, we currently lack a predictive understanding of gene
regulation, and this is a major impediment to a deterministic
understanding of development and disease. Regulatory elements
called transcriptional enhancers are the key drivers of tissue- and
disease-specific gene expression, and major consortia have identified
more than one million putative enhancers in the human genome [1, 2].
A unique feature of enhancers is their ability to activate genes from
large distances, occasionally more than 1 megabase away [3]. This
feature has significantly complicated the accurate identification of
enhancer target genes. Correlation analyses have predicted enhancer
target genes, but few of these predictions have been experimentally
validated. As a result, despite the many developmental and disease
systems in which enhancers have been mapped, the vast majority of
predicted enhancers have no known targets. This is a key obstacle to
determining the biomedical roles of each enhancer.

Guo-Cheng Yuan (ed.), Computational Methods for Single-Cell Data Analysis, Methods in Molecular Biology, vol. 1935,
https://doi.org/10.1007/978-1-4939-9057-3_14, © Springer Science+Business Media, LLC, part of Springer Nature 2019


We have recently developed Mosaic-seq, a highly multiplexed
enhancer perturbation technology that assesses endogenous
enhancer activities without gene tagging or phenotypic selection [4].
By combining CRISPRi [5, 6] with single-cell RNA sequencing,
Mosaic-seq simultaneously measures sgRNA perturbations and the
transcriptomic outcomes of these perturbations in the same single
cell. This strategy is analogous to those described in Perturb-seq,
CRISP-seq, and CROP-seq [7–9]. Mosaic-seq can identify the target
genes of any enhancer in an unbiased manner. Moreover, it provides
unique information about single-cell enhancer usage and
combinatorial enhancer activity [4].
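
To make the pairing of a perturbation with its transcriptomic outcome concrete, the following minimal Python sketch compares a gene's mean expression between cells that carry a given sgRNA and cells that do not. It is an illustration only and not the statistical test used by Mosaic-seq: the function name log2_fold_change, the dictionary-based inputs, and the pseudocount of 1 are all assumptions made for this example.

```python
from math import log2
from statistics import mean


def log2_fold_change(expression, guides_per_cell, sgrna, gene, pseudocount=1.0):
    """Compare mean expression of `gene` in cells with vs. without `sgrna`.

    expression      : dict, cell id -> dict of gene -> normalized expression
    guides_per_cell : dict, cell id -> set of sgRNA names detected in that cell
    All names and the pseudocount are hypothetical choices for illustration.
    """
    with_guide, without_guide = [], []
    for cell, genes in expression.items():
        value = genes.get(gene, 0.0)
        if sgrna in guides_per_cell.get(cell, set()):
            with_guide.append(value)
        else:
            without_guide.append(value)
    if not with_guide or not without_guide:
        raise ValueError("both groups need at least one cell")
    # The pseudocount keeps the ratio finite when a group mean is zero.
    return log2((mean(with_guide) + pseudocount) / (mean(without_guide) + pseudocount))


# Toy data: two cells carry an enhancer-targeting sgRNA, two do not.
toy_expression = {
    "cell1": {"GENE_A": 1.0},
    "cell2": {"GENE_A": 1.0},
    "cell3": {"GENE_A": 7.0},
    "cell4": {"GENE_A": 7.0},
}
toy_guides = {"cell1": {"sgEnh1"}, "cell2": {"sgEnh1"}, "cell3": set(), "cell4": set()}

print(log2_fold_change(toy_expression, toy_guides, "sgEnh1", "GENE_A"))  # -2.0
```

A negative value indicates lower expression in the perturbed cells, which is the direction expected when CRISPRi represses an enhancer that activates the gene. A real analysis would add a significance test and a comparison against control sgRNAs.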

In this chapter, we introduce Mosaic-seq2, with key changes to
increase sensitivity and throughput (Fig. 1). First, we adapt the
method to the 10x Genomics single-cell RNA-seq platform, which
enables the generation of 80,000 single-cell RNA-seq libraries in
a single run. Second, we incorporate the CROP-seq design for
sgRNA expression, which enables direct detection of sgRNAs
without the need for barcoding [9]. This design dramatically
simplifies the construction of sgRNA plasmid libraries, and also
eliminates the shuffling of sgRNAs and barcodes due to retroviral
recombination [10, 11]. Third, as in Perturb-seq, we implement an
enrichment PCR step to increase the detection efficiency of the
sgRNAs in each single cell [7]. These improvements significantly
increase the throughput of Mosaic-seq, which now allows
simultaneous perturbation of hundreds to thousands of enhancers.
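
Because the sgRNA is read directly from cDNA that already carries a cell barcode and UMI, assigning perturbations to cells comes down to counting distinct molecules per barcode. The sketch below assumes the enrichment reads have already been parsed into (cell barcode, UMI, sgRNA) tuples. The function name assign_sgrnas and the min_umis cutoff are hypothetical and chosen for illustration; they are not parameters taken from this protocol.

```python
from collections import defaultdict


def assign_sgrnas(reads, min_umis=2):
    """Call sgRNAs per cell from enrichment-library reads.

    reads    : iterable of (cell_barcode, umi, sgrna) tuples
    min_umis : hypothetical cutoff on distinct UMIs needed to call an sgRNA
    Returns a dict mapping cell barcode -> sorted list of called sgRNAs.
    """
    # Collapse PCR duplicates: each UMI counts once per (cell, sgRNA) pair.
    umis = defaultdict(set)
    for cell, umi, sgrna in reads:
        umis[(cell, sgrna)].add(umi)

    calls = defaultdict(list)
    for (cell, sgrna), seen in umis.items():
        if len(seen) >= min_umis:
            calls[cell].append(sgrna)
    return {cell: sorted(names) for cell, names in calls.items()}


toy_reads = [
    ("AAAC", "u1", "sgA"),
    ("AAAC", "u1", "sgA"),  # duplicate read of the same molecule
    ("AAAC", "u2", "sgA"),
    ("AAAC", "u3", "sgB"),  # single UMI, below the cutoff
    ("TTTG", "u1", "sgB"),
    ("TTTG", "u2", "sgB"),
]
print(assign_sgrnas(toy_reads))
```

Counting UMIs rather than raw reads keeps a heavily amplified molecule from dominating the call, which matters after an enrichment PCR step.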

Below, we describe Mosaic-seq2 using K562 cells as an example.
This chapter details (1) the design and construction of the sgRNA
libraries, (2) lentiviral packaging and infection, (3) preparation
of single-cell RNA-seq libraries, and (4) computational analysis.
This protocol can be readily extended to other systems. Overall,
single-cell perturbation approaches like Mosaic-seq2 will enable an
understanding of how enhancers function in single cells and how
they define a cell’s transcriptional regulatory network.
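
As a small illustration of step (1), the sketch below assembles pool oligos by flanking a protospacer with the two constant arms printed in the "sgRNA Oligo Pool" row of Table 1. It assumes that the run of N in that row stands for a 20-nt protospacer. The names LEFT_ARM, RIGHT_ARM, SPACER_LENGTH, and build_pool_oligo are hypothetical, and the example protospacer is arbitrary rather than a designed guide.

```python
# Constant arms copied from the "sgRNA Oligo Pool" row of Table 1.
LEFT_ARM = "GTGGAAAGGACGAAACACC"
RIGHT_ARM = "GTTTTAGAGCTAGGCCAACATGAGGATCAC"
# Assumption: the run of N in Table 1 represents a 20-nt protospacer.
SPACER_LENGTH = 20


def build_pool_oligo(protospacer):
    """Return the full oligo for one sgRNA: left arm + protospacer + right arm."""
    seq = protospacer.upper()
    if len(seq) != SPACER_LENGTH:
        raise ValueError(f"protospacer must be {SPACER_LENGTH} nt, got {len(seq)}")
    if set(seq) - set("ACGT"):
        raise ValueError("protospacer may contain only A, C, G, and T")
    return LEFT_ARM + seq + RIGHT_ARM


# Arbitrary placeholder protospacer, not a real guide sequence.
example_spacer = "ACGT" * 5
example_oligo = build_pool_oligo(example_spacer)
print(example_oligo)
print(len(example_oligo))
```

Validating length and alphabet before ordering a pool is a cheap guard against malformed entries, since every oligo in the pool is amplified with the same primer pair.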

2 Materials

2.1 Cell Culture

1. K562 cells and HEK293T cells (both from ATCC).
2. Phosphate-buffered saline (PBS), pH 7.4.
3. 0.25% Trypsin-EDTA.
4. Complete cell culture medium: Iscove’s Modified Dulbecco’s
Medium (IMDM, for K562) or Dulbecco’s Modified Eagle’s
Medium (DMEM, for HEK293T) with 10% FBS, 100 U/mL
Penicillin-Streptomycin (Thermo Fisher Scientific).
5. 10 cm cell culture dishes.
6. Hemocytometer.

[Figure 1: workflow schematic. Recoverable labels: K562 dCas9-KRAB cells
infected by the sgRNA library targeting enhancers; cells infected by sgNC
viruses; mix, spike-in 5% sgNC cells; 10x Genomics Chromium platform
(cells, oil, gel beads); reverse transcription with UMI and cell barcode;
full-length cDNA; fragmentation and library prep (original 10x library);
enrichment of sgRNA (sgRNA library); P5 and P7 adapters with library index.]

Fig. 1 Overview of Mosaic-seq2. Preparation of single-cell RNA-seq libraries generally follows the 10x Genomics
library preparation procedures, except that an sgRNA enrichment library is amplified from full-length cDNA

2.2 Plasmids and sgRNA Library Construction

1. Lentivirus packaging plasmids: pMD2.G and psPAX2
(Addgene ID 12259 and 12260).
2. CROP-seq plasmid for sgRNA expression (Addgene ID
86708).
3. lenti-dCas9-KRAB plasmids (Addgene ID 89567).
4. sgRNA oligo library (Table 1).

Table 1
List of oligos (see Note 1)

Name                      Sequence
Oligo Amp Fwd             TAACTTGAAAGTATTTCGATTTCTTGGCTTTATATATCTTGTGGAAAGGACGAAACACCG
Oligo Amp Rev             ATTTTAACTTGCTAGGCCCTGCAGACATGGGTGATCCTCATGTTGGCCTAGCTCTAAAAC
sgRNA Oligo Pool          GTGGAAAGGACGAAACACCNNNNNNNNNNNNNNNNNNNNGTTTTAGAGCTAGGCCAACATGAGGATCAC
LKO1_5 primer             GACTATCATATGCTTACCGT
sgRNA_amp                 TTGGCCTAGCTCTAAAAC
sgRNA-Lib staggered i5-1  AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTCTATCATATGCTTACCGT
sgRNA-Lib staggered i5-2  AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTACTATCATATGCTTACCGT
sgRNA-Lib staggered i5-3  AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTTACTATCATATGCTTACCGT
sgRNA-Lib staggered i5-4  AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTGTTCTATCATATGCTTACCGT