Computational Methods for Single-Cell Data Analysis
METHODS IN MOLECULAR BIOLOGY
Series Editor
John M. Walker
School of Life and Medical Sciences
University of Hertfordshire
Hatfield, Hertfordshire, AL10 9AB, UK
Edited by
Guo-Cheng Yuan
Dana–Farber Cancer Institute and Harvard Chan School of Public Health, Boston, MA, USA
This Humana Press imprint is published by the registered company Springer Science+Business Media, LLC, part of
Springer Nature.
The registered company address is: 233 Spring Street, New York, NY 10013, U.S.A.
Preface
The cell is the fundamental unit of life. The biological functions of an organ or tissue
result from the coordinated action of a large number of cells, each having its own properties
and dynamic behavior. While it is traditional to classify cells with similar function and
morphology as cell types, it is also well recognized that, even within each cell type, there
remain significant differences, or states, among individual cells. Current knowledge of the
repertoire of cell types and cell states, as well as their dynamic changes, remains highly
incomplete. Systematic, comprehensive characterization of the spatial and temporal
organization of cellular heterogeneity, along with the mechanisms underlying cell-type/state
transition and maintenance, has important implications for development and disease.
Only recently has it become feasible to systematically investigate cellular heterogeneity
at single-cell resolution, thanks to the rapid development of a number of advanced
technologies including sequencing, imaging, and microfluidic devices. Collectively,
single-cell technologies have created exciting opportunities to systematically characterize
the molecular behavior of individual cells at the omics scale. At the same time, the
analysis and integration of single-cell omic data are difficult due to a number of challenges
such as sparsity, technical variability, and spatial-temporal complexity.
During the past few years, numerous computational methods and software packages
have been developed to overcome these challenges. The aim of this book is to introduce
the community to the state of the art of computational approaches in single-cell data analysis.
Each chapter presents a computational toolbox aimed at overcoming a specific challenge
in single-cell analysis, such as data normalization, rare cell-type identification, or
spatial transcriptomics analysis. Rather than explaining the mathematical details, the
focus here is on hands-on implementation of computational methods for analyzing experimental
data. Taken together, these chapters cover a wide range of tasks and may serve as a handbook
for single-cell data analysis.
Finally, I would like to thank Prof. John M. Walker for his kind invitation and sustained
support throughout the preparation of this book. I would also like to express my sincere
gratitude to all the contributors for sharing their protocols.
Contents
Preface
Contributors
1 Quality Control of Single-Cell RNA-seq
Peng Jiang
2 Normalization for Single-Cell RNA-Seq Data Analysis
Rhonda Bacher
3 Analysis of Technical and Biological Variability in Single-Cell RNA Sequencing
Beomseok Kim, Eunmin Lee, and Jong Kyoung Kim
4 Identification of Cell Types from Single-Cell Transcriptomic Data
Karthik Shekhar and Vilas Menon
5 Rare Cell Type Detection
Lan Jiang
6 scMCA: A Tool to Define Mouse Cell Types Based on Single-Cell Digital Expression
Huiyu Sun, Yincong Zhou, Lijiang Fei, Haide Chen, and Guoji Guo
7 Differential Pathway Analysis
Jean Fan
8 Pseudotime Reconstruction Using TSCAN
Zhicheng Ji and Hongkai Ji
9 Estimating Differentiation Potency of Single Cells Using Single-Cell Entropy (SCENT)
Weiyan Chen and Andrew E. Teschendorff
10 Inference of Gene Co-expression Networks from Single-Cell RNA-Sequencing Data
Alicia T. Lamere and Jun Li
11 Single-Cell Allele-Specific Gene Expression Analysis
Meichen Dong and Yuchao Jiang
12 Using BRIE to Detect and Analyze Splicing Isoforms in scRNA-Seq Data
Yuanhua Huang and Guido Sanguinetti
13 Preprocessing and Computational Analysis of Single-Cell Epigenomic Datasets
Caleb Lareau, Divy Kangeyan, and Martin J. Aryee
Index
Contributors
MARTIN J. ARYEE Department of Biostatistics, Harvard T.H. Chan School of Public Health,
Boston, MA, USA; Department of Pathology, Massachusetts General Hospital, Boston, MA,
USA; Broad Institute of MIT and Harvard, Cambridge, MA, USA
RHONDA BACHER Department of Biostatistics, University of Florida, Gainesville, FL, USA
HAIDE CHEN Center for Stem Cell and Regenerative Medicine, Zhejiang University School
of Medicine, Hangzhou, China; Stem Cell Institute, Zhejiang University, Hangzhou,
China
WEIYAN CHEN CAS Key Lab of Computational Biology, CAS-MPG Partner Institute for
Computational Biology, Shanghai Institute of Nutrition and Health, Shanghai Institute of
Biological Sciences, University of Chinese Academy of Sciences, Chinese Academy of Sciences,
Shanghai, China
MEICHEN DONG Department of Biostatistics, Gillings School of Global Public Health,
University of North Carolina, Chapel Hill, NC, USA
JEAN FAN Department of Chemistry and Chemical Biology, Harvard University, Boston,
MA, USA
LIJIANG FEI Center for Stem Cell and Regenerative Medicine, Zhejiang University School of
Medicine, Hangzhou, China; Stem Cell Institute, Zhejiang University, Hangzhou, China
GUOJI GUO Center for Stem Cell and Regenerative Medicine, Zhejiang University School of
Medicine, Hangzhou, China; Stem Cell Institute, Zhejiang University, Hangzhou, China
GARY C. HON Department of Obstetrics and Gynecology, Cecil H. and Ida Green Center for
Reproductive Biology Sciences, University of Texas Southwestern Medical Center, Dallas,
TX, USA
YUANHUA HUANG EMBL-European Bioinformatics Institute, Cambridgeshire, UK
HONGKAI JI Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health,
Baltimore, MD, USA
ZHICHENG JI Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health,
Baltimore, MD, USA
LAN JIANG Howard Hughes Medical Institute, Boston Children’s Hospital, Boston, MA,
USA; Program in Cellular and Molecular Medicine, Boston Children’s Hospital, Boston,
MA, USA; Division of Hematology/Oncology, Department of Pediatrics, Boston Children’s
Hospital, Boston, MA, USA
PENG JIANG Regenerative Biology Laboratory, Morgridge Institute for Research, Madison,
WI, USA
YUCHAO JIANG Department of Biostatistics, Gillings School of Global Public Health,
University of North Carolina, Chapel Hill, NC, USA; Department of Genetics, School of
Medicine, University of North Carolina, Chapel Hill, NC, USA; Lineberger
Comprehensive Cancer Center, University of North Carolina, Chapel Hill, NC, USA
DIVY KANGEYAN Department of Biostatistics, Harvard T.H. Chan School of Public Health,
Boston, MA, USA; Department of Pathology, Massachusetts General Hospital, Boston, MA,
USA
BEOMSEOK KIM Department of New Biology, DGIST, Daegu, Republic of Korea
JONG KYOUNG KIM Department of New Biology, DGIST, Daegu, Republic of Korea
ALICIA T. LAMERE Mathematics Department, Bryant University, Smithfield, RI, USA
CALEB LAREAU Department of Biostatistics, Harvard T.H. Chan School of Public Health,
Boston, MA, USA; Department of Pathology, Massachusetts General Hospital, Boston, MA,
USA
EUNMIN LEE Department of New Biology, DGIST, Daegu, Republic of Korea
JUN LI Applied and Computational Mathematics and Statistics Department, University of
Notre Dame, Notre Dame, IN, USA
IDA LINDEMAN Wellcome Sanger Institute, Hinxton, Cambridge, UK; KG Jebsen Coeliac
Disease Research Centre and Department of Immunology, University of Oslo, Oslo, Norway
VILAS MENON Janelia Research Campus, Howard Hughes Medical Institute, Ashburn, VA,
USA; Columbia University Medical Center, New York, NY, USA
GUIDO SANGUINETTI School of Informatics, University of Edinburgh, Edinburgh, UK
KARTHIK SHEKHAR Klarman Cell Observatory, Broad Institute of MIT and Harvard,
Cambridge, MA, USA
MICHAEL J. T. STUBBINGTON Wellcome Sanger Institute, Hinxton, Cambridge, UK
HUIYU SUN Center for Stem Cell and Regenerative Medicine, Zhejiang University School of
Medicine, Hangzhou, China; Stem Cell Institute, Zhejiang University, Hangzhou, China
ANDREW E. TESCHENDORFF CAS Key Lab of Computational Biology, CAS-MPG Partner
Institute for Computational Biology, Shanghai Institute of Nutrition and Health,
Shanghai Institute of Biological Sciences, University of Chinese Academy of Sciences,
Chinese Academy of Sciences, Shanghai, China; UCL Cancer Institute, University College
London, London, UK
SHIQI XIE Department of Obstetrics and Gynecology, Cecil H. and Ida Green Center for
Reproductive Biology Sciences, University of Texas Southwestern Medical Center, Dallas,
TX, USA
YINCONG ZHOU Stem Cell Institute, Zhejiang University, Hangzhou, China; College of Life
Sciences, Zhejiang University, Hangzhou, China
QIAN ZHU Dana-Farber Cancer Institute, Boston, MA, USA
Chapter 1
Quality Control of Single-Cell RNA-seq
Peng Jiang
Abstract
Single-cell RNA-seq (scRNA-seq) is emerging as a promising technology to characterize and dissect
cell-to-cell variability. However, the mixture of technical noise and intrinsic biological variability makes
separating technical artifacts from real biological variation particularly challenging. Properly detecting
and filtering out technical artifacts before downstream analysis is critical. Here, we present a protocol that
integrates both gene expression patterns and data quality to detect technical artifacts in scRNA-seq samples.
Key words scRNA-seq, Quality control, Integrate, Gene expression patterns, Data quality
1 Introduction
Guo-Cheng Yuan (ed.), Computational Methods for Single-Cell Data Analysis, Methods in Molecular Biology, vol. 1935,
https://doi.org/10.1007/978-1-4939-9057-3_1, © Springer Science+Business Media, LLC, part of Springer Nature 2019
2 Materials
[Figure 1 image not reproduced. Panel labels: gene expression outliers; data quality (main population), high to low; cutoff (<5% of cells); technical artifacts with low data quality; subpopulation cells. The false positive rate is estimated as the percentage of main population cells that fail to pass the data quality cutoffs.]
Fig. 1 Illustration of the quality control (QC) framework for scRNA-seq. Cells are separated, based on gene
expression patterns, into gene expression outliers and cells of the main population. The data quality cutoffs are
determined by allowing a certain percentage (e.g., <5%) of main population cells to fail them.
Technical artifacts are defined as gene expression outliers that fail the data quality cutoffs.
Subpopulation cells are defined as gene expression outliers that pass the data quality cutoffs.
2.3 scRNA-seq Data
1. The raw scRNA-seq dataset (H1) can be accessed from the Gene Expression
Omnibus (GEO) under accession number GSE64016.
2. The files downloaded from GEO are in SRA format.
3. The SRA toolkit (http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software)
can be used to convert files from SRA format to FASTQ format via the
“fastq-dump” utility.
3 Methods
3.2 Single-Cell Capture and cDNA Library Preparation
1. 5000–8000 cells were loaded onto a medium-size (10–17 μm)
C1 Single-Cell Auto Prep IFC (Fluidigm).
2. The capture efficiency was inspected using the EVOS FL Auto Cell
Imaging system (Life Technologies) to perform an automated
area scan of the 96 capture sites on the IFC.
3. Empty capture sites or sites having more than one cell captured
were first noted, and those samples were later excluded from
further library processing for RNA-seq.
4. Immediately after capture and imaging, reverse transcription
and cDNA amplification were performed in the C1 system
using the SMARTer PCR cDNA Synthesis kit (Clontech) and
the Advantage 2 PCR kit (Clontech).
5. Full-length, single-cell cDNA libraries were harvested the next
day from the C1 chip and diluted to a range of 0.1–0.3 ng/μL.
6. Diluted single-cell cDNA libraries were fragmented and amplified
using the Nextera XT DNA Sample Preparation Kit and
the Nextera XT DNA Sample Preparation Index Kit (Illumina).
7. Libraries were multiplexed at 24 libraries per lane, and
single-end reads of 67 bp were sequenced on an Illumina HiSeq 2500
system.
3.3 Reads Mapping
1. Use Bowtie [18] to map raw reads against the reference
genes (e.g., the human hg19 RefSeq reference), allowing up to
two mismatches and a maximum of 20 multiple hits.
2. Expected read counts and TPMs for the mapped reads can be
estimated with RSEM [19].
3.5 Metrics to Evaluate the scRNA-seq Library Quality
1. Total number of mapped reads: the sum of mapped reads for all
the genes. An extremely low number of mapped reads may
affect the ability to characterize the transcriptome and could
be due to either a low mapping rate or other technical issues
introduced during sample prep or sequencing.
2. Mapping rate: the total number of mapped reads divided by the
read depth. Mapping rate can be affected by RNA degradation,
contamination with genomic DNA, or other technical issues
introduced during sample prep or sequencing.
3. Reads complexity: the ratio of unique reads (the count of reads
after removing duplicates) to the total number of reads.
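The three metrics above can be computed directly from per-cell read tallies. The following is a minimal R sketch; the data frame `reads`, its column names (`depth`, `mapped`, `unique`), and all values are hypothetical and serve only as an illustration.

```r
# Hypothetical per-cell read tallies (illustrative values only).
reads <- data.frame(
  cell   = c("c1", "c2", "c3", "c4"),
  depth  = c(2.0e6, 1.8e6, 2.2e6, 0.9e6),  # total reads sequenced
  mapped = c(1.6e6, 1.5e6, 0.4e6, 0.7e6),  # reads mapped to genes
  unique = c(1.2e6, 0.9e6, 1.1e6, 0.3e6)   # reads left after removing duplicates
)

# Metric 1: total number of mapped reads (already tallied per cell here).
reads$mapped_reads <- reads$mapped
# Metric 2: mapping rate = mapped reads / read depth.
reads$mapping_rate <- reads$mapped / reads$depth
# Metric 3: reads complexity = unique reads / all reads.
reads$complexity <- reads$unique / reads$depth

print(reads[, c("cell", "mapped_reads", "mapping_rate", "complexity")])
```

In this toy table, cell c3 stands out with a low mapping rate despite a high read depth, which is the kind of pattern these metrics are meant to expose.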
3.6 Combining Library Quality Metrics to Combined Scores
1. For each cell, calculate a quantile score (QS) for each quality
metric. Given a metric, the QS of a cell is defined as the number
of other cells in the dataset with equal or lower values divided
by the total number of cells. For example, if a cell has the 20th
highest mapping rate among a set of 80 cells, then 60 cells have
lower values and the mapping rate QS for this cell is 60/80 = 0.75.
A higher QS indicates better data quality.
2. Minimal Quantile Score (MQS): the minimal QS of the three
quality metrics.
$$\mathrm{MQS} = \min_{i \in \{\text{mapped reads},\ \text{mapping rate},\ \text{reads complexity}\}} \mathrm{QS}_i$$
MQS assumes that each of the three quality metrics is critical
and that a deficiency in any of the three is a potential indicator of
technical issues. Thus the “final quality” of a cell depends on its
lowest quality metric score.
3. Weighted Combined Quality Score (WCQS): WCQS assumes
that the importance of each quality metric may depend on the
specific experimental batch, protocol, and/or conditions.
It further assumes that the importance of each quality metric for
detecting technical artifacts is proportional to its ability to
discriminate between gene expression outliers and cells of the
main population. For example, if the mapping rate of a given
batch of cells can perfectly discriminate between gene expression
outliers and cells of the main population, then the mapping rate
is more likely to be a dominant factor in detecting technical
artifacts. In contrast, if a metric does not indicate differences
between gene expression outliers and cells of the main population,
then it should be removed
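The quantile score and MQS definitions above translate into a few lines of R. This is a minimal sketch, not the protocol's own implementation: the function name `quantile_score` and the toy metric values are hypothetical, and WCQS is left out because its weighting is not fully specified in this section.

```r
# Quantile score: number of OTHER cells with an equal or lower value,
# divided by the total number of cells.
quantile_score <- function(x) {
  n <- length(x)
  vapply(seq_len(n), function(i) sum(x[-i] <= x[i]) / n, numeric(1))
}

# Check against the worked example in the text: 80 cells with distinct
# mapping rates; the cell with the 20th highest value has 60 cells below it.
mapping_rate <- (11:90) / 100                       # 80 distinct, increasing values
qs_example <- quantile_score(mapping_rate)[61]      # 61st lowest = 20th highest
print(qs_example)                                   # 0.75

# Toy quality metrics for five cells (illustrative values only).
metrics <- data.frame(
  mapped_reads = c(5, 3, 4, 1, 2),
  mapping_rate = c(0.9, 0.8, 0.2, 0.7, 0.6),
  complexity   = c(0.5, 0.6, 0.7, 0.4, 0.8)
)

# One QS column per metric, then MQS as the row-wise minimum.
qs  <- sapply(metrics, quantile_score)
mqs <- apply(qs, 1, min)
print(mqs)
```

Taking the row-wise minimum means that a cell is penalized by its single worst metric, which matches the stated assumption that each metric is critical on its own.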
3.7 Identification of Technical Artifacts
1. We assume that good quality cells should pass particular MQS
and WCQS cutoffs. We use cells of the main population as
controls to determine these cutoffs (see Note 4). You can
enumerate all possible combinatorial pairs of MQS and
WCQS cutoffs in a given dataset, calculate the fraction of cells
of the main population that pass both cutoffs of a pair, and then
use the remaining cells of the main population to estimate the
corresponding false positive rate (FPR) for that pair (Fig. 1).
2. If more than one pair of MQS and WCQS cutoffs results in the
same FPR, you can choose the cutoff pair that maximizes the
percentage of gene expression outliers failing to pass.
3. Apply these cutoffs to the gene expression outliers to identify
technical artifacts. Technical artifacts are defined as gene
expression outliers with poor data quality measurements.
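The cutoff search in steps 1 to 3 can be sketched as a small grid search. This is one possible reading of the procedure, not the protocol's own code: the toy scores, the helper `passes`, the 5% FPR limit applied as a strict upper bound, and the tie-breaking rule are all assumptions made for illustration.

```r
# Toy MQS/WCQS values (hypothetical) for main-population cells and outliers.
main <- data.frame(
  mqs  = c(0.05, 0.30, 0.35, 0.40, 0.50, 0.55, 0.60, 0.70, 0.80, 0.90),
  wcqs = c(0.40, 0.10, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70, 0.80, 0.90)
)
outliers <- data.frame(
  mqs  = c(0.02, 0.20, 0.60, 0.75),
  wcqs = c(0.30, 0.05, 0.08, 0.85)
)

# A cell passes a cutoff pair when both scores reach their cutoffs.
passes <- function(cells, mqs_cut, wcqs_cut) {
  cells$mqs >= mqs_cut & cells$wcqs >= wcqs_cut
}

# Enumerate all pairs of observed score values as candidate cutoffs.
grid <- expand.grid(
  mqs_cut  = sort(unique(c(main$mqs, outliers$mqs))),
  wcqs_cut = sort(unique(c(main$wcqs, outliers$wcqs)))
)
# FPR: fraction of main-population cells that fail the pair.
grid$fpr <- mapply(function(a, b) mean(!passes(main, a, b)),
                   grid$mqs_cut, grid$wcqs_cut)
# Fraction of gene expression outliers that fail the pair.
grid$outlier_fail <- mapply(function(a, b) mean(!passes(outliers, a, b)),
                            grid$mqs_cut, grid$wcqs_cut)

# Keep pairs under the FPR limit, then prefer the pair that fails the most
# outliers (ties broken by lower FPR).
max_fpr <- 0.05
ok   <- grid[grid$fpr < max_fpr, ]
best <- ok[order(-ok$outlier_fail, ok$fpr), ][1, ]

# Outliers that fail the chosen cutoffs are flagged as technical artifacts.
artifact <- !passes(outliers, best$mqs_cut, best$wcqs_cut)
print(best)
print(artifact)
```

With these toy values, the outlier that passes both cutoffs would be treated as a candidate subpopulation cell rather than a technical artifact.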
4 Notes
References
1. Eberwine J, Sul J-Y, Bartfai T, Kim J (2014) The promise of single-cell sequencing. Nat Methods 11(1):25–27
2. Trapnell C, Cacchiarelli D, Grimsby J, Pokharel P, Li S, Morse M et al (2014) The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat Biotechnol 32(4):381–386. Epub 2014/03/25. https://doi.org/10.1038/nbt.2859
3. Xue Z, Huang K, Cai C, Cai L, Jiang CY, Feng Y et al (2013) Genetic programs in human and mouse early embryos revealed by single-cell RNA sequencing. Nature 500(7464):593–597. Epub 2013/07/31. https://doi.org/10.1038/nature12364
4. Pollen AA, Nowakowski TJ, Shuga J, Wang X, Leyrat AA, Lui JH et al (2014) Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex. Nat Biotechnol 32(10):1053–1058. Epub 2014/08/05. https://doi.org/10.1038/nbt.2967
5. Patel AP, Tirosh I, Trombetta JJ, Shalek AK, Gillespie SM, Wakimoto H et al (2014) Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science 344(6190):1396–1401. Epub 2014/06/14. https://doi.org/10.1126/science.1254257
6. Sandberg R (2014) Entering the era of single-cell transcriptomics in biology and medicine. Nat Methods 11(1):22–24
7. Stegle O, Teichmann SA, Marioni JC (2015) Computational and analytical challenges in single-cell transcriptomics. Nat Rev Genet 16(3):133–145
8. Kharchenko PV, Silberstein L, Scadden DT (2014) Bayesian approach to single-cell differential expression analysis. Nat Methods 11(7):740–742
9. Munsky B, Neuert G, van Oudenaarden A (2012) Using gene expression noise to understand gene regulation. Science 336(6078):183–187
10. Ting DT, Wittner BS, Ligorio M, Vincent Jordan N, Shah AM, Miyamoto DT et al (2014) Single-cell RNA sequencing identifies extracellular matrix gene expression by pancreatic circulating tumor cells. Cell Rep 8(6):1905–1918. Epub 2014/09/23. https://doi.org/10.1016/j.celrep.2014.08.029
11. Treutlein B, Brownfield DG, Wu AR, Neff NF, Mantalas GL, Espinoza FH et al (2014) Reconstructing lineage hierarchies of the distal lung epithelium using single-cell RNA-seq. Nature 509(7500):371–375. Epub 2014/04/18. https://doi.org/10.1038/nature13173
12. Oyolu C, Zakharia F, Baker J (2012) Distinguishing human cell types based on housekeeping gene signatures. Stem Cells 30(3):580–584
13. Zeisel A, Muñoz-Manchado AB, Codeluppi S, Lönnerberg P, La Manno G, Juréus A et al (2015) Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 347(6226):1138–1142
14. Kumar RM, Cahan P, Shalek AK, Satija R, DaleyKeyser AJ, Li H et al (2014) Deconstructing transcriptional heterogeneity in pluripotent stem cells. Nature 516(7529):56–61. Epub 2014/12/05. https://doi.org/10.1038/nature13920
15. Jiang P, Thomson JA, Stewart R (2016) Quality control of single-cell RNA-seq by SinQC. Bioinformatics 32(16):2514–2516. https://doi.org/10.1093/bioinformatics/btw176
16. Leng N, Chu LF, Barry C, Li Y, Choi J, Li X et al (2015) Oscope identifies oscillatory genes in unsynchronized single-cell RNA-seq experiments. Nat Methods 12(10):947–950. https://doi.org/10.1038/nmeth.3549. PubMed PMID: 26301841; PubMed Central PMCID: PMC4589503
17. Chen G, Gulbranson DR, Hou Z, Bolin JM, Ruotti V, Probasco MD et al (2011) Chemically defined conditions for human iPSC derivation and culture. Nat Methods 8(5):424–429. https://doi.org/10.1038/nmeth.1593
18. Langmead B, Trapnell C, Pop M, Salzberg SL (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10(3):R25. https://doi.org/10.1186/gb-2009-10-3-r25
19. Li B, Dewey CN (2011) RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics 12:323. https://doi.org/10.1186/1471-2105-12-323
20. Islam S, Kjällquist U, Moliner A, Zajac P, Fan J-B, Lönnerberg P et al (2011) Characterization of the single-cell transcriptional landscape by highly multiplex RNA-seq. Genome Res 21(7):1160–1167
Chapter 2
Normalization for Single-Cell RNA-Seq Data Analysis
Rhonda Bacher
Abstract
In this chapter, we describe a robust normalization method for single-cell RNA sequencing data. The
procedure, SCnorm, is implemented in R and is part of Bioconductor. The package also includes
diagnostic functions to visualize normalization performance. This chapter gives an overview of the
methodology and provides example workflows.
Key words Single-cell RNA-seq, Normalization, Gene expression, Read count, High-throughput
sequencing
1 Introduction
Guo-Cheng Yuan (ed.), Computational Methods for Single-Cell Data Analysis, Methods in Molecular Biology, vol. 1935,
https://doi.org/10.1007/978-1-4939-9057-3_2, © Springer Science+Business Media, LLC, part of Springer Nature 2019
[Figure 1 image not reproduced. Panels A (bulk, un-normalized), B (bulk, normalized), C (single-cell, un-normalized), and D (single-cell, normalized). Each panel pairs a plot of log expression versus log sequencing depth with a density plot of slopes; the color scale runs from low to high expression.]
Fig. 1 For each gene, median quantile regression was used to estimate the count–depth relationship for a bulk
and single-cell RNA-seq dataset before and after normalization. (a) The left plot shows log un-normalized
expression versus log depth with estimated regression fits for three genes in a bulk RNA-seq dataset
containing no zero measurements and having low, moderate, and high expression defined as the median
expression among nonzero un-normalized measurements in the 10th to 20th quantile (blue), 40th to 50th
quantile (black), and 80th to 90th quantile (red), respectively. On the right, the estimated regression fits for all
genes within ten equally sized gene groups where genes were grouped by their median expression among
nonzero un-normalized measurements. (b) Similar to a for normalized bulk RNA-seq data. (c and d) Similar to
a and b for single-cell RNA-seq data
2 Materials
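# biocLite() has been retired; BiocManager is the current Bioconductor installer.
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("SCnorm")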
3 Methods
$$Q^{\tau_k, d_k}(Y_j \mid X_j) = \beta_0^{\tau_k} + \beta_1^{\tau_k} X_j + \ldots + \beta_{d_k}^{\tau_k} X_j^{d_k}$$
4 Notes
Note 1 reviews all arguments for running the main functions in the
SCnorm package. Note 2 demonstrates a standard workflow for
normalizing scRNA-seq data and evaluating the normalization.
Note 3 reviews important considerations for normalization when
multiple conditions are present. Note 4 details how to use spike-ins
for across-condition scaling.
1. There are two main functions accessible by the user in the
SCnorm package:
SCnorm—implements the normalization procedure.
plotCountDepth—estimates and graphically displays the
count-depth relationships.
The SCnorm function requires only two arguments: the
un-normalized expression matrix (Data) and a vector denoting
the condition or batch each cell belongs to (Conditions). The
Data argument should contain un-normalized expression measure-
ments with genes (or other features) on the rows and cells on the
columns. The expected format of Data is a data matrix in R or of
class SummarizedExperiment of the SummarizedExperiment
R package [9]. The Conditions argument should be a vector with
length equal to the number of cells and match the exact order of the
columns of the Data argument.
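The expected shape of the two required arguments can be illustrated with a toy example. This is a minimal sketch in base R; the object names `counts` and `conditions` and all values are hypothetical, and a matrix this small is for illustration only and is not meant to be normalized.

```r
# Toy un-normalized count matrix: genes on rows, cells on columns.
counts <- matrix(
  c( 0,  3,  5,  1,  0,  2,
    10, 12,  9, 20, 25, 18,
     0,  0,  1,  0,  4,  0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"),
                  paste0("cell", 1:6))
)

# One condition label per cell, in the same order as the columns of counts.
conditions <- rep(c("batch1", "batch2"), each = 3)

# The two inputs must line up: one label per column.
stopifnot(is.matrix(counts), length(conditions) == ncol(counts))

# With the SCnorm package installed and a realistically sized dataset,
# these would be supplied as SCnorm(Data = counts, Conditions = conditions).
```

Checking the length of the condition vector against the number of columns before normalization catches the most common mismatch, a reordered or subsetted matrix.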
The SCnorm function will implement the entire normalization
procedure described above, automatically iterating until an optimal
K is reached. A dataset with 100 cells and 20,000 genes will take
approximately 15 min to run with three computing cores. The
computation time will increase as the number of cells and genes
increases, though increasing the number of cores can be used to
offset the increased time. In the example given, increasing to seven
cores reduces the time to around 8 min.
The output of SCnorm is a SummarizedExperiment object
containing at minimum the normalized data (NormalizedData)
and a list of the genes not included in the normalization
(GenesFilteredOut). Additional outputs may be generated using the
non-default options and are described in more detail below.
The full SCnorm function with default arguments is:
data in the instance that the user previously ran the normalization
but did not save the data or wishes to change arguments that only
affect the across-condition scaling step.
NCores: The number of cores to use. The more cores available,
the faster SCnorm will perform. By default, SCnorm will use one
less than the number of cores available on the machine.
ditherCounts: When this option is set to TRUE, counts will
be randomly jittered by 0.01 prior to fitting. With unique molecu-
lar identifier (UMI) scRNA-seq experiments, the data typically have
many tied count values, which occasionally cause the quantile
regression fit to fail. We find that dithering the counts by a small
value avoids this issue and does not otherwise affect the normaliza-
tion procedure or resulting normalized counts.
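The idea behind dithering can be shown on a toy vector of tied counts. This is only a sketch of the concept in base R, not SCnorm's internal code: the uniform noise distribution, the choice to perturb every value, and the vector `umi` are assumptions.

```r
set.seed(42)

# Toy UMI counts for one gene with many tied values.
umi <- c(1, 1, 1, 2, 2, 3, 0, 0)

# Add a small random offset of at most 0.01 to each count to break ties.
dithered <- umi + runif(length(umi), min = 0, max = 0.01)

anyDuplicated(umi) > 0        # ties present before dithering
anyDuplicated(dithered) > 0   # ties broken after dithering
max(abs(dithered - umi))      # perturbation stays within 0.01
```

Because the offsets are two orders of magnitude smaller than a single count, rounding the dithered values recovers the original counts.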
withinSample: As demonstrated in previous papers, gene-
specific features may vary across samples. We have implemented
the method from Risso et al. [14] if the user wishes to first normal-
ize the counts based on a gene-specific feature such as GC content
or gene length. This argument expects a vector of equal length to
the number of rows of Data (and in matching order) with values
representing the gene-specific feature to normalize. Note that
within-sample normalization should be used with caution, as it is
often specific to the experiment, and exploratory analyses are highly
recommended.
useZerosToScale: If set to TRUE, the zeros will be used
when scaling across conditions. Use of this argument depends on
which downstream differential expression tool will be used. If using
methods which test zeros separately from continuous counts, such
as MAST [12] or scDD [13], this option should remain FALSE.
However, for methods such as DESeq2 [11] which test all counts
together, this flag should be set to TRUE. A detailed example is
given in Note 3.
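Why the treatment of zeros matters can be seen by comparing condition means computed with and without them. The sketch below uses base R only; the two vectors and the helper `mean_nonzero` are hypothetical, and it illustrates the two kinds of means rather than SCnorm's scaling step itself.

```r
# Normalized counts for one gene in two conditions (illustrative values).
cond_a <- c(0, 0, 4, 6, 8)
cond_b <- c(0, 3, 5, 7, 9)

# Mean over nonzero counts only.
mean_nonzero <- function(x) mean(x[x > 0])

nonzero_means <- c(a = mean_nonzero(cond_a), b = mean_nonzero(cond_b))
all_means     <- c(a = mean(cond_a), b = mean(cond_b))

print(nonzero_means)   # equal across conditions: 6 and 6
print(all_means)       # differ once zeros are included: 3.6 and 4.8
```

Here the two conditions agree on the nonzero counts but differ in their proportion of zeros, so the choice determines whether the conditions look matched to a downstream test.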
useSpikes: We do not implement the use of spike-ins for
within-group normalization at this time because there are currently
too few to estimate scale factors robustly in all groups. However,
when multiple conditions or batches are being normalized, if this
argument is TRUE then spike-ins will be used to perform the
across-condition scaling. The spike-ins are expected to be named
following the convention of “ERCC-”. Additional details regarding the
use of spike-ins are given in Note 4.
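Rows that follow the “ERCC-” naming convention can be picked out by their row names. This is a minimal base R sketch; the toy matrix, the gene names, and the summary `spike_fraction` are hypothetical and are not part of the SCnorm interface.

```r
# Toy count matrix with two endogenous genes and two ERCC spike-ins.
counts <- matrix(
  c(40, 35, 50,
    10, 12,  9,
     5,  0,  7,
     5,  3,  4),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "ERCC-00002", "ERCC-00004"),
                  c("cell1", "cell2", "cell3"))
)

# Spike-ins are the rows whose names start with "ERCC-".
is_spike <- grepl("^ERCC-", rownames(counts))
spike_counts <- counts[is_spike, , drop = FALSE]

# Fraction of each cell's counts that comes from spike-ins.
spike_fraction <- colSums(spike_counts) / colSums(counts)
print(spike_fraction)
```

Anchoring the pattern with `^` avoids matching gene names that merely contain the string elsewhere.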
The plotCountDepth function is used to visualize the count-depth
relationships. It includes a wrapper for internal functions that
estimate the count-depth relationships and then outputs a plot.
During the normalization, if PrintProgressPlots=TRUE, multiple
calls will be made to the plotCountDepth function; otherwise
the function may be used stand-alone. The required
arguments are Data and Conditions, similar to the SCnorm
function. All genes will be split into ten equally sized groups
based on their nonzero un-normalized median expression. A
We will use the first two sheets in the Excel file, which can be
loaded into R by:
> library(readxl)
> h1cells.4M <- data.frame(read_excel("GSE85917_Bacher.RSEM.xlsx", sheet = 1),
                           stringsAsFactors = FALSE)
> h1cells.1M <- data.frame(read_excel("GSE85917_Bacher.RSEM.xlsx", sheet = 2),
                           stringsAsFactors = FALSE)
> library(SCnorm)
> cdr.1M <- plotCountDepth(Data = h1cells.1M,
                           Conditions = rep("1M", ncol(h1cells.1M)))
> cdr.4M <- plotCountDepth(Data = h1cells.4M,
                           Conditions = rep("4M", ncol(h1cells.4M)))
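The grouping of genes into ten equally sized groups by nonzero median expression, which underlies the plots produced by plotCountDepth, can be sketched in base R. This is an illustration under assumptions, not the package's internal code: the toy matrix, the rank-based split, and the names `nonzero_median` and `group` are hypothetical.

```r
# Toy matrix: 20 genes (rows) by 5 cells (columns); gene g has counts g * (0..4).
counts <- outer(1:20, c(0, 1, 2, 3, 4))
rownames(counts) <- paste0("gene", 1:20)

# Median of the nonzero un-normalized counts for each gene.
nonzero_median <- apply(counts, 1, function(x) median(x[x > 0]))

# Split genes into ten equally sized groups by rank of that median.
n_groups <- 10
group <- ceiling(rank(nonzero_median, ties.method = "first") /
                 (length(nonzero_median) / n_groups))

print(table(group))   # two genes per group in this toy example
```

Splitting by rank rather than by fixed expression thresholds keeps the groups the same size regardless of how skewed the expression distribution is.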
[Figure 2 image not reproduced. Panels A (H1-1M) and B (H1-4M) show densities of per-gene slopes for ten expression groups labeled by group median, from 1.67 (lowest) to 202.96 (highest) for H1-1M and from 2 (lowest) to 852.85 (highest) for H1-4M. Panels C and D plot density against slope.]
Fig. 2 Count-depth relationships before and during normalization for the H1-1M and H1-4M data. (a) For the
H1-1M dataset, the estimated count-depth relationships for all genes within ten equally sized gene groups
where genes were grouped by their median expression among nonzero un-normalized measurements. (b)
Similar to a for the H1-4M dataset. (c) The count-depth relationship is shown for the normalized counts for
each value of K tried by SCnorm for the H1-1M dataset. The genes remain in their initial expression groups as
shown in a. (d) Similar to c but for the H1-4M dataset
[Figure 3 image not reproduced. Four panels plot log(expression + 1) against log sequencing depth, with lines marking the H1-1M and H1-4M means; the bottom row is labeled useZeros = TRUE.]
Fig. 3 For a single gene, the log of the normalized counts for both the H1-1M (blue) and H1-4M (red) datasets
versus log sequencing depth are shown. A constant of one was added to the counts before taking the log to
highlight the zero counts. The top and bottom rows contain the normalized counts when
useZeros=FALSE or useZeros=TRUE, respectively. The left column shows the condition-specific means
calculated on the nonzero counts only. The right column shows the condition-specific means calculated over
all counts
> mean(spikeRatio[colnames(h1cells.1M)])
0.04553682
> mean(spikeRatio[colnames(h1cells.4M)])
0.0460152
References
Chapter 3
Analysis of Technical and Biological Variability in Single-Cell RNA Sequencing
Beomseok Kim, Eunmin Lee, and Jong Kyoung Kim
Abstract
Profiling the transcriptomes of individual cells with single-cell RNA sequencing (scRNA-seq) has been
widely applied to provide a detailed molecular characterization of cellular heterogeneity within a population
of cells. Despite recent technological advances in scRNA-seq, the technical variability of gene expression in
scRNA-seq is still much higher than in bulk RNA-seq. Accounting for technical variability is therefore a
prerequisite for correctly analyzing single-cell data. This chapter describes a computational pipeline for
detecting highly variable genes exhibiting higher cell-to-cell variability than expected by technical noise.
The basic pipeline using the scater and scran R/Bioconductor packages includes deconvolution-based
normalization, fitting the mean-variance trend, testing for nonzero biological variability, and visualization
with highly variable genes. An outline of the underlying theory of detecting highly variable genes is also
presented. We illustrate how the pipeline works by using two case studies, one from mouse embryonic stem
cells with external RNA spike-ins, and the other from mouse dentate gyrus cells without spike-ins.
Key words Single-cell RNA-seq, Technical variability, Biological variability, Cell-to-cell variability,
Gene expression noise, Highly variable genes
1 Introduction
Guo-Cheng Yuan (ed.), Computational Methods for Single-Cell Data Analysis, Methods in Molecular Biology, vol. 1935,
https://doi.org/10.1007/978-1-4939-9057-3_3, © Springer Science+Business Media, LLC, part of Springer Nature 2019
26 Beomseok Kim et al.
2 Materials
source("https://bioconductor.org/biocLite.R")
biocLite("scater")
biocLite("scran")
2.2 scRNA-seq Datasets
To demonstrate how the pipeline works with or without the help of
spike-ins, we use two different scRNA-seq datasets: (1) mouse
embryonic stem cells (mESCs) with external RNA spike-ins [15],
and (2) micro-dissected cells from mouse dentate gyrus without
external RNA spike-ins [16].
2.2.1 scRNA-seq Dataset with Spike-Ins
This dataset consists of 704 mESCs cultured in three different
conditions: serum + LIF (three replicates), 2i + LIF (four replicates)
and alternative 2i + LIF (two replicates). All the mESCs passed
cell-level quality control criteria. For each replicate, 96 single cells were
captured with the Fluidigm C1 system and 92 ERCC RNA spike-ins
were added to the cell lysate. The cDNA and Illumina library were
prepared with the SMARTer Kit and the Nextera XT Kit, respectively.
Of four replicates of 2i-cultured mESCs, we will use two
replicates: 2i2 and 2i3. The first replicate (2i2) has poor-quality
spike-ins while the other replicate (2i3) has good-quality spike-ins.
The raw read count table is publicly available at
https://www.ebi.ac.uk/teichmann-srv/espresso.
2.2.2 scRNA-seq Dataset without Spike-Ins
This dataset contains 5454 cells from the developing mouse dentate
gyrus, which were sampled at four postnatal time points (P12, P16,
P24, and P35). Cells were dissociated and captured with the 10x
Genomics Chromium platform on two experimental days: day1 (P12
and P35), and day2 (P16 and P24). Low-quality cells and doublets
were filtered out. The UMI count table and the corresponding
annotation data are publicly available from the Gene Expression
Omnibus (GEO) under accession number GSE95315.
3 Methods
3.1 A Statistical Framework to Account for Technical Variability
Suppose that $x_{ij}$ is a random variable denoting the unknown number
of transcripts (or concentration) of gene i in cell j. The number
of transcripts (or concentration) of gene i in cell j available for
sequencing after the cell lysis, reverse transcription, and cDNA
amplification steps is denoted by $z_{ij}$. We denote by $k_{ij}$ the observed
read or UMI count of gene i in cell j. By the general theorem of
variance decomposition [18], the variance of $k_{ij}$ can be decomposed
into:
$$\mathrm{Var}(k_{ij}) = E[\mathrm{Var}(k_{ij} \mid z_{ij}, x_{ij})] + E[\mathrm{Var}(E[k_{ij} \mid z_{ij}, x_{ij}] \mid x_{ij})] + \mathrm{Var}(E[k_{ij} \mid x_{ij}]).$$
The first term explains the technical variability arising from
sequencing noise, which is usually modeled using a Poisson process
[19]. The second term quantifies the technical variability generated
by stochastic mRNA loss during the single-cell library preparation
steps, which is a major source of technical variability. The last term
quantifies the biological variability. The basic idea of identifying
highly variable genes is to find genes whose observed variance is not
dominated by the first two technical variability terms. In other
words, highly variable genes can be defined as genes showing
significant nonzero biological variance.
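The three-term decomposition above can be checked numerically. The following base R sketch is an illustration only: it assumes a negative binomial model for the true transcript numbers, binomial capture for mRNA loss, and Poisson sequencing noise, and the parameter values (`p_cap`, `s_depth`, and the negative binomial settings) are hypothetical choices that do not come from the chapter.

```r
set.seed(1)
n_cells <- 200000   # large, so empirical moments are close to expectations
p_cap   <- 0.1      # assumed capture probability during library preparation
s_depth <- 2        # assumed sequencing scale factor

x <- rnbinom(n_cells, size = 5, mu = 100)     # true transcript numbers (biological)
z <- rbinom(n_cells, size = x, prob = p_cap)  # molecules surviving library preparation
k <- rpois(n_cells, lambda = s_depth * z)     # observed counts after sequencing

# Analytic value of each term under the assumed model
term_seq  <- s_depth * p_cap * mean(x)                  # E[Var(k | z, x)]
term_loss <- s_depth^2 * p_cap * (1 - p_cap) * mean(x)  # E[Var(E[k | z, x] | x)]
term_bio  <- s_depth^2 * p_cap^2 * var(x)               # Var(E[k | x])
total_from_terms <- term_seq + term_loss + term_bio

# Compare the sum of the three terms with the observed variance of k
print(round(c(sequencing = term_seq, loss = term_loss,
              biological = term_bio, sum = total_from_terms,
              observed = var(k)), 2))
```

In this toy setting the biological term is the largest of the three, which is the situation in which a gene would be called highly variable.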
In principle, the technical variability terms should be estimated
from external RNA spike-ins, since for spike-ins we can eliminate the
biological cell-to-cell variability of $x_{ij}$ from the decomposition
formula. Then, the variance of $k_{ij}$ for spike-ins can be simplified by
the law of total variance:
$$\mathrm{Var}(k_{ij}) = E[\mathrm{Var}(k_{ij} \mid z_{ij})] + \mathrm{Var}(E[k_{ij} \mid z_{ij}]).$$
It should be noted that the above two terms correspond to the
first two technical variability terms if we assume $x_{ij}$ is a fixed and
known quantity, which is a reasonable assumption for spike-ins (see
Note 1). To plug the estimated technical variability of spike-ins
into that of endogenous genes, we assume that the
technical variance of spike-ins is a nonlinear function of their mean
expression levels. By fitting a curve to the mean-variance
(or variance-derived quantities such as the coefficient of variation) data of
spike-ins using a nonlinear regression function, we can estimate the
average technical variance of each endogenous gene at the given
mean expression level. The biological variance of endogenous genes
Analysis of Technical and Biological Noise in scRNA-Seq 29
3.2 Identifying Highly Variable Genes with External RNA Spike-Ins
3.2.1 Data Loading and Normalization
We first load all R and Bioconductor packages we need in this
protocol, and then load the raw read count table for mESCs from
counttable_es.txt.
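# Load the packages used throughout this protocol
library(scater)
library(scran)

# Raw read count table: genes and spike-ins in rows, cells in columns
ct <- as.matrix(read.table("counttable_es.txt",
                           sep = " ",
                           header = TRUE,
                           row.names = 1,
                           check.names = FALSE))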
From the count matrix, we select rows whose names start with
“ENSMUSG” (Ensembl mouse gene ID) or “ERCC” (ERCC spike-in
ID), and columns corresponding to one replicate of the 2i condition
(2i3). The chosen count matrix is used to create a
SingleCellExperiment object from scater which will serve as a data
container compatible with many other Bioconductor packages
including scran. The rows corresponding to spike-ins can be set
using the isSpike function from scran.
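The row and column selection described above can be sketched in base R with regular expressions. The toy matrix below stands in for the real count table; its feature names, the `2i_3` column pattern, and all object names are assumptions made for illustration, and the sketch stops before any SingleCellExperiment object is built.

```r
# Toy count matrix standing in for the real table (all names are hypothetical)
ct_toy <- matrix(1:24, nrow = 6,
                 dimnames = list(c("ENSMUSG00000000001", "ENSMUSG00000000028",
                                   "ERCC-00002", "ERCC-00003",
                                   "__no_feature", "__ambiguous"),
                                 c("ola_mES_2i_3_1.counts", "ola_mES_2i_3_2.counts",
                                   "ola_mES_2i_2_1.counts", "ola_mES_lif_1_1.counts")))

# Keep Ensembl mouse genes and ERCC spike-ins; drop summary rows
keep_rows <- grepl("^(ENSMUSG|ERCC)", rownames(ct_toy))
# Keep only the cells of one replicate (assumed naming pattern for 2i3)
keep_cols <- grepl("^ola_mES_2i_3_", colnames(ct_toy))

sub <- ct_toy[keep_rows, keep_cols, drop = FALSE]

# Logical vector marking which of the retained rows are spike-ins
is_spike <- grepl("^ERCC", rownames(sub))
print(dim(sub))
print(rownames(sub)[is_spike])
```

A logical vector such as `is_spike` is the kind of information that is then registered on the container so that downstream functions can treat spike-ins separately.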
colnames(ct))]
sceset
## class: SingleCellExperiment
## dim: 38653 59
## metadata(0):
## assays(1): counts
## ERCC-00170 ERCC-00171
## rowData names(0):
## ola_mES_2i_3_92.counts ola_mES_2i_3_96.counts
## colData names(0):
## reducedDimNames(0):
## spikeNames(0):
the biological size factors. In contrast, the technical size factors are
computed based on the spike-in counts to adjust for the effects of
technical factors. Since the same amount of spike-in RNA is added to
each cell, the cell-to-cell differences in total RNA content are not
normalized with the technical size factors (see Note 3). Using the
normalize function from scater, we normalize the raw read
counts of endogenous genes with the biological size factors, and
those of spike-ins with the technical size factors. The function
calculates log2-transformed normalized expression values after adding a
pseudo-count of 1; these are stored in logcounts or exprs of
the returned SingleCellExperiment object.
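The arithmetic behind this normalization step can be illustrated with a small base R sketch. It is not the scater implementation: the size factors here are plain library-size factors scaled to a mean of 1, used as a simple stand-in for the deconvolution-based factors, and the toy counts and object names are hypothetical.

```r
set.seed(7)
counts <- matrix(rpois(5 * 4, lambda = 20), nrow = 5,
                 dimnames = list(c("geneA", "geneB", "geneC", "ERCC-1", "ERCC-2"),
                                 paste0("cell", 1:4)))
is_spike <- grepl("^ERCC", rownames(counts))

# Library-size factors scaled to mean 1 (a stand-in, NOT deconvolution)
size_factor <- function(m) {
  lib <- colSums(m)
  lib / mean(lib)
}
sf_bio  <- size_factor(counts[!is_spike, , drop = FALSE])  # from endogenous genes
sf_tech <- size_factor(counts[is_spike, , drop = FALSE])   # from spike-ins only

# Divide each cell (column) by its size factor, then take log2 with pseudo-count 1
logcounts <- counts * 0
logcounts[!is_spike, ] <- log2(sweep(counts[!is_spike, , drop = FALSE], 2, sf_bio, "/") + 1)
logcounts[is_spike, ]  <- log2(sweep(counts[is_spike, , drop = FALSE], 2, sf_tech, "/") + 1)
print(round(logcounts, 2))
```

Keeping two sets of size factors means that endogenous genes and spike-ins are each scaled by the quantity that is expected to be constant for them.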
3.2.2 Detecting Highly Variable Genes
From the log2-transformed normalized expression values, we first
fit a curve to the mean-variance values of spike-ins with the
trendVar function of scran (see Note 4).
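The idea of fitting a mean-variance trend to spike-ins and reading off the technical variance of each endogenous gene can be sketched in base R. This is not the scran implementation: it uses a generic loess fit on simulated summary statistics, and the shape of `true_tech`, the number of genes, and all object names are assumptions made purely for illustration.

```r
set.seed(42)
# Hypothetical spike-in summary statistics on the log2 scale
spike_mean <- sort(runif(92, min = 0, max = 12))
true_tech  <- function(m) 4 * exp(-0.4 * m) + 0.1   # assumed trend shape
spike_var  <- true_tech(spike_mean) * exp(rnorm(92, sd = 0.1))

# Fit a smooth trend to the spike-in mean-variance points;
# surface = "direct" lets predict() evaluate slightly outside the fitted range
fit <- loess(spike_var ~ spike_mean, span = 0.5,
             control = loess.control(surface = "direct"))

# Hypothetical endogenous genes: technical variance plus a positive biological part
gene_mean  <- runif(1000, min = 0, max = 12)
gene_total <- true_tech(gene_mean) + rexp(1000, rate = 2)

# Technical component = trend evaluated at each gene's mean expression
gene_tech <- predict(fit, newdata = data.frame(spike_mean = gene_mean))
# Biological component = total variance minus technical component
gene_bio  <- gene_total - gene_tech
print(summary(gene_bio))
```

Genes whose biological component is clearly above zero are the candidates that the subsequent hypothesis test evaluates.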
## [1] 2109
head(hvg)
## mean total
bio
ic>
4.94614195802085
6.19082343482952
2.60116398672207
2.2966896068407
1.71403288145084
2.89637950173032
## tech p.value
## <numeric> <numeric>
## FDR
## <numeric>
## ENSMUSG00000000001 1.55277820983262e-10
## ENSMUSG00000000028 2.65607694945126e-19
## ENSMUSG00000000131 2.01098169057812e-22
## ENSMUSG00000000171 2.67094826375195e-30
## ENSMUSG00000000278 4.18208059504488e-27
## ENSMUSG00000000295 0.00884495235554632
[Figure 1, panels A and B: Variance of log-expression versus Mean log-expression]
Fig. 1 Mean-variance plots of log2-transformed normalized expression values of endogenous genes (black
points) for 2i3 with good-quality spike-ins (a) and 2i2 with poor-quality spike-ins (b). Each green point
represents a spike-in, and the blue line corresponds to the fitted mean-variance trend based on the spike-ins
(a) or endogenous genes (b). Detected highly variable genes are colored in red
3.3 Identifying Highly Variable Genes without External RNA Spike-Ins
In this section, we demonstrate how the basic pipeline described in
Subheading 3.2 can be modified to detect highly variable genes for
a large-scale scRNA-seq dataset without spike-ins. We first load the
UMI count table for mouse dentate gyrus cells from
GSE95315_10X_expression_data.tab, which is publicly available at
ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE95nnn/GSE95315/suppl/GSE95315_10X_expression_data.tab.gz.
The cell metadata containing the batch (experimental days) and cell
type for each cell is loaded from GSE95315_series_matrix.
library(scater)
library(scran)
library(ggplot2)

# UMI count table: genes in rows, cells in columns
ct <- as.matrix(read.table("GSE95315_10X_expression_data.tab",
                           sep = "\t",
                           header = TRUE,
                           row.names = 1,
                           check.names = FALSE))

# Cell metadata from the GEO series matrix; the first 30 lines are skipped
strs <- readLines("GSE95315_series_matrix.txt.gz")
dat <- read.csv(text = strs, sep = "\t",
                header = TRUE,
                check.names = FALSE,
                skip = 30,
                nrows = length(strs) - 3)
## [1] 22
## class: SingleCellExperiment
## metadata(0):
## assays(1): counts
## rowData names(0):
## reducedDimNames(0):
## spikeNames(0):
After testing whether the estimated biological variance is equal
to 0, we define highly variable genes as those with FDR ≤ 0.05 and
biological variance ≥ 0.1.
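The selection rule can be illustrated on a small made-up table in base R. The sketch assumes the thresholds are read as an FDR of at most 0.05 and a biological variance of at least 0.1, and it uses the Benjamini-Hochberg adjustment from `p.adjust`; the table `var_tab` and its values are hypothetical.

```r
# Hypothetical per-gene results: biological variance estimate and raw p-value
var_tab <- data.frame(
  gene    = paste0("gene", 1:6),
  bio     = c(0.50, 0.05, 0.30, -0.10, 0.12, 0.80),
  p.value = c(1e-6, 1e-8, 0.2, 0.5, 0.001, 1e-10),
  stringsAsFactors = FALSE
)

# Benjamini-Hochberg adjustment gives the false discovery rate per gene
var_tab$FDR <- p.adjust(var_tab$p.value, method = "BH")

# Keep genes passing both thresholds, then sort by biological variance
hvg_toy <- var_tab[var_tab$FDR <= 0.05 & var_tab$bio >= 0.1, ]
hvg_toy <- hvg_toy[order(hvg_toy$bio, decreasing = TRUE), ]
print(hvg_toy)
```

Requiring a minimum biological variance in addition to significance removes genes that are statistically detectable but contribute little cell-to-cell variability.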
nrow(hvg)
## [1] 312
head(hvg)
0.22438946719247
0.158613308891032
0.132051146817701
0.136031205136787
1.2317340134449e-53
9.65972247799617e-34
2.45235829855e-22
3.49014478621564e-19
6.80681390145156e-21
## Acsbg1 0.198762550090734 0 0
[Figure 2: Variance of log-expression versus Mean log-expression]
Fig. 2 A mean-variance plot of log2-transformed normalized expression values of genes in dentate gyrus cells.
The blue line represents the fitted mean-variance trend, and highly variable genes are marked in red
= 123456)
d = 123456)
Figure 3 shows the t-SNE plots using all genes (Fig. 3a, c) or
highly variable genes (Fig. 3b, d, see Note 7). Cell types inferred
from [16] are overlaid in Fig. 3a, b. Two experimental days
(“43_1” and “46_1”) are also overlaid in Fig. 3c, d to look for
batch effects caused by experimental days.
Fig. 3 t-SNE plots of dentate gyrus cells using all genes (a, c) or highly variable genes (b, d). Cells are
clustered by the annotated cell types (a, b), and partially by the experimental days (c, d)
4 Notes
Acknowledgments
References
1. Tang F, Barbacioru C, Wang Y, Nordman E, Lee C, Xu N, Wang X, Bodeau J, Tuch BB, Siddiqui A, Lao K, Surani MA (2009) mRNA-Seq whole-transcriptome analysis of a single cell. Nat Methods 6(5):377–382. https://doi.org/10.1038/nmeth.1315
2. Tanay A, Regev A (2017) Scaling single-cell genomics from phenomenology to mechanism. Nature 541(7637):331–338. https://doi.org/10.1038/nature21350
3. Kolodziejczyk AA, Kim JK, Svensson V, Marioni JC, Teichmann SA (2015) The technology and biology of single-cell RNA sequencing. Mol Cell 58(4):610–620. https://doi.org/10.1016/j.molcel.2015.04.005
4. Brennecke P, Anders S, Kim JK, Kolodziejczyk AA, Zhang X, Proserpio V, Baying B, Benes V, Teichmann SA, Marioni JC, Heisler MG (2013) Accounting for technical noise in single-cell RNA-seq experiments. Nat Methods 10(11):1093–1095. https://doi.org/10.1038/nmeth.2645
5. Kivioja T, Vaharautio A, Karlsson K, Bonke M, Enge M, Linnarsson S, Taipale J (2011) Counting absolute numbers of molecules using unique molecular identifiers. Nat Methods 9(1):72–74. https://doi.org/10.1038/nmeth.1778
6. Islam S, Zeisel A, Joost S, La Manno G, Zajac P, Kasper M, Lonnerberg P, Linnarsson S (2014) Quantitative single-cell RNA-seq with unique molecular identifiers. Nat Methods 11(2):163–166. https://doi.org/10.1038/nmeth.2772
7. Macosko EZ, Basu A, Satija R, Nemesh J, Shekhar K, Goldman M, Tirosh I, Bialas AR, Kamitaki N, Martersteck EM, Trombetta JJ, Weitz DA, Sanes JR, Shalek AK, Regev A, McCarroll SA (2015) Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell 161(5):1202–1214. https://doi.org/10.1016/j.cell.2015.05.002
8. Klein AM, Mazutis L, Akartuna I, Tallapragada N, Veres A, Li V, Peshkin L, Weitz DA, Kirschner MW (2015) Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell 161(5):1187–1201. https://doi.org/10.1016/j.cell.2015.04.044
9. Kim JK, Marioni JC (2013) Inferring the kinetics of stochastic gene expression from single-cell RNA-sequencing data. Genome Biol 14(1):R7. https://doi.org/10.1186/gb-2013-14-1-r7
10. Stegle O, Teichmann SA, Marioni JC (2015) Computational and analytical challenges in single-cell transcriptomics. Nat Rev Genet 16(3):133–145. https://doi.org/10.1038/nrg3833
11. Ilicic T, Kim JK, Kolodziejczyk AA, Bagger FO, McCarthy DJ, Marioni JC, Teichmann SA (2016) Classification of low quality cells from single-cell RNA-seq data. Genome Biol 17:29. https://doi.org/10.1186/s13059-016-0888-1
12. Trapnell C, Cacchiarelli D, Grimsby J, Pokharel P, Li S, Morse M, Lennon NJ, Livak KJ, Mikkelsen TS, Rinn JL (2014) The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat Biotechnol 32(4):381–386. https://doi.org/10.1038/nbt.2859
13. McCarthy DJ, Campbell KR, Lun ATL, Wills QF (2017) Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R. Bioinformatics 33(8):1179–1186. https://doi.org/10.1093/bioinformatics/btw777
14. Lun ATL, McCarthy DJ, Marioni JC (2016) A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Res 5:2122. https://doi.org/10.12688/f1000research.9501.2
15. Kolodziejczyk AA, Kim JK, Tsang JC, Ilicic T, Henriksson J, Natarajan KN, Tuck AC, Gao X, Buhler M, Liu P, Marioni JC, Teichmann SA (2015) Single cell RNA-sequencing of pluripotent states unlocks modular transcriptional variation. Cell Stem Cell 17(4):471–485. https://doi.org/10.1016/j.stem.2015.09.011
16. Hochgerner H, Zeisel A, Lonnerberg P, Linnarsson S (2018) Conserved properties of dentate gyrus neurogenesis across postnatal development revealed by single-cell RNA sequencing. Nat Neurosci 21(2):290–299. https://doi.org/10.1038/s41593-017-0056-2
17. Kim JK, Kolodziejczyk AA, Ilicic T, Teichmann SA, Marioni JC (2015) Characterizing noise structure in single-cell RNA-seq distinguishes genuine from technical stochastic allelic expression. Nat Commun 6:8687. https://doi.org/10.1038/ncomms9687
18. Bowsher CG, Swain PS (2012) Identifying sources of variation and the flow of information in biochemical networks. Proc Natl Acad Sci U S A 109(20):E1320–E1328. https://doi.org/10.1073/pnas.1119407109
19. Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y (2008) RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res 18(9):1509–1517. https://doi.org/10.1101/gr.079558.108
20. Lun ATL, Bach K, Marioni JC (2016) Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol 17:75. https://doi.org/10.1186/s13059-016-0947-7
21. Van der Maaten L, Hinton GE (2008) Visualizing Data using t-SNE. J Mach Learn Res 9:2579–2605
Chapter 4
Abstract
Unprecedented technological advances in single-cell RNA-sequencing (scRNA-seq) technology have now
made it possible to profile genome-wide expression in single cells at low cost and high throughput. There is
substantial ongoing effort to use scRNA-seq measurements to identify the “cell types” that form compo-
nents of a complex tissue, akin to taxonomizing species in ecology. Cell type classification from scRNA-seq
data involves the application of computational tools rooted in dimensionality reduction and clustering, and
statistical analysis to identify molecular signatures that are unique to each type. As datasets continue to grow
in size and complexity, computational challenges abound, requiring analytical methods to be scalable,
flexible, and robust. Moreover, careful consideration needs to be paid to experimental biases and statistical
challenges that are unique to these measurements to avoid artifacts. This chapter introduces these topics in
the context of cell-type identification, and outlines an instructive step-by-step example bioinformatic
pipeline for researchers entering this field.
1 Introduction
Guo-Cheng Yuan (ed.), Computational Methods for Single-Cell Data Analysis, Methods in Molecular Biology, vol. 1935,
https://doi.org/10.1007/978-1-4939-9057-3_4, © Springer Science+Business Media, LLC, part of Springer Nature 2019
46 Karthik Shekhar and Vilas Menon
1.1 What Is a Cell Type?
While every cell is unique, the experience of biologists over many years
has suggested that cells can be organized into groups based on
shared features that are quantifiable. This categorization makes
possible systematic and reproducible analyses of complex tissues,
similar to the concept of “species,” which greatly simplifies the
diversity of organisms into an interpretable taxonomy, while not
denying the individuality of any single member [18]. Features used
to define cell types include lineage, location, morphology, activity,
interactions with other cell types, epigenetic state, responsiveness to
certain signals, and molecular composition (including mRNA and
protein levels) [16].
scRNA-seq-based cell classification involves partitioning the
data into “clusters” of single cells, wherein each cluster is defined
by a unique gene expression “signature” relative to other clusters,
and therefore, represents a putative cell type. It must be noted,
however, that a computationally defined cluster may not necessarily
correspond 1:1 to a cell type, as the molecular state of the cell
assayed by scRNA-seq may not necessarily reflect all of the features
Identification of Cell Types from Single-Cell Transcriptomic Data 47
1.2 A Brief Overview of scRNA-Seq
scRNA-seq is not a single method, but a suite of protocols, each
with its strengths and limitations [20]. Currently, every scRNA-seq
protocol consists of three steps (Fig. 1): (1) single-cell capture and
barcoding, (2) library preparation, and (3) sequencing. Current
protocols isolate single cells by tissue dissociation, followed by
either fluorescence-activated cell sorting (FACS) into separate
wells on a plate or capturing individual cells in microfluidic
chambers, microwells, or individual droplets. Prior to single-cell capture,
the dissociated cells can be optionally taken through a sorting step
using FACS or magnetic-activated cell sorting (MACS) to enrich or
deplete cells expressing a specific combination of markers. Library
preparation involves reverse transcribing mRNA into cDNA and
amplifying, either using polymerase chain reaction (PCR) or
in vitro transcription (IVT). Recently developed protocols tag
transcripts during the capture stage (step 1 above) with unique
molecular identifiers (UMIs), which are random nucleotide sequences
[21]. Every captured transcript is, in principle, tagged with a
distinct UMI, which enables downstream correction of amplification
biases. The amplified cDNA is then fragmented, followed by the
addition of molecular adapters at the end of amplicon fragments
that allow for high-throughput sequencing. Libraries can either
retain the full length of every transcript or tag either the 3′ or the
5′ end of each mRNA; the choice is informed by further
considerations. Sequencing is generally highly multiplexed, and can either
be single-end or paired-end depending on upstream choices. An
important consideration can be the depth of sequencing per cell,
which is often related to the number of cells profiled [22].
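How UMIs allow amplification biases to be corrected can be shown with a toy example in base R. The sketch assumes that reads have already been assigned to a cell barcode, a transcript, and a UMI, and it ignores sequencing errors in the UMI itself; the data frame `reads` and its contents are made up for illustration.

```r
# Hypothetical aligned reads: one row per read
reads <- data.frame(
  cell = c("ACCGT", "ACCGT", "ACCGT", "CGGTT", "CGGTT", "CGGTT", "CGGTT"),
  gene = c("Transcript1", "Transcript1", "Transcript2",
           "Transcript1", "Transcript1", "Transcript1", "Transcript3"),
  umi  = c("AAGT", "AAGT", "CCTA", "GGAT", "GGAT", "TTCA", "ACGT"),
  stringsAsFactors = FALSE
)

# Read counts: amplification duplicates are counted repeatedly
read_counts <- table(reads$gene, reads$cell)

# UMI counts: reads sharing cell, gene, and UMI are treated as one molecule
molecules  <- unique(reads)
umi_counts <- table(molecules$gene, molecules$cell)

print(read_counts)
print(umi_counts)
```

For Transcript1 in cell ACCGT, two reads carry the same UMI and therefore collapse to a single molecule in the UMI count table.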
[Figure 1 schematic: Cells → Isolation (wells, or droplets with oligonucleotide barcodes) → Lysis & Reverse transcription → Pooling & Amplification → Next-Gen Sequencing → Alignment, Demultiplexing, & Quantification → table of transcript counts per cell barcode]
Fig. 1 General experimental workflow for single-cell RNA-sequencing, as described in detail in the text,
starting from cell isolation and extending through to the generation of counts tables showing the detection of
each gene in each cell
1.3 Batch Effects in scRNA-Seq Analysis
Data-driven identification of cell types can be confounded by batch
effects, which result from minor but systematic differences
between experimental replicates prepared at different times,
using different reagent batches, by different experimenters, or a
combination of the three [23]. Batch effects can result in variation in the
transcriptomic state of identical cell types across different replicates
due to technical factors; when such effects are strong, cells can
cluster by batch identity rather than biological identity. Batch
effects can also arise if, in addition to transcriptional differences,
the frequencies of specific cell types differ across batches
[24, 25]. If different biological conditions of interest (e.g.,
control vs. perturbation) or different sample sources (e.g., biopsies
from cancer patients) are processed in different batches, it is
statistically impossible to deconvolve biological versus technical effects.
Batch effects can be mitigated through careful experimental
design involving an even distribution of different biological
conditions across experimental batches (“block design”), although this
may not always be logistically feasible if delays in sample processing
can compromise quality. In such circumstances, cell-types and
2 Methods
[Figure 2 schematic: expression table (Transcripts × Cells) → Selection of genes by distribution and variance of expression → Clustering of cells along feature axes → Differential Gene Expression between clusters]
Fig. 2 Standard computational workflow for identifying transcriptomic types from single-cell RNA-sequencing
data, as described in detail in the text, starting from counts tables through to cluster assignment and
differential gene expression identification. Although not every computational approach incorporates all of
these steps in this order, most involve variations on this set of procedures
[Figure 3: violin plots of nGene (left) and nUMI (right) per cell for each sample]
Fig. 3 Sample-wise (x-axis) distribution of the number of genes per cell (Left, y-axis) and number of UMIs (i.e.,
transcripts) per cell (Right, y-axis) depicted as violin plots. Dots represent individual cells
[Figure 4: scatter plot of CV (counts) versus Mean Counts, one labeled dot per gene]
Fig. 4 Mean (x-axis) vs. Coefficient of variation (CV, y-axis) of genes (dots). Two
null-models of mean-CV relationship—Poisson (dashed-red line) or the Poisson-
Gamma mixture model—are also plotted
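The two null models named in the caption imply simple mean-CV relationships that can be checked by simulation. In the base R sketch below, the Poisson model gives CV = 1/sqrt(mean), and the Poisson-Gamma mixture (negative binomial) gives CV² = 1/mean + 1/size; the gamma shape `size` and the set of means are hypothetical values, not parameters taken from the chapter.

```r
set.seed(3)
mu      <- c(0.5, 2, 10, 50)   # mean counts to evaluate
n_cells <- 50000
size    <- 2                   # hypothetical gamma shape (inverse dispersion)

cv <- function(v) sd(v) / mean(v)

# Empirical CVs from simulated counts
cv_pois_emp <- sapply(mu, function(m) cv(rpois(n_cells, lambda = m)))
cv_nb_emp   <- sapply(mu, function(m) cv(rnbinom(n_cells, size = size, mu = m)))

# Theoretical curves for the two null models
cv_pois_theory <- 1 / sqrt(mu)
cv_nb_theory   <- sqrt(1 / mu + 1 / size)

print(round(rbind(mu, cv_pois_emp, cv_pois_theory, cv_nb_emp, cv_nb_theory), 3))
```

Under the Poisson model the CV keeps falling as the mean grows, whereas under the mixture model it levels off at 1/sqrt(size), so genes lying well above either curve are candidates for biologically driven variability.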
2.4 Z-Score the Data and Remove Unwanted Sources of Variation Using Linear Regression
1. Variation in scRNA-seq data that is relevant to cell identity can
be masked by many unwanted sources of variation. A common
challenge is batch effects, which can be reflected in both
transcriptomic differences and cell-type compositional differences
between equivalent experimental batches. As mentioned earlier,
variations in lysis efficiency, mRNA capture, and amplification
can result in substantial differences between the
transcriptomes of equivalent cells. There can be additional
sources of variation resulting from biological processes such
as cell cycle, response to dissociation, stress, and apoptosis that
might dominate the measured transcriptomic state of the cell.
Correcting for such effects continues to be an active area of
research, and many sophisticated approaches have been recently
introduced [24, 25], but a comprehensive overview is beyond our
scope. Here, for demonstrative purposes, we remove variation in
gene expression that is highly correlated with library size nUMI.
Seurat performs a linear fit to the expression level of every gene
using nUMI as a predictor, and returns the residuals as the
“corrected” expression values. Next, the expression values are z-scored
or standardized along every gene,
56 Karthik Shekhar and Vilas Menon
$$E_{ij} \leftarrow \frac{E_{ij} - \bar{E}_i}{\sigma_i}$$
Here $E_{ij}$ is the corrected expression value of gene $i$ in cell $j$, and $\bar{E}_i$ and $\sigma_i$ are the mean and the standard deviation of gene $i$'s expression across all cells. After this transformation, every gene has zero mean and a standard deviation of 1 across cells.
2. Removing the effects of nUMI and z-scoring are performed
together using Seurat’s function ScaleData, which then
stores the transformed gene expression values in the slot
snd@scale.data.
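The two operations described in this subheading (regressing out nUMI, then z-scoring each gene) can be illustrated with a minimal, library-free sketch. This is not Seurat's ScaleData implementation. The function names and the toy numbers are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption.

```python
import math

def regress_out(y, x):
    """Residuals of an ordinary least-squares fit y ~ a + b * x."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = my - slope * mx
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

def z_score(values):
    """Center to zero mean and scale to unit sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    if sd == 0:
        return [0.0] * n  # a constant gene carries no variation
    return [(v - mean) / sd for v in values]

def scale_gene(expr, n_umi):
    """Regress one gene's expression on library size, then z-score the residuals."""
    return z_score(regress_out(expr, n_umi))

# Toy data (hypothetical): one gene measured in five cells.
n_umi = [1000.0, 2000.0, 3000.0, 4000.0, 5000.0]
gene = [1.2, 1.9, 3.4, 3.8, 5.1]
scaled = scale_gene(gene, n_umi)
print([round(v, 3) for v in scaled])
```

In a real analysis the same two steps would be applied to every gene (row) of the expression matrix; the scaled values of each gene are uncorrelated with nUMI by construction.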
2.6 Visualize PCA Output
1. Seurat allows multiple ways to visualize the PCA output, and these are useful to gain biological intuition. VizPCA shows the genes with the highest absolute loadings along any number of user-specified PCs (Fig. 5).
3. Figures 5 and 6 show that the cells with high values of PC1 are oligodendrocytes, characterized by the high loadings of characteristic genes such as Proteolipid Protein 1 (PLP1) and Myelin Basic Protein (MBP). Next, PCHeatmap allows for
Fig. 5 Genes (y-axis) with the highest negative and positive loadings (x-axis) for the top two principal components, PC1 and PC2
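The idea behind a loading plot such as Fig. 5 (listing the genes with the most negative and most positive loadings on a PC) can be sketched in a few lines. This is an illustration only and does not reproduce VizPCA. The function name, gene names, and loading values below are all hypothetical.

```python
def top_loading_genes(loadings, n=5):
    """Return (most negative, most positive) genes for one principal component.

    `loadings` maps gene name -> loading on the PC.
    """
    ranked = sorted(loadings.items(), key=lambda kv: kv[1])  # ascending
    negative = [gene for gene, value in ranked[:n] if value < 0]
    positive = [gene for gene, value in reversed(ranked[-n:]) if value > 0]
    return negative, positive

# Hypothetical loadings for six placeholder genes.
pc_loadings = {
    "GENE_A": 0.12, "GENE_B": 0.10, "GENE_C": 0.08,
    "GENE_D": -0.05, "GENE_E": -0.09, "GENE_F": -0.11,
}
neg, pos = top_loading_genes(pc_loadings, n=2)
print("negative:", neg)
print("positive:", pos)
```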
Fig. 6 Scatter plot showing the scores of individual cells (points) along the top two principal components, PC1 (x-axis) and PC2 (y-axis). Legend entries: Cerebellum, FrontalCortex, VisualCortex
Fig. 7 Heatmaps showing expression of top 15 positive and negative loading genes in individual cells along PC1–PC12
[Fig. 8: standard deviation of each PC (y-axis) plotted against PC index (x-axis, 0 to 50)]
2.7 Identify Clusters
1. We choose 25 PCs based on Fig. 8. Every cell in the data is thus reduced from ~23,000 genes to 25 PCs (a ~1000-fold reduction in dimensionality). Next, we determine subpopulations in this data by graph-based clustering [48] using the Seurat FindClusters function. Graph clustering has been widely used in recent scRNA-seq papers and has many desirable properties compared to other methods such as k-means clustering, hierarchical clustering, and density-based clustering. Here, we first build a k-nearest neighbor graph on the data, connecting each cell to its k nearest neighbor cells based on transcriptional similarity. The nearest neighbors are determined based on proximity in PC space using a Euclidean distance metric. Next, similar to the strategy employed in Levine et al. [49] and Shekhar et al. [13], the graph edge weights are refined based on the Jaccard similarity metric, which removes spurious edges between clusters. FindClusters implements an algorithm that determines clusters that maximize a mathematical
Identification of Cell Types from Single-Cell Transcriptomic Data 61
Fig. 9 Visualization of Lake et al. data using t-distributed stochastic neighbor embedding (t-SNE; axes tSNE_1 and tSNE_2). Cells are colored according to their cluster membership
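The graph construction described in Subheading 2.7 (a k-nearest neighbor graph in PC space whose edges are reweighted by Jaccard similarity) can be sketched as follows. This is a minimal illustration under stated assumptions, not the FindClusters implementation. The function names, the choice to include each cell in its own neighborhood, the pruning threshold of 0.5, and the toy coordinates are all hypothetical.

```python
import math

def knn_indices(points, k):
    """For each point, the set of indices of its k nearest neighbors (Euclidean)."""
    neighbors = []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbors.append({j for _, j in dists[:k]})
    return neighbors

def jaccard_graph(points, k):
    """kNN edges weighted by the Jaccard overlap of the two cells' neighborhoods."""
    nbrs = knn_indices(points, k)
    hoods = [nb | {i} for i, nb in enumerate(nbrs)]  # neighborhood includes the cell itself
    edges = {}
    for i, nb in enumerate(nbrs):
        for j in nb:
            shared = len(hoods[i] & hoods[j])
            total = len(hoods[i] | hoods[j])
            edges[(min(i, j), max(i, j))] = shared / total
    return edges

def prune(edges, min_weight):
    """Drop weakly supported edges (typically those bridging distinct groups)."""
    return {edge: w for edge, w in edges.items() if w >= min_weight}

# Toy "PC space" with two well-separated groups of three cells each.
cells = [(0.0, 0.0), (0.0, 1.0), (1.2, 0.0),
         (6.0, 5.0), (6.0, 6.1), (7.3, 5.0)]
edges = jaccard_graph(cells, k=3)
kept = prune(edges, min_weight=0.5)
for edge, weight in sorted(edges.items()):
    print(edge, round(weight, 3), "kept" if edge in kept else "pruned")
```

With k = 3 every cell is forced to pick one neighbor from the other group, but those cross-group edges share few neighbors and receive a low Jaccard weight, so thresholding removes them while within-group edges survive. A community detection step would then be run on the weighted graph.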
Fig. 10 Dendrogram showing transcriptional relationships between clusters (nodes)
2.8 Compare Clusters with Original Cell Type Labels from Lake et al. [39]
1. Here, we see that the maximum "out-of-bag classification error" (OOBE) is less than our threshold. Thus, we retain all 26 clusters. Next, we compare our clustering result to the cluster labels published in Lake et al. [39], which nominated 33 clusters in their analysis. Although we have fewer clusters, it is informative to examine how they compare to Lake et al.'s results. We first read in their cluster labels.
Fig. 11 Transcriptional correspondence between clusters determined from the Lake et al. dataset in this study ("Predicted", columns 1 to 26) and in the original study ("Known", rows). Circles depict the percentage of cells of a given Lake et al. cluster (row) assigned to a cluster determined above (column)
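The quantity plotted in Fig. 11 (for each original label, the percentage of its cells that fall in each new cluster) is a row-normalized contingency table. A minimal sketch is given below. The function name is hypothetical, and the per-cell assignments are invented toy values that only borrow a few label names for readability; they are not results from the dataset.

```python
from collections import Counter, defaultdict

def percent_assigned(known, predicted):
    """Row-normalized contingency table.

    Returns {known_label: {predicted_cluster: percent of that label's cells}}.
    """
    counts = defaultdict(Counter)
    for k, p in zip(known, predicted):
        counts[k][p] += 1
    table = {}
    for k, row in counts.items():
        total = sum(row.values())
        table[k] = {p: 100.0 * c / total for p, c in row.items()}
    return table

# Toy per-cell labels (hypothetical assignments).
known_labels = ["Oli", "Oli", "Oli", "Ast", "Ast", "Ex1", "Ex1", "Ex1", "Ex1"]
new_clusters = [1, 1, 1, 2, 2, 3, 3, 3, 4]
table = percent_assigned(known_labels, new_clusters)
for label, row in table.items():
    print(label, row)
```

Each row sums to 100, so a row dominated by a single column indicates an original cluster that maps cleanly onto one new cluster, while a spread-out row indicates a split.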
Fig. 12 Cluster composition of each brain region (rows: Cerebellum, FrontalCortex, VisualCortex). Circles indicate the proportion of each cluster (columns) within each region (row). Each row sums to 1
Fig. 13 Cluster-wise expression (x-axis: cluster identity) of a pan-excitatory neuronal marker, SLC17A7 (top), and of markers specific to cluster 12, HTR2C (middle) and NPSR1-AS1 (bottom)
2.11 Compare with Mouse Cortical Cell Types
1. One of the many challenges in cell type classification studies is that of aligning clusters across different datasets, which might include different batches, different conditions (e.g., normal vs. disease), or even different species. Here we attempt to map clusters from a dataset of visual cortex (VC) neurons isolated and profiled from adult mouse using the Smart-seq method [15] to our human CB, VC, and FC clusters using a supervised learning algorithm. We use a multiclass classification approach described previously [13].
First, we read in the mouse VC data comprising 1679 cells and create a Seurat S4 object. To match the gene IDs to the human data, we capitalize all gene names; note that a more exact, albeit lengthier, approach would be to match genes based on an appropriate orthology database. We also read in the cluster assignments of each cell. Tasic et al. identified 49 transcriptomic types, comprising 23 inhibitory, 19 excitatory, and 7 non-neuronal types [15]. We next select features to train our classifier. We identify variable genes using Seurat's FindVariableGenes function (Fig. 14), which is more appropriate for Smart-seq data [40]. After expanding the set of variable genes in the snRNA-seq data using NB.var.genes, we compute the common variable genes to train a multiclass classifier.
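The gene-matching step described above (capitalizing mouse gene names and intersecting the two variable-gene sets) can be sketched as follows. The function name and the short gene lists are hypothetical examples; a real analysis would pass in the full variable-gene vectors from both datasets.

```python
def common_variable_genes(mouse_var_genes, human_var_genes):
    """Upper-case mouse symbols and intersect them with the human variable genes.

    Capitalization is only a rough proxy for orthology: orthologs whose
    symbols differ between species are silently missed.
    """
    mouse_upper = {gene.upper() for gene in mouse_var_genes}
    return sorted(mouse_upper & set(human_var_genes))

# Hypothetical variable-gene lists for illustration.
mouse_genes = ["Plp1", "Mbp", "Gad1", "Ly6a"]
human_genes = ["PLP1", "MBP", "GAD1", "SNAP25"]
features = common_variable_genes(mouse_genes, human_genes)
print(features)
```

The resulting shared gene list would serve as the feature set for the classifier, so that the training and test matrices have identical columns.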
[Fig. 14: variable-gene plot for the mouse data with individual genes labeled; y-axis: Dispersion]
CORO1A
9530091C08RIK
SUN1
FAT3
AGPAT4
PRKCC
GOSR2
IPCEF1
GANAB
G6PDX
CREB1
USP28 GM3893
CASD1
WDR6
GNPDA1 IARS
NR3C1
USP29
NAMPT
OGFRL1 CLK4
EPHA7
ZDHHC20 CPLX2
CAR10
NR1D1
SLC1A1 NAV1BTBD10
OAT
COPB1
BTBD3
IPW
IP6K1
RAB27B
MKLN1 LIN7A
ITM2B
HERPUD1
RBM39 GAA
SLC20A1
PSD3
AHCYL1
ZFP871
CLDN25
MT_AK153847 GAP43
HPCA
PCDH9
PCSK2
MT_AK140457
CACNG3
PRMT2
SLC8A1
ATP8A1
GPRASP2
NCOA4
HPCAL4
SESTD1 LGI1 OGT
CLK1
GRIA3
RASGRF1
NRAS
SCN1A
NAP1L5 VMP1
AKAP5 ATP2B1
MT_AK159262
HSPA5 PAK1H3F3B
MT_AK165865
FAM175A
GPR1
PCDHB11
SERTAD3
MAN2B2
GM5475
DNAHC9
KCNK10
PPAP2C
PLEKHA7UNK
DENND1A
GM16973
SLC10A7
2700081O15RIK
GTPBP1
MAML1
HCRTR1TOPORS
ZFP212
FHL4
APEX2
GSG1L
ACCS
EVI2B
SOX8
INTS5
PTPN21
SYDE1HPS4
TXNDC5
BC026590
VPS37C
ZFP454
COL4A3BP
4930453N24RIK
MFSD7B
FAM35A
HECTD2
LGR5
N4BP2 CRKL
PRDM4
MAML3
MATN2
E130309D14RIK
DSC2
AKAP10
PLD2TELO2
SMCR8
SLC16A14
RPS6KA4
VAC14
MED9
MRO
MCM8
RFX5
MTHFR
NGB
IFIH1
MTMR14
8430406I07RIK
1190007F08RIK
SDR42E1
CCNK
KLHL14
MUTYH
ZFP773
FAM184B
ISOC1
SEC24D
VMAC
AP1G2
BTN2A2
IFI30
NFXL1
NINJ1
THBS3
DDX31
GSTM6
ZKSCAN4
KLHL15
RFC5
PHACTR4
TMEM8
NMBR
GM10767
LEPREL4
ZFP652
SHKBP1
DHFR
ANGEL1
MOXD1
ZSCAN12
MAMDC2
1300014I06RIK CDH6
2700049A03RIK
AMBRA1
PVRL3
CRYZ
A930013F10RIK
RACGAP1
TMEM163
IL17RA BMP3
PTPRT
PIP5K1B
ATL3
WRN
CHRM2ELL
TMEM159
PRKAA1
PKD2
PARP11
KDM6A
SYTL5
DPP4
TWSG1
LRRC1
FAM160A1SLC35D1
RSPRY1
FAM76B
SPECC1L
RPUSD4
DIS3
TMEM63C
PPP1R12C
GPC6
MTR DNAJC3
NRF1
MON2
TK2
FAM118A
LMF1
CDKAL1
FNDC4
AGFG1
FAIM
NUDT22
ZSCAN18
PPP4R1L−PS
1700020O03RIK
SMURF2
4933400F21RIK
GPR126
ARL13B
GAL3ST1
GINS3
E230029C05RIK
4430402I18RIK FLCN
CHST12
FAM116A
NUP62
NR3C2
3110057O12RIK
AHCTF1
EIF3F
PMS1
CNTLN
NEK6
ZKSCAN3
SPAG5
SLC19A2
ZBTB34ANO10
TPPP3
IPP
RFFL
NR2C1
GALNS
STK24
MYO9B
1810010H24RIK
NKAIN1
GRTP1
RAB34
GPR151YES1
SMUG1
HHATL
TTC30A1
SLAIN2
1110008L16RIK
PYCR2
SLC16A2
DZIP1L
ZCWPW1
D19WSU162E
LRRC51
RAVER2
TMEM88
PTBP1
DUSP16
CREB3L1
MYLK3
TACR3
GRIN3A
CNTNAP5B
IL15RA
PGAM2
NFATC3
TRIM65
ST6GALNAC4
NR6A1
PCDHB22
BC065397
SESN3
EPHX3
BC027231
OTOF
LRRC61
FAM188B
MAP3K6
ZFP862
NUDT15
HIST2H2AA1
ARHGAP27
5930438M14
SOWAHC
TRABD
9130019O22RIK
ZBTB7B
CC2D1B
MMGT2 ACP6
KCNS2 GNG7
ESRRG
RINT1
ZFP426
ZFP956
LSM14A
MBTPS1
ZFP654
MMACHC
1110038D17RIK
ZFP770
CMYA5
MAPKAP1
ANKRD23 IFT122
PARVA
MBTD1
GTPBP2
CD2BP2
OTUB2
SLC25A40
BOC
CC2D2A
NEK4
ABCB6
METTL15
GM13251
PDE7A PNPLA6
ERI1
PIK3R4
LANCL3
1700052N19RIK
D1ERTD622E ACTN4
HADHA
LMBR1
PCYT1A
FAM3A
MFAP3L
RTN4IP1
SLC35A2
CLPB
HOMER2
ULK1
ZFP532
ALKBH1
TMEM64
FIG4
PRKAR2A
ZFP583
PDCD7
SLC25A24 PHKA2
LAMC2
1110034G24RIK
DLX1AS
PAK2
LRCH1
TNS3 SIL1
VEZF1
KDM2A
EPC2
WDR11
POLR3A
MLXIP
SPTY2D1
GEMIN8
NAA40ZFP97
RAMP3
PKN2
RSBN1
CSAD
2210404J11RIK
RFXANK
ZFP442
KCTD13
ZFP629
ERRFI1FASN
USP3
LRBA
DPAGT1
CABLES2
IQCG
KLHL26
ANKRD50
ZFP449
CRYBG3
ZFP334
RARB
CCDC157
DCAF15
PRDM5
HMBOX1
FUT11 CLP1
TTI2
SLC7A5
HS3ST4TLN2
EIF2C3
NT5C2
1300001I01RIK
CASP8AP2
CPSF3L
DPH1
UBAC2
NUP133
CDKL4
TMEM161A
6430584L05TSPYL3
RCAN3
AGPAT3 SNN
SUPT7L
1600012H06RIK
RYR3
EEPD1
ARNTL
GAS8
CD320 NAA30
ZDBF2
USP53
DEAF1
ASAP2
LCA5
E2F6 ABCA2
DIABLO
VPS4B
PLEKHA6
TMED8
NAA16
SPECC1
DOLK
EXT1
3110047P20RIK
ZCCHC16
TRDMT1 CCNG1
SRSF6
ZFP397
ABHD13
POLR1B
ALDH18A1
CERK
AP4B1SOS2
FEM1A
DDX27
FAM78B ZFP790
CPSF1
KLF12
INTS9
EML4
PIK3C2A
PITRM1
GNPAT
LTV1
IKBKB
THTPA
PEX5
LRRCC1
AHCYL2
ERMP1
DIS3L
CHD1
BC031353
KLHDC8B
ZFP786
ZFP354B NDST3
4932415G12RIK
C030048B08RIK
GNE
DIAP2
RBM41
9530077C05RIK
A130010J15RIKHNRPLL
DKKL1
SPRYD4
IGDCC4 HCCS
CHML
FRMD5
FAM40A
NUP54
KCNB2
TMEM88B
ZFP112
2610002I17RIK
CARS2 AMY1
TRIM62DR1
SLC8A3RPA2
FBXL20
FAM171A1
TFDP2
GBE1
INF2
TBCD
SLC35B3
NUP107
ZFP597
SPAG6
BC052040
RDH10
ENHO
WRAP53
TBL2
E430025E21RIK
1110007C09RIK TRAF2
APEH
PARP8
POLR2A
GALNTL1
AIFM2
TMTC2 GORAB
FBXO31
POMGNT1
UBIAD1
THOC6
D10WSU102E NUP35
EP300
TBC1D9B
FBXO32
SLC35F1
RRM2B
POLR3H
ZFP369 SLIT2
ZFP560COG2
PRDM2
ZFP804A
ZFP275
TRAFD1
HSD17B7
OXSR1
CIDEA
TMEM214
TGFB3
PUS1 RRAGC
C030039L03RIKC87436
ZDHHC16
SFMBT1
ERI2
PCYT1B
ENOX1
CCDC130
SUMF2 SRSF1
EIF4ENIF1
TM9SF2
POR
MFAP3
TOM1
ZBTB43
VPS13D
RAP1A
1600021P15RIK
RAD17
2700050L05RIK CDC7
NGLY1
DDX18 TGS1
GFRA2
GLB1
STARD3NL ZEB2
FAM120B
TMEM109
ZFP60
FNDC3A
PXMP3
HSPA13
PGM2
3110043O21RIK
ZC3H12C
EDEM3
ITSN2
SLC29A1
CAPN7 HSPA2
ADCY2
ENDOD1 SYTL2
NUDT4
LAPTM4B
6530418L21RIK
SP4 PWP1
RAPGEF6
SURF4
SBF2
BBS2ZFP40
VTI1A
ZFP952
NPLOC4
INTS8
NEK9
ILVBL ZMYM5
SNRNP200
PABPC1
EXOSC10
VPS39
0610031J06RIK
OPRL1
DTL
RPAP2 CDK4 PACSIN2
MLL3
ARFGAP2 PIK3R1
HERC3
ADD1
IPO5 KLHL24
PDIA4
NCSTN
SORT1 FRY
SORBS1
REPS2 HMGCR
MYSM1 SPRY2
TPPP
TMEM170B
PPP3CA
FBXW7
FTL1
MAPRE1 GRIA1 CD200 LRRC17
GABRA1
4
0.
0
0.2
0.6
−2
−4
0 1 2 3 4 5
Average expression
Fig. 14 Identification of highly variable genes using Seurat’s FindVariableGenes function (see documen-
tation for details). This is an appropriate strategy for feature selection on scRNA-seq that does not contain UMIs
Identification of Cell Types from Single-Cell Transcriptomic Data 73
[Figure: correspondence between cluster labels; rows: Known; columns: Predicted; legend: Percentage (0 to 100)]
Fig. 15 Transcriptional correspondence between mouse cortical clusters reported in Tasic et al. [15] (rows) and those in this study (columns). Representation as in Fig. 11
This concludes the basic workflow. We can save files from the
analysis as follows.
[Figure: three panels (PVALB, SST, VIP) showing expression per cluster; x-axis: Identity (clusters 1 to 26)]
Fig. 16 Expression of three classic markers known to distinguish inhibitory neuronal categories: PVALB (top), SST (middle), and VIP (bottom)
Acknowledgments
References
1. Vickaryous MK, Hall BK (2006) Human cell type diversity, evolution, development, and classification with special reference to cells derived from the neural crest. Biol Rev Camb Philos Soc 81(3):425–455
2. Regev A et al (2017) The human cell atlas. Elife:6
3. Tosches MA et al (2018) Evolution of pallium, hippocampus, and cortical cell types revealed by single-cell transcriptomics in reptiles. Science 360(6391):881–888
4. Boisset JC et al (2018) Mapping the physical network of cellular interactions. Nat Methods
5. Tanay A, Regev A (2017) Scaling single-cell genomics from phenomenology to mechanism. Nature 541(7637):331–338
6. Trapnell C (2015) Defining cell types and states with single-cell genomics. Genome Res 25(10):1491–1498
7. Cleary B et al (2017) Efficient generation of transcriptomic profiles by random composite measurements. Cell 171(6):1424–1436.e18
8. Klein AM et al (2015) Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell 161(5):1187–1201
9. Macosko EZ et al (2015) Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell 161(5):1202–1214
10. Zheng GX et al (2017) Massively parallel digital transcriptional profiling of single cells. Nat Commun 8:14049
11. Habib N et al (2016) Div-Seq: single-nucleus RNA-Seq reveals dynamics of rare adult newborn neurons. Science 353(6302):925–928
12. Lake BB et al (2016) Neuronal subtypes and diversity revealed by single-nucleus RNA sequencing of the human brain. Science 352(6293):1586–1590
13. Shekhar K et al (2016) Comprehensive classification of retinal bipolar neurons by single-cell transcriptomics. Cell 166(5):1308–1323.e30
14. Villani A-C et al (2017) Single-cell RNA-seq reveals new types of human blood dendritic cells, monocytes, and progenitors. Science 356(6335):eaah4573
15. Tasic B et al (2016) Adult mouse cortical cell taxonomy revealed by single cell transcriptomics. Nat Neurosci 19(2):335–346
16. Zeng H, Sanes JR (2017) Neuronal cell-type classification: challenges, opportunities and the path forward. Nat Rev Neurosci 18(9):530
17. Stegle O, Teichmann SA, Marioni JC (2015) Computational and analytical challenges in single-cell transcriptomics. Nat Rev Genet 16(3):133
18. Arendt D (2008) The evolution of cell types in animals: emerging principles from molecular studies. Nat Rev Genet 9(11):868–882
19. Ecker JR et al (2017) The BRAIN initiative cell census consortium: lessons learned toward generating a comprehensive BRAIN cell atlas. Neuron 96(3):542–557
20. Kolodziejczyk AA et al (2015) The technology and biology of single-cell RNA sequencing. Mol Cell 58(4):610–620
21. Islam S et al (2014) Quantitative single-cell RNA-seq with unique molecular identifiers. Nat Methods 11(2):163
22. Menon V (2017) Clustering single cells: a review of approaches on high- and low-depth single-cell RNA-seq data. Brief Funct Genomics
23. Hicks SC, Teng M, Irizarry RA (2015) On the widespread and critical impact of systematic bias and batch effects in single-cell RNA-Seq data. bioRxiv:025528
24. Butler A et al (2018) Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat Biotechnol 36(5):411
25. Haghverdi L et al (2018) Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat Biotechnol 36:421–427
26. Lopez R et al (2018) Bayesian inference for a generative model of transcriptome profiles from single-cell RNA sequencing. bioRxiv:292037
27. Lee JH et al (2014) Highly multiplexed subcellular RNA sequencing in situ. Science 343(6177):1360–1363
28. Stahl PL et al (2016) Visualization and analysis of gene expression in tissue sections by spatial transcriptomics. Science 353(6294):78–82
29. Chen KH et al (2015) Spatially resolved, highly multiplexed RNA profiling in single cells. Science 348(6233):aaa6090
30. Lubeck E et al (2014) Single-cell in situ RNA profiling by sequential hybridization. Nat Methods 11(4):360
31. Fuzik J et al (2016) Integration of electrophysiological recordings with single-cell RNA-seq data identifies neuronal subtypes. Nat Biotechnol 34(2):175
32. Dixit A et al (2016) Perturb-Seq: dissecting molecular circuits with scalable single-cell RNA profiling of pooled genetic screens. Cell 167(7):1853–1866.e17
33. Stoeckius M et al (2017) Simultaneous epitope and transcriptome measurement in single cells. Nat Methods 14(9):865
34. Frieda KL et al (2017) Synthetic recording and in situ readout of lineage information in single cells. Nature 541(7635):107–111
35. Raj B et al (2018) Simultaneous single-cell profiling of lineages and cell types in the vertebrate brain. Nat Biotechnol 36(5):442–450
36. Pertea M et al (2016) Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. Nat Protoc 11(9):1650
37. Villani AC, Shekhar K (2017) Single-cell RNA sequencing of human T cells. Methods Mol Biol 1514:203–239
38. Satija R et al (2015) Spatial reconstruction of single-cell gene expression data. Nat Biotechnol 33(5):495–502
39. Lake BB et al (2018) Integrative single-cell analysis of transcriptional and epigenetic states in the human adult brain. Nat Biotechnol 36(1):70–80
40. Pandey S et al (2018) Comprehensive identification and spatial mapping of Habenular neuronal types using single-cell RNA-Seq. Curr Biol 28(7):1052–1065.e7
41. Andrews TS, Hemberg M (2017) Identifying cell populations with scRNASeq. Mol Asp Med
42. Brennecke P et al (2013) Accounting for technical noise in single-cell RNA-seq experiments. Nat Methods 10(11):1093
43. Keogh E, Mueen A (2017) Curse of dimensionality. In: Encyclopedia of machine learning and data mining. Springer, pp 314–315
44. Hotelling H (1933) Analysis of a complex of statistical variables into principal components. J Educ Psychol 24(6):417
45. Hyvärinen A, Karhunen J, Oja E (2004) Independent component analysis, vol 46. Wiley, New York
46. Lee DD, Seung HS (2001) Algorithms for non-negative matrix factorization. In: Leen TK, Dietterich TG, Tresp V (eds) Advances in neural information processing systems, vol 13. MIT, Cambridge, UK
47. Haghverdi L et al (2016) Diffusion pseudotime robustly reconstructs lineage branching. Nat Methods 13(10):845
48. Lancichinetti A, Fortunato S (2009) Community detection algorithms: a comparative analysis. Phys Rev E Stat Nonlinear Soft Matter Phys 80(5 Pt 2):056117
49. Levine JH et al (2015) Data-driven phenotypic dissection of AML reveals progenitor-like cells that correlate with prognosis. Cell 162(1):184–197
50. Maaten LVD, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9(Nov):2579–2605
51. Soneson C, Robinson MD (2018) Bias, robustness and scalability in single-cell differential expression analysis. Nat Methods 15(4):255
52. Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Chapter 5
Abstract
High-throughput single-cell technologies have great potential to discover new cell types. Here, we present
a novel computational method, called GiniClust (Jiang et al., Genome Biol 17(1):144, 2016), to overcome
the challenge of detecting rare cell types that are distinct from a large population.
Key words Clustering, Single-cell analysis, RNA-seq, qPCR, Gini index, Rare cell type
1 Introduction
The cell is the basic unit of structure and function in life; however, our knowledge of cell types remains largely incomplete. Interestingly, in many developmental and disease contexts, rare cell types have often been seen to play an important role although they make up only a small proportion of the cell population. For example, stem cells that give rise to newborn neurons in the adult brain are critical for reversing neurodegenerative diseases [1], and drug-resistant cells are a key barrier to curing cancer [2].
Genome-wide gene expression profiles are now widely accepted as a way to define cell types. Recent technical advances in massively parallel single-cell RNA-seq provide an unprecedented opportunity to discover cell types that previously went unrecognized because of their rarity. Although increasing the number of cells profiled in a single-cell transcriptome assay raises the chance that rare cells are sampled, tailored computational methods to detect them remain in high demand.
One of the major challenges is to identify genes that are associated with rare cell types without prior biological knowledge. GiniClust [3] adapts the Gini index, an informative statistical measure widely used in the social sciences, to select rare cell-type-associated genes. It is implemented in Python and R and can be applied to
Guo-Cheng Yuan (ed.), Computational Methods for Single-Cell Data Analysis, Methods in Molecular Biology, vol. 1935,
https://doi.org/10.1007/978-1-4939-9057-3_5, © Springer Science+Business Media, LLC, part of Springer Nature 2019
80 Lan Jiang
2 Materials
2.2 Python Packages
The graphical user interface of GiniClust relies on wxPython, a Python wrapper for the cross-platform wxWidgets API. Please ensure that you have Python 2.7 in your environment. In addition, GiniClust relies on the following libraries:
1. Gooey.
2. Setuptools.
These packages should be installed or upgraded automatically via a pip installation. For instance, to install Gooey, proceed as follows:
1. Start a terminal session.
2. Run $ pip install Gooey --upgrade
2.3 R Packages
As for the R code at the core of much of GiniClust’s computations, on Mac and Windows only the official R installation file (version 3.2.1 or higher) is supported and tested. Using other installation methods, such as brew, may lead to runtime errors. In addition, some users might experience issues installing another of GiniClust’s dependencies, the MAST R package (see Note 1).
2.4 Input Files
The input file is a gene expression matrix in comma-separated value (csv) format. Specifically, for qPCR data, each row contains the log2 gene expression levels of one gene; for RNA-seq data, each row contains the UMI counts per cell or the raw read counts per cell (see Note 2). The first row contains cell IDs. The first column contains unique gene names. For an example, users can take a look at one of our test datasets (stored in the sample_data folder within GiniClust’s repository).
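The layout described above can be illustrated with a minimal Python sketch. This is not part of GiniClust: the function name `read_expression_matrix`, the toy gene names, and the toy cell IDs are all made up for illustration, and the sketch only checks the two constraints stated in the text (cell IDs in the first row, unique gene names in the first column).

```python
import csv
import io

# Hypothetical toy file in the layout described above: first row = cell IDs,
# first column = unique gene names, remaining entries = counts.
TOY_CSV = """gene,cell_1,cell_2,cell_3
GeneA,0,3,1
GeneB,5,0,0
GeneC,2,2,7
"""


def read_expression_matrix(handle):
    """Return (cell_ids, gene_names, matrix) with one matrix row per gene."""
    reader = csv.reader(handle)
    header = next(reader)
    cell_ids = header[1:]              # first row holds the cell IDs
    gene_names, matrix = [], []
    for row in reader:
        if not row:
            continue                   # skip blank lines
        gene_names.append(row[0])      # first column holds the gene name
        matrix.append([float(v) for v in row[1:]])
    if len(set(gene_names)) != len(gene_names):
        raise ValueError("gene names must be unique")
    if any(len(r) != len(cell_ids) for r in matrix):
        raise ValueError("every gene row needs one value per cell")
    return cell_ids, gene_names, matrix


cell_ids, gene_names, matrix = read_expression_matrix(io.StringIO(TOY_CSV))
print(cell_ids)
print(gene_names)
```

Running a check of this kind before starting GiniClust may help catch duplicated gene names or ragged rows early.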
Rare Cell Type Detection 81
3 Methods
3.1 Run GiniClust Through Python-Based Graphical User Interface
To run GiniClust, please download the GiniClust GitHub repository from https://github.com/lanjiangboston/GiniClust/archive/master.zip, unzip it, and move to the extracted directory so that it becomes your current working directory. Then, in a Linux environment, start a terminal session and enter:
$ python GiniClust.py
$ pythonw GiniClust.py
3.2 Run GiniClust Through R Script Main Function
Alternatively, GiniClust can be run directly as an R script from the command-line interface. Users can run GiniClust in a terminal session using Rscript as follows:
minCellNum = 3
minGeneNum = 2000
expressed_cutoff = 1
log2.expr.cutoffl = 0
log2.expr.cutoffh = 20
Gini.pvalue_cutoff = 0.0001
Norm.Gini.cutoff = 1
span = 0.9
outlier_remove = 0.75
Gamma = 0.9
diff.cutoff = 1
lr.p_value_cutoff = 1e-5
CountsForNormalized = 100000
rare_p = 0.05
perplexity = 30
3.3 Run GiniClust Through R Script Step by Step
While GiniClust_Main.R is easier to use, running GiniClust step by step gives users more control and is thus more suitable for customized analysis. GiniClust stores intermediate files for each step, so users can conveniently check the results, change parameters if necessary, and rerun a single step before moving to the next one. Step-by-step instructions for running GiniClust are given below:
3.3.1 Loading Packages
This step first loads all the additional required packages. If GiniClust is being run for the first time, this step automatically installs all required R packages, so Internet access is necessary.
> source("Rfunction/GiniClust_packages.R")
> source("Rfunction/GiniClust_Preprocess.R")
> source("Rfunction/GiniClust_Filtering.R")
> source("Rfunction/GiniClust_Fitting.R")
> source("Rfunction/GiniClust_Clustering.R")
> source("Rfunction/GiniClust_tSNE.R")
> source("Rfunction/DE_MAST.R")
> source("Rfunction/DE_t_test.R")
3.3.2 Preprocessing
This step preprocesses and loads the input data. The input data for RNA-seq should be raw read counts or UMI counts. For qPCR data, since the input data are on the log2 scale, GiniClust transforms them back to the linear scale so that the later steps process the data consistently. The input variables are data.type, out.folder, and exprimentID, and the input file is “exprimentID_rawCounts.csv.” The output variable is “ExprM.RawCounts,” which is saved as a file named “exprimentID_rawCounts.csv.”
3.3.3 Filtering the Preprocessed Data
GiniClust provides some basic filtering functions. The input variable is “ExprM.RawCounts.” The output variable is “ExprM.RawCounts.filter.” The intermediate stored file is “exprimentID_gene.expression.matrix.RawCounts.filtered.csv.” The parameters involved and their default values are:
> minCellNum = 3
> minGeneNum = 2000
> expressed_cutoff = 1
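The role of these three thresholds can be sketched in Python. The exact semantics are not spelled out in the text, so the sketch below rests on labeled assumptions: a gene counts as expressed in a cell when its count is at least `expressed_cutoff`, cells with fewer than `min_gene_num` expressed genes are dropped first, and genes expressed in fewer than `min_cell_num` of the remaining cells are dropped afterwards. The function name `filter_matrix` is hypothetical and is not GiniClust's R code.

```python
def filter_matrix(matrix, gene_names, cell_ids,
                  min_cell_num=3, min_gene_num=2000, expressed_cutoff=1):
    """Filter a genes-by-cells count matrix (list of rows, one row per gene).

    Assumed semantics (not confirmed by the protocol text):
      * "expressed" means count >= expressed_cutoff;
      * cells with fewer than min_gene_num expressed genes are removed first;
      * genes expressed in fewer than min_cell_num kept cells are removed next.
    """
    n_cells = len(cell_ids)
    # Number of expressed genes in each cell (column-wise count).
    genes_per_cell = [
        sum(1 for row in matrix if row[j] >= expressed_cutoff)
        for j in range(n_cells)
    ]
    keep_cells = [j for j in range(n_cells) if genes_per_cell[j] >= min_gene_num]

    kept_rows, kept_genes = [], []
    for name, row in zip(gene_names, matrix):
        sub = [row[j] for j in keep_cells]
        if sum(1 for v in sub if v >= expressed_cutoff) >= min_cell_num:
            kept_rows.append(sub)
            kept_genes.append(name)
    return kept_rows, kept_genes, [cell_ids[j] for j in keep_cells]


# Toy example with deliberately small thresholds.
toy = [[1, 0, 2, 0],
       [0, 0, 3, 4],
       [5, 0, 1, 1],
       [0, 9, 0, 0]]
rows, genes, cells = filter_matrix(
    toy, ["g1", "g2", "g3", "g4"], ["c1", "c2", "c3", "c4"],
    min_cell_num=2, min_gene_num=2, expressed_cutoff=1)
print(genes, cells)
```

In the toy data, cell c2 expresses a single gene and is removed, after which gene g4 is no longer expressed in any kept cell and is removed as well.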
After modifying the parameters (see Note 3), run the code
below:
3.3.4 Selecting a Subset of the Genes for Clustering
This step normalizes the Gini index by LOESS curve fitting in Gini vs. Max space. The input variable is “ExprM.RawCounts.filter.” The output variable is “Genelist.top_pvalue” or “Genelist.HighNormGini.” The intermediate stored file is “Gini_related_table_RNA-seq.csv” or “Gini_related_table_qPCR.csv.” The parameters involved and their default values are:
> log2.expr.cutoffl = 0
> log2.expr.cutoffh = 20
> Gini.pvalue_cutoff = 0.0001
> Norm.Gini.cutoff = 1
> span = 0.9
> outlier_remove = 0.75
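For orientation, the raw Gini index that this step starts from can be computed as in the Python sketch below. It uses the standard formula for sorted values and is a generic illustration, not GiniClust's R implementation; the LOESS normalization and the p-value estimation are omitted, and the function name `gini_index` and the toy expression vectors are made up.

```python
def gini_index(values):
    """Gini index of non-negative values (0 = perfectly even, near 1 = concentrated).

    Standard formula on ascending-sorted x_1..x_n:
        G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0          # convention chosen here for empty or all-zero input
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


# A gene expressed evenly across 100 cells versus one expressed in only 5 cells.
even_gene = [4] * 100
rare_gene = [0] * 95 + [20] * 5
print(gini_index(even_gene), gini_index(rare_gene))
```

A gene whose expression is concentrated in a handful of cells gets a Gini index close to 1, which is why a high (normalized) Gini index is used as a marker of rare-cell-type association.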
After modifying the parameters (see Note 4), run the code
below:
3.3.5 Clustering
This step builds a cell-cell Jaccard distance matrix based on the genes selected in Subheading 3.3.4. DBSCAN is then used to detect clusters. The input variable is “ExprM.RawCounts.filter.” The output variables are “cell.cell.distance,” “c_membership,” “clustering_membership_r,” and “rare.cells.list.all.” The intermediate files are “exprimentID_clusterID.csv” and “exprimentID_rare_cells_list.txt.”
The parameters involved and default values are:
After modifying the parameters (see Note 5), run the code
below:
> table(c_membership)
> print(rare.cells.list.all)
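The idea behind this step can be sketched in plain Python, assuming (as in the GiniClust publication [3]) that the cell-cell distance is a Jaccard distance on binarized expression of the selected genes. The sketch takes each cell's set of expressed high-Gini genes as given and then runs a textbook DBSCAN on the precomputed distances. The values `eps=0.3` and `min_pts=2`, the function names, and the toy gene sets are illustrative assumptions, not GiniClust defaults.

```python
def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B| for two sets of expressed genes."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def dbscan(dist, eps, min_pts):
    """Textbook DBSCAN on a precomputed distance matrix.

    Returns one label per point: 0 for noise, 1..k for clusters.
    A point's neighborhood includes the point itself.
    """
    n = len(dist)
    labels = [None] * n
    cluster = 0
    for p in range(n):
        if labels[p] is not None:
            continue
        neighbors = [q for q in range(n) if dist[p][q] <= eps]
        if len(neighbors) < min_pts:
            labels[p] = 0                  # provisional noise
            continue
        cluster += 1
        labels[p] = cluster
        queue = [q for q in neighbors if q != p]
        while queue:
            q = queue.pop()
            if labels[q] == 0:
                labels[q] = cluster        # noise becomes a border point
            if labels[q] is not None:
                continue                   # already assigned
            labels[q] = cluster
            q_neighbors = [r for r in range(n) if dist[q][r] <= eps]
            if len(q_neighbors) >= min_pts:
                queue.extend(q_neighbors)  # q is a core point: keep expanding
    return labels


# Toy data: expressed high-Gini genes per cell (made-up names).
cells = {
    "c1": {"A", "B", "C"}, "c2": {"A", "B", "C"}, "c3": {"A", "B", "C", "D"},
    "c4": {"X", "Y"}, "c5": {"X", "Y"}, "c6": {"Q"},
}
names = sorted(cells)
dist = [[jaccard_distance(cells[a], cells[b]) for b in names] for a in names]
labels = dbscan(dist, eps=0.3, min_pts=2)
print(dict(zip(names, labels)))
```

Points labeled 0 are the ones DBSCAN leaves unassigned; they play a role comparable to the "singleton" group in the tSNE plot.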
3.3.6 tSNE Visualization
A nonlinear dimension reduction technique called tSNE is used to visualize the clustering results. The input variables are “c_membership” and “cell.cell.distance.” The output variable is “Rtnse_coord2.” The intermediate stored file is “exprimentID_Rtnse_coord2.csv.” A figure is also generated and stored in the work folder (Fig. 2). The parameters involved and their default values are:
> perplexity = 30
After modifying the parameters (see Note 6), run the code
below:
[Fig. 2: tSNE plot of cells; axes: tSNE_1 and tSNE_2; legend: singleton, cluster1, cluster2]
3.3.7 Differentially Expressed Genes
Comparing the single-cell expression data of each putative rare cell type cluster with those of the largest cluster identifies differentially expressed genes. The input variables are “ExprM.RawCounts.filter,” “rare.cells.list.all,” and “c_membership.” The output variable is “differential.r.” The intermediate stored files are “RareCluster.overlap_genes.txt” and either “RareCluster_lrTest.csv” or “RareCluster.diff.gene.t-test.results.csv.” The parameters involved and their default values are:
diff.cutoff = 1
lr.p_value_cutoff = 1e-5
After modifying the parameters (see Note 7), for RNA-seq data,
run the code below:
> DE_t_test(ExprM.RawCounts.filter,rare.cells.list.all,c_membership,out.folder,exprimentID)
3.4 Full List of Output Files and Description
The output directory specified by the user at the graphical user interface contains the following files and directories.
Main results:
exprimentID_rare_cells_list.txt: the clusters of rare cells detected by GiniClust.
RareCluster_lrTest.csv or RareCluster.diff.gene.t-test.results.csv: differentially expressed gene results for the rare cell type cluster.
Other supporting results:
exprimentID_rawCounts.csv: the raw counts.
exprimentID_normCounts.csv: the normalized counts.
exprimentID_gene.expression.matrix.RawCounts.filtered.csv: the raw counts after filtering.
exprimentID_gene.expression.matrix.normCounts.filtered.csv: the normalized counts after filtering.
Gini_related_table_RNA-seq.csv: the table of Gini index results for RNA-seq data.
Gini_related_table_qPCR.csv: the table of Gini index results for qPCR data.
exprimentID_clusterID.csv: the clustering result; the first column contains cell IDs and the second column contains the cluster assigned to each cell.
exprimentID_Rtnse_coord2.csv: coordinates of cells in the tSNE plot.
exprimentID_bi-directional.GiniIndexTable.csv: for qPCR data, the table of bidirectional Gini index values.
RareCluster.overlap_genes.txt: genes shared between the selected high Gini genes and the DE genes in the rare cluster.
Sub-folder ‘figures’:
exprimentID_histogram of Normalized.Gini.Socre.pdf: histogram of estimated p-values based on a normal distribution approximation for genes.
exprimentID_smoothScatter_pvalue_gene.pdf: the smoothScatter plot in which the red points are the high Gini genes selected according to the specified cutoff.
exprimentID_tsne_plot.pdf: tSNE plot for cells.
exprimentID_RareCluster_diff_gene_overlap.pdf: Venn diagram of differentially expressed genes and high Gini genes.
exprimentID_RareCluster_overlapgene_rawCounts_bar_plot.genename.pdf: bar plot of the rare cluster and the major cluster for the overlapping genes.
3.5 Further Reading
It is worth noting that while GiniClust is powerful for detecting rare cell type clusters, it is not sensitive for distinguishing major cell types. This limitation can be partially resolved by an updated version of GiniClust called GiniClust2. It uses a novel cluster-aware weighted
4 Notes
References
1. Habib N, Li Y, Heidenreich M, Swiech L, Avraham-Davidi I, Trombetta JJ, Hession C, Zhang F, Regev A (2016) Div-Seq: single-nucleus RNA-Seq reveals dynamics of rare adult newborn neurons. Science 353(6302):925–928. https://doi.org/10.1126/science.aad7038
2. Sharma SV, Lee DY, Li B, Quinlan MP, Takahashi F, Maheswaran S, McDermott U, Azizian N, Zou L, Fischbach MA, Wong KK, Brandstetter K, Wittner B, Ramaswamy S, Classon M, Settleman J (2010) A chromatin-mediated reversible drug-tolerant state in cancer cell subpopulations. Cell 141(1):69–80. https://doi.org/10.1016/j.cell.2010.02.027
3. Jiang L, Chen H, Pinello L, Yuan GC (2016) GiniClust: detecting rare cell types from single-cell gene expression data with Gini index. Genome Biol 17(1):144. https://doi.org/10.1186/s13059-016-1010-4
4. Tsoucas D, Yuan GC (2018) GiniClust2: a cluster-aware, weighted ensemble clustering method for cell-type detection. Genome Biol 19(1):58. https://doi.org/10.1186/s13059-018-1431-3
5. Liao Y, Smyth GK, Shi W (2014) Featurecounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 30(7):923–930. https://doi.org/10.1093/bioinformatics/btt656
6. Anders S, Pyl PT, Huber W (2015) HTSeq--a Python framework to work with high-throughput sequencing data. Bioinformatics 31(2):166–169. https://doi.org/10.1093/bioinformatics/btu638
7. Torre E, Dueck H, Shaffer S, Gospocic J, Gupte R, Bonasio R, Kim J, Murray J, Raj A (2018) Rare cell detection by single-cell RNA sequencing as guided by single-molecule RNA FISH. Cell Syst 6(2):171–179.e175. https://doi.org/10.1016/j.cels.2018.01.014
Chapter 6
Abstract
For decades, people have been trying to define cell type with the combination of expressed genes. The
choice of the limited number of genes for the classification limits the precision of this system. Here, we build
a “single-cell Mouse Cell Atlas (scMCA) analysis” pipeline based on scRNA-seq datasets covering all mouse
cell types. We build the scMCA reference and then use the tool “scMCA” to match single-cell digital
expression to its closest cell type.
1 Introduction
Huiyu Sun, Yincong Zhou, Lijiang Fei, and Haide Chen contributed equally to this work.
Guo-Cheng Yuan (ed.), Computational Methods for Single-Cell Data Analysis, Methods in Molecular Biology, vol. 1935,
https://doi.org/10.1007/978-1-4939-9057-3_6, © Springer Science+Business Media, LLC, part of Springer Nature 2019
92 Huiyu Sun et al.
2 Materials
2.1 scRNA-Seq Datasets Resource
We use scRNA-seq data with more than 800 mouse cell clusters from different tissues to build our scMCA reference (see Table 1). The datasets used to build the reference are listed below; details of the datasets are available via the MCA website http://bis.zju.edu.cn/MCA/gallery.html.
2.2 Software and Algorithms
You can use the MCA website or the R package scMCA to define the mouse cell types in your data. scMCA on the MCA website is available at http://bis.zju.edu.cn/MCA/blast.html. We have tested it on Firefox, Chrome, and Safari. For very large digital gene expression (DGE) data (DGE files larger than 200 MB), the R package is available on our GitHub page: https://github.com/ggjlab/scMCA. The software and algorithms used for scMCA are listed in Table 2.
Table 1
The datasets used for building scMCA reference
Datasets    Reference
Datasets of microwell-seq data (more than 40 tissues)    Han et al. [1]
Dataset of Arc-ME (the complex of the arcuate hypothalamus and median eminence)    Campbell et al. [2]
Dataset of pancreatic islets    Baron et al. [3]
Dataset of lung mesenchyme    Zepp et al. [4]
Dataset of whole mouse E8.25 embryos    Ibarra-Soria et al. [5]
Dataset of retina    Macosko et al. 2015 [6]
Table 2
The packages used for scMCA
Packages Source
Pheatmap https://cran.r-project.org/web/packages/pheatmap/index.html
Shiny http://shiny.rstudio.com/
Dplyr https://cran.r-project.org/web/packages/dplyr/index.html
Defining Mouse Cell Types by scMCA 93
3 Methods
3.1.2 Define Reference Values of Cell Clusters
By default, we randomly choose 100 cells from each cell cluster (all cells are chosen if the total number is less than 100) three times to obtain average gene expression values (see Note 2), and then convert the values to integers for each sample (see Note 3). The average of the three representative values is used as the cell cluster value in the reference.
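The sampling scheme above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the scMCA code: the function name `cluster_reference` is hypothetical, sampling is without replacement, and "taking the integer" is implemented here as rounding to the nearest integer, which the text does not confirm (Notes 2 and 3 may specify otherwise).

```python
import random


def cluster_reference(cells, n_cells=100, n_repeats=3, seed=0):
    """Reference expression values for one cell cluster.

    cells: list of per-cell expression vectors sharing the same gene order.
    Each repeat draws n_cells cells (or all cells if fewer are available),
    averages them per gene, and converts the averages to integers.
    The result is the per-gene mean over the repeats.
    """
    rng = random.Random(seed)
    n_genes = len(cells[0])
    samples = []
    for _ in range(n_repeats):
        chosen = cells if len(cells) < n_cells else rng.sample(cells, n_cells)
        means = [sum(c[g] for c in chosen) / len(chosen) for g in range(n_genes)]
        # Assumption: "integer of the values" is taken as rounding.
        samples.append([int(round(m)) for m in means])
    return [sum(s[g] for s in samples) / n_repeats for g in range(n_genes)]


# Toy cluster with only two cells, so every repeat uses all cells.
small_cluster = [[1, 4], [3, 4]]
reference_values = cluster_reference(small_cluster)
print(reference_values)
```

For clusters with at least 100 cells the three repeats generally differ, so averaging them smooths out the sampling noise of any single draw.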
3.1.3 Feature Gene Selection
We used the integer values of the three representative averaged samples to perform differential gene expression analysis. The top 10 genes (avg_logFC > 1) of each cell cluster were merged to make the reference feature gene list (3028 feature genes). The differential gene expression analysis was done with the Wilcoxon rank sum test using the R package Seurat.
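The merging of per-cluster marker genes into one feature list can be sketched in Python. The sketch assumes that differential expression results are already available as (gene, avg_logFC) pairs per cluster and that "top 10" means the ten largest avg_logFC values above the cutoff; the ranking criterion is an assumption, and the function name `select_feature_genes`, the cluster names, and the numbers are made up.

```python
def select_feature_genes(de_results, top_n=10, min_logfc=1.0):
    """Union of the top_n genes per cluster among those with avg_logFC > min_logfc.

    de_results: dict mapping cluster name -> list of (gene, avg_logFC) pairs.
    Assumption: genes are ranked by avg_logFC within each cluster.
    """
    features = set()
    for rows in de_results.values():
        passing = [(gene, fc) for gene, fc in rows if fc > min_logfc]
        passing.sort(key=lambda item: item[1], reverse=True)
        features.update(gene for gene, _ in passing[:top_n])
    return sorted(features)


# Toy differential expression results for two clusters (values are invented).
toy_de = {
    "cluster_A": [("Sst", 3.2), ("Pvalb", 0.8), ("Vip", 1.5), ("Gad1", 2.0)],
    "cluster_B": [("Aqp4", 4.0), ("Gad1", 1.2)],
}
feature_genes = select_feature_genes(toy_de, top_n=2)
print(feature_genes)
```

Because genes can be markers for more than one cluster, the merged list is shorter than the number of clusters times ten, which is consistent with 3028 feature genes for more than 800 clusters.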
3.1.4 Get the MCA Cell Type Reference
The representative values of more than 800 cell clusters with 3028 feature genes were used as the MCA cell type reference, and the reference was log-transformed before calculating scMCA scores.
3.2 Using scMCA on MCA Website
3.2.1 Submit RNA-Seq DGE File to scMCA Website
After uploading the scRNA-seq or bulk RNA-seq DGE file, you should click the “scMCA” button to run the scMCA pipeline (Fig. 1). The uploaded file should be a DGE matrix in .txt or .csv format, with each row representing a gene and each column representing a cell (or sample). The numerical expression values can be raw counts, FPKM, or RPKM. After submission, the DGE matrix is log-normalized (see Note 4) and the feature gene expression matrix is extracted (see Note 5).
3.2.2 Get Defined Results from scMCA Website
Pearson correlation coefficients between the extracted expression matrix and the MCA cell type reference are calculated, and these correlation coefficients are used as the scMCA scores in the scMCA pipeline. The scMCA results can be shown as an interactive heatmap, in which each row is a defined cell type, each column is a query cell, and the color of each block indicates the strength of the correlation (Fig. 2). One can also download the result table in csv format,
94 Huiyu Sun et al.
Fig. 1 Data submission interface of scMCA on MCA website. You can upload DGE file in txt or csv format by
clicking the “Add files” button and perform scMCA pipeline by clicking the “scMCA” button
Fig. 2 The scMCA results of website. The result can be shown by interactive heatmap, in which the row means
the defined cell type, the column means the query cell, and the colors of blocks indicate the strength of the
correlations. You can also click the “result from scMCA” button to download the file which records the scMCA
scores between query cells and cell types. The uploaded example dataset of distal lung epithelium is from
Treutlein et al. [7]
Fig. 3 The scMCA results of R package. You can choose option buttons on the left of web interface to adjust
the results. The uploaded example dataset of distal lung epithelium is from Treutlein et al. [7]
3.3 Using R Package scMCA
3.3.1 Installation of R Package scMCA
The scMCA package is hosted on Github. It can be conveniently installed via “devtools” by typing the following command in R:
devtools::install_github("ggjlab/scMCA")
3.3.2 Usage of R Package scMCA
The scMCA package has two main functions, scMCA and scMCA_vis. scMCA calculates the Pearson correlation coefficient between each query cell and each cell type. scMCA_vis is used to visualize the result returned from scMCA. To use scMCA, you should take the following steps:
1. Load the scRNA-seq or bulk RNA-Seq DGE file into the R environment.
2. Set the number of most relevant cell types to report for each query cell. The corresponding parameter is “number_plot” in function scMCA. You can type “?scMCA” in R for more information.
3. Execute scMCA and use scMCA_vis to view the results. scMCA_vis opens a web page on localhost that displays the results of scMCA (Fig. 3).
For more details, see the instructions for the R package scMCA on its Github page.
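To make the scoring step concrete, the following base R sketch computes Pearson correlations between a toy reference and two toy query cells and reports the best-matching cell type for each. All objects are simulated and all names are hypothetical; the log(x + 1) transform is our assumption, and the sketch does not call the scMCA package itself.

```r
# Toy sketch of correlation-based cell type scoring.
# Hypothetical data; this is not the scMCA implementation.
set.seed(2)
feature_genes <- paste0("gene", 1:30)
ref <- matrix(rexp(30 * 3, rate = 0.2), nrow = 30,
              dimnames = list(feature_genes, c("typeA", "typeB", "typeC")))
ref_log <- log(ref + 1)                      # log-transformed reference

# two query cells: noisy copies of typeA and typeC
query <- cbind(cell1 = ref[, "typeA"] * runif(30, 0.8, 1.2),
               cell2 = ref[, "typeC"] * runif(30, 0.8, 1.2))
query_log <- log(query + 1)

# rows: reference cell types, columns: query cells
scores <- cor(ref_log, query_log, method = "pearson")
best <- rownames(scores)[apply(scores, 2, which.max)]
```

Here scores plays the role of the scMCA score table, and best lists the top-scoring cell type per query cell.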
4 Notes
References
1. Han X, Wang R, Zhou Y et al (2018) Mapping the mouse cell atlas by microwell-Seq. Cell 172:1091–107.e17
2. Campbell JN, Macosko EZ, Fenselau H et al (2017) A molecular census of arcuate hypothalamus and median eminence cell types. Nat Neurosci 20:484–496
3. Baron M, Veres A, Wolock SL et al (2016) A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. Cell Syst 3:346–60.e4
4. Zepp JA, Zacharias WJ, Frank DB et al (2017) Distinct Mesenchymal lineages and niches promote epithelial self-renewal and myofibrogenesis in the lung. Cell 170:1134–48.e10
5. Ibarra-Soria X, Jawaid W, Pijuan-Sala B et al (2018) Defining murine organogenesis at single-cell resolution reveals a role for the leukotriene pathway in regulating blood progenitor formation. Nat Cell Biol 20:127–134
6. Macosko EZ, Basu A, Satija R et al (2015) Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell 161:1202–1214
7. Treutlein B, Brownfield DG, Wu AR et al (2014) Reconstructing lineage hierarchies of the distal lung epithelium using single-cell RNA-seq. Nature 509:371–375
Chapter 7
Abstract
Integrating prior knowledge of pathway-level information can enhance power and facilitate interpretation
of gene expression data analyses. Here, we provide a practical demonstration of the value of gene set or
pathway enrichment testing and extend such techniques to identify and characterize transcriptional sub-
populations from single-cell RNA-sequencing data using pathway and gene set overdispersion analysis
(PAGODA).
Key words Single cell, Pathway, Gene set enrichment analysis, Differential expression analysis,
Clustering
1 Introduction
Guo-Cheng Yuan (ed.), Computational Methods for Single-Cell Data Analysis, Methods in Molecular Biology, vol. 1935,
https://doi.org/10.1007/978-1-4939-9057-3_7, © Springer Science+Business Media, LLC, part of Springer Nature 2019
98 Jean Fan
[Figure 1: schematic showing a gene set (gene1 to gene7), a gene universe (gene1 to gene57) ranked from upregulated to downregulated genes along the gene ranking axis, and the running-sum statistic (enrichment score) plotted against the ranked gene names]
Fig. 1 A standard gene set enrichment plot. Genes in the gene universe are ranked according to a differential
expression statistic from most upregulated to most downregulated. A running-sum statistic then traverses the
ranked list and increments the enrichment score statistic upon reaching a gene within the gene set of interest
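The running-sum idea in Fig. 1 can be sketched in a few lines of base R. This is a simplified, unweighted version for illustration only; the function name running_sum_es and the toy values are hypothetical, and the liger package implements the full GSEA procedure with permutation-based significance.

```r
# Toy running-sum enrichment score (unweighted, Kolmogorov-Smirnov-like).
# Hypothetical helper; this is not the liger implementation.
running_sum_es <- function(values, geneset) {
  ranked <- names(sort(values, decreasing = TRUE))  # most upregulated first
  in_set <- ranked %in% geneset
  n_hit <- sum(in_set)
  n_miss <- length(ranked) - n_hit
  stopifnot(n_hit > 0, n_miss > 0)
  # step up at genes in the set, step down otherwise
  step <- ifelse(in_set, 1 / n_hit, -1 / n_miss)
  running <- cumsum(step)
  # enrichment score: the running sum at its largest deviation from zero
  list(running = running, es = running[which.max(abs(running))])
}

# 20 genes with a differential expression statistic from +2 down to -2
vals <- setNames(seq(2, -2, length.out = 20), paste0("gene", 1:20))
gs <- c("gene1", "gene2", "gene4")   # a gene set concentrated at the top
res <- running_sum_es(vals, gs)
res$es
```

Because all three genes of the toy set sit near the top of the ranking, the running sum peaks early and the enrichment score is close to its maximum of 1.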
2 Materials
2.1 Liger R Package
In Subheading 3.1, we will perform gene set enrichment analysis using the Lightweight Iterative Gene set Enrichment in R (liger) package, an R implementation of the GSEA algorithm [5]. liger can be installed from CRAN using the following command in R:
install.packages("liger")
2.2 Scde R Package
In Subheadings 3.2 and 3.3, we will perform pathway and gene set overdispersion analysis (PAGODA) using the Single Cell Differential Expression (scde) package. scde can be installed from Bioconductor using the following command in R:
# try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("scde")
# on newer Bioconductor releases (3.8 and later), biocLite has been replaced by BiocManager:
# install.packages("BiocManager"); BiocManager::install("scde")
3 Methods
library(liger)
2. Load a gene set based on Gene Ontology (GO) terms.
# load gene set
data("org.Hs.GO2Symbol.list")
[Figure 2: heatmap; rows: genes, columns: cells, color scale: expression magnitude]
Fig. 2 Gene expression heatmap for select simulated genes. Rows are genes and columns are cells. Gene
expression is colored using a color ramp from blue to white to red, with highly expressed genes colored in red
and lowly expressed genes in blue. Column side bar is colored using the cell group labels, with group 1 cells
labeled in red, and group 2 cells labeled in blue. Row side bar is colored in green if the gene is within our
selected gene set and black if not
[Figure 3: barplot; y-axis: -log10(p-value), from 0.0 to 3.0; x-axis: genes DNA2, OPA1, STOML2, MGME1, AKT3, C10orf2, LIG3, MEF2A, MPV17, PID1, PRIMPOL, SLC25A33, SLC25A36, SLC25A4, TYMP]
Fig. 3 Differential expression analysis results for select simulated genes. Barplot shows -log10(p-value) for each gene. Red line shows the p = 0.05 significance threshold. Note that none of the tested genes passes the significance threshold
# run iterative bulk gsea on our true gene set and 9 other gene sets as test
gseaVals <- iterative.bulk.gsea(
  values = vals,
  set.list = org.Hs.GO2Symbol.list[1:10],
  rank = TRUE)
## initial: [1e+02 - 3] [1e+03 - 1] [1e+04 - 1] done
print(gseaVals)
## p.val q.val sscore edge
## GO:0000002 0.00009999 0.00059994 2.5584741 2.0848842
## GO:0000003 0.66336634 0.66336634 0.4924230 0.3374948
## GO:0000012 0.11888112 0.25774226 -0.9737758 -0.1256842
## GO:0000014 0.24752475 0.36831683 0.6518057 -0.5915193
## GO:0000018 0.30693069 0.36831683 -0.7279604 0.9366813
## GO:0000022 0.12887113 0.25774226 0.9455950 -0.4223886
# look at plots
for (i in seq_along(gseaSig)) {
  gs <- org.Hs.GO2Symbol.list[[gseaSig[i]]]
  gsea(values = vals, geneset = gs, mc.cores = 1, plot = TRUE, rank = TRUE)
}
So, although no individual gene was found to be statistically
significantly differentially expressed between our two biological
states, gene set and pathway enrichment analysis identified a signifi-
cantly enriched gene set, GO:0000002, which is exactly the gene
set that we simulated to show concordant differences. By looking
for coordinated changes in genes within these a priori defined gene
sets, we are able to increase our statistical power to identify differ-
ences between our two biological states.
Differential Pathway Analysis 105
[Figure 4: gene set enrichment plot; y-axis: score, from 0 to 15000; annotation: P value < 1e-04]
Fig. 4 Gene set enrichment plot for gene set GO:0000002 demonstrates significant enrichment as simulated
3.2 Applying a Pathway-Integrated Approach with Pathway and Gene Set Overdispersion Analysis
Gene set testing with methods such as liger can be used for differential expression analysis to increase statistical power and uncover likely functional interpretations. However, such testing requires knowledge of the biological conditions or subpopulations to compare. To identify these transcriptionally distinct subpopulations, a similar rationale can be applied in single-cell RNA-seq data analysis. Highly variable genes may partition cells into transcriptionally distinct subpopulations but carry considerable uncertainty, as observed variability in gene expression may be the result of technical artifacts such as drop-outs. Yet whereas variability in the expression of a single gene may be noisy, coordinated upregulation of many genes within a gene set or pathway in the same subset of cells could provide a prominent signature to distinguish subpopulations.
Pathway And Gene set Over-Dispersion Analysis (PAGODA)
[6] looks for coordinated expression variability of genes in both
annotated pathways and automatically detected “de novo” gene
sets. PAGODA then uses this gene set and pathway-level informa-
tion to cluster cells into transcriptional subpopulations.
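As a rough illustration of why a pathway-level summary can be more informative than any single gene, the sketch below simulates ten weakly shifted genes and summarizes them by their first principal component. It uses only base R and simulated data; the names are hypothetical, and it omits the error modeling and variance renormalization that PAGODA performs.

```r
# Toy sketch: summarize a gene set by the first principal component (PC1)
# of its genes. Hypothetical data; this is not the scde/PAGODA code.
set.seed(42)
n_cells <- 40
group <- rep(c(1, 2), each = n_cells / 2)   # two hidden subpopulations

# cells x genes matrix (transposed relative to the genes x cells convention):
# 10 genes in the set, each weakly shifted in group 2 plus noise
expr <- sapply(1:10, function(g) rnorm(n_cells, mean = 0.8 * (group == 2)))
colnames(expr) <- paste0("gene", 1:10)

pc <- prcomp(expr, center = TRUE, scale. = FALSE)
pc1 <- pc$x[, 1]                            # one pathway-level score per cell

# separation of the two groups: per gene versus the pathway-level score
gene_t <- apply(expr, 2, function(x) abs(t.test(x ~ group)$statistic))
pc1_t <- abs(t.test(pc1 ~ group)$statistic)
```

Comparing pc1_t with the values in gene_t shows how much of the group structure the combined score captures relative to each individual gene in this simulation.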
Briefly, PAGODA first estimates the effective sequencing
depth, drop-out rate, and amplification noise of each cell using a
previously described mixture-model approach with minor enhance-
ments. Using these models, the observed expression variance of
each gene is renormalized on the basis of the expected genome-
Fig. 5 Gene expression heatmap with cells grouped by hierarchical clustering shows inconsistency with cell
group labels. Rows are genes and columns are cells. Gene expression is colored using a color ramp from blue
to white to red, with highly expressed genes colored in red and lowly expressed genes in blue. Column side bar
is colored using the cell group labels, with group 1 cells labeled in red, and group 2 cells labeled in blue. Row
side bar is colored in green if the gene is within our selected gene set and black if not
library(scde)
[Figure 6: heatmap with rows labeled #PC1# GO:0000018, #PC1# GO:0000003, #PC1# GO:0000002]
Fig. 6 Pathway expression heatmap with cells grouped by hierarchical clustering shows consistency with cell
group labels. Rows are pathways and columns are cells. Pathway expression, summarized by the first
principal component (PC1) of gene expressions for genes within the pathway, is colored using a color ramp
from blue to white to red. Column side bar is colored using the cell group labels, with group 1 cells labeled in
red, and group 2 cells labeled in blue
library(scde)
data(pollen)
# remove poor cells and genes
cd <- clean.counts(pollen)
# check the final dimensions of the read count matrix
dim(cd)
## [1] 11310 64
10. We can then view these top aspects in a heatmap (Fig. 7). Indeed, we see a correspondence between our derived cell annotations and the previously published annotations.
Fig. 7 Pathway expression heatmap for single-cell RNA-seq data from Pollen
et al. The columns are cells and the rows represent a cluster of pathways. The
row names are assigned to be the top overdispersed aspect in each cluster. The
green-to-orange color scheme shows low-to-high weighted PC scores (aspect
patterns), where generally orange indicates higher expression and green lower
expression. The column colors are cell annotations from the original publication
# compile a browsable app, showing top three clusters with the top color bar
app <- make.pagoda.app(tamr2, tam, varinfo, go.env, pwpca, clpca,
                       col.cols = col.cols, cell.clustering = hc, title = "NPCs")
# show app in the browser (port 1468)
show.app(app, "pollen", browse = TRUE, port = 1468)
The PAGODA app allows you to view the gene sets grouped
within each aspect (row), as well as genes underlying the detected
heterogeneity patterns. In this manner, you can interactively
explore the pathways and genes driving each identified transcrip-
tional subpopulation.
Acknowledgment
References
1. Soneson C, Delorenzi M (2013) A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics 14:91. https://doi.org/10.1186/1471-2105-14-91
2. Jaakkola MK, Seyednasrollah F, Mehmood A, Elo LL (2016) Comparison of methods to detect differentially expressed genes between single-cell populations. Brief Bioinform 18:bbw057. https://doi.org/10.1093/bib/bbw057
3. Kharchenko PV, Silberstein L, Scadden DT (2014) Bayesian approach to single-cell differential expression analysis. Nat Methods 11:740–742. https://doi.org/10.1038/nmeth.2967
4. Finak G, McDavid A, Yajima M, Deng J, Gersuk V, Shalek AK, Slichter CK, Miller HW, McElrath MJ, Prlic M, Linsley PS, Gottardo R (2015) MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biol 16:278. https://doi.org/10.1186/s13059-015-0844-5
5. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 102:15545–15550. https://doi.org/10.1073/pnas.0506580102
6. Fan J, Salathia N, Liu R, Kaeser GE, Yung YC, Herman JL, Kaper F, Fan J-B, Zhang K, Chun J, Kharchenko PV (2016) Characterizing transcriptional heterogeneity through pathway and gene set overdispersion analysis. Nat Methods 13:241–244. https://doi.org/10.1038/nmeth.3734
7. Wagner F (2016) The XL-mHG test for enrichment: algorithms, bounds, and power. https://doi.org/10.7287/peerj.preprints.1962v1
8. Huang DW, Sherman BT, Lempicki RA (2009) Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res 37:1–13. https://doi.org/10.1093/nar/gkn923
9. R Core Team (2017) R: a language and environment for statistical computing
10. Liberzon A, Birger C, Thorvaldsdóttir H, Ghandi M, Mesirov JP, Tamayo P (2015) The molecular signatures database (MSigDB) hallmark gene set collection. Cell Syst 1:417–425. https://doi.org/10.1016/j.cels.2015.12.004
11. Dunnett CW (1955) A multiple comparison procedure for comparing several treatments with a control. J Am Stat Assoc 50:1096–1121. https://doi.org/10.1080/01621459.1955.10501294
12. Pollen AA, Nowakowski TJ, Shuga J, Wang X, Leyrat AA, Lui JH, Li N, Szpankowski L, Fowler B, Chen P, Ramalingam N, Sun G, Thu M, Norris M, Lebofsky R, Toppani D, Kemp DW, Wong M, Clerkson B, Jones BN, Wu S, Knutsson L, Alvarado B, Wang J, Weaver LS, May AP, Jones RC, Unger MA, Kriegstein AR, West JAA (2014) Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex. Nat Biotechnol 32:1053–1058. https://doi.org/10.1038/nbt.2967
Chapter 8
Abstract
In many single-cell RNA-seq (scRNA-seq) experiments, cells represent progressively changing states along
a continuous biological process. A useful approach to analyzing data from such experiments is to computa-
tionally order cells based on their gradual transition of gene expression. The ordered cells can be viewed as
samples drawn from a pseudo-temporal trajectory. Analyzing gene expression dynamics along the pseudo-
time provides a valuable tool for reconstructing the underlying biological process and generating biological
insights. TSCAN is an R package to support in silico reconstruction of cells’ pseudotime. This chapter
introduces how to apply TSCAN to scRNA-seq data to perform pseudotime analysis.
Key words Single-cell RNA-seq, Gene expression, Pseudotime, Minimum spanning tree, Genomics,
Bioinformatics
1 Introduction
Guo-Cheng Yuan (ed.), Computational Methods for Single-Cell Data Analysis, Methods in Molecular Biology, vol. 1935,
https://doi.org/10.1007/978-1-4939-9057-3_8, © Springer Science+Business Media, LLC, part of Springer Nature 2019
116 Zhicheng Ji and Hongkai Ji
2 Materials
Fig. 1 TSCAN analysis workflow. Starting from a preprocessed gene expression matrix, TSCAN constructs
cells’ pseudo-temporal ordering using the following steps: reduce dimension of the gene expression matrix,
group cells into clusters, construct MST that links the cluster centers to create the backbone of the pseudo-
temporal trajectory, and project individual cells onto tree edges to obtain pseudotime
2.2 TSCAN Package
TSCAN is an open-source R package that can be freely downloaded and installed from Bioconductor [12] or Github [13]. After installing R, the latest TSCAN can be installed from Github by typing the following commands in R:
if (!require("devtools"))
  install.packages("devtools")
devtools::install_github("zji90/TSCAN")
2.3 ScRNA-Seq Data
We assume that raw scRNA-seq data have already been processed and summarized into a matrix of normalized gene expression values (see Note 1). In this matrix, each row represents a gene and each column represents a cell. The matrix is stored in a tab-delimited text file. The first row of the file stores cell names, and the first column contains gene names.
In this chapter, we will demonstrate the pseudotime analysis using a scRNA-seq dataset of differentiating human skeletal muscle myoblasts [1]. The dataset contains 271 cells collected at 0, 24, 48, and 72 h after switching human myoblasts to low serum. The log2 transformed gene expression matrix (HSMM_Y) can be downloaded from Github [14].
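To check that a file follows the expected layout, it can help to write and re-read a tiny example. The sketch below uses only base R; the gene names, cell names, and temporary file are hypothetical stand-ins for a real dataset.

```r
# Toy file in the layout described above: genes in rows, cells in columns,
# tab-delimited, cell names in the first row, gene names in the first column.
toy <- matrix(c(1.5, 0, 2.3, 0.7, 3.1, 0), nrow = 3,
              dimnames = list(c("GENE_A", "GENE_B", "GENE_C"),
                              c("cell_1", "cell_2")))
path <- tempfile(fileext = ".txt")
# col.names = NA writes an empty corner cell so the header aligns with the columns
write.table(toy, path, sep = "\t", quote = FALSE, col.names = NA)

# read it back the same way a real expression file would be read
data <- as.matrix(read.table(path, as.is = TRUE, header = TRUE, sep = "\t",
                             row.names = 1))
dim(data)   # genes x cells
```

If dim(data) reports cells in rows instead of columns, the matrix should be transposed with t() before further analysis.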
3 Methods
3.1 Loading Data and Preprocessing
The first step of TSCAN is to load the scRNA-seq data into R and convert it into a matrix object. For example, the HSMM dataset, which is stored in a tab-delimited text file, can be loaded using the following R command (see Note 2):
library(TSCAN)
data <- as.matrix(read.table("HSMM_Y.txt", as.is = TRUE, header = TRUE, sep = "\t",
                             row.names = 1))
In this plot, each dot represents a cell. Different cell clusters are
represented using different colors and marker shapes. Cluster cen-
ters are marked with numbers.
3.3 Pseudotime Reconstruction
Once cells are clustered, TSCAN will construct an MST to connect the cluster centers. This tree will serve as the backbone of the pseudo-temporal trajectory. The tree may have multiple branches. By default, TSCAN will choose the path with the largest number of cell clusters as the main pseudo-temporal path. If two paths have the same number of cell clusters, the path with the largest number of cells will be chosen as the main path. For example, the main path in Fig. 2a will be path 3-1-4.
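For readers unfamiliar with minimum spanning trees, the following base R sketch builds one over four toy cluster centers using Prim's algorithm. The centers and the helper prim_mst are hypothetical, and TSCAN's own implementation may differ; the sketch only illustrates how cluster centers become the tree backbone.

```r
# Toy sketch: minimum spanning tree over cluster centers (Prim's algorithm).
# Hypothetical centers and helper; TSCAN's own implementation may differ.
centers <- rbind(c(0, 0), c(1, 0), c(2, 0.2), c(1, 1.5))
rownames(centers) <- paste0("cluster", 1:4)
d <- as.matrix(dist(centers))            # Euclidean distances between centers

prim_mst <- function(d) {
  n <- nrow(d)
  in_tree <- c(TRUE, rep(FALSE, n - 1))  # start from the first center
  edges <- matrix(NA_integer_, nrow = n - 1, ncol = 2)
  for (k in seq_len(n - 1)) {
    # shortest edge from a node inside the tree to a node outside it
    sub <- d[in_tree, !in_tree, drop = FALSE]
    idx <- which(sub == min(sub), arr.ind = TRUE)[1, ]
    from <- which(in_tree)[idx[1]]
    to <- which(!in_tree)[idx[2]]
    edges[k, ] <- c(from, to)
    in_tree[to] <- TRUE
  }
  edges
}

mst_edges <- prim_mst(d)                 # each row is one tree edge (from, to)
total_length <- sum(d[mst_edges])        # total length of the tree
```

In this toy layout cluster 2 is the hub, so the tree has three branches and a main path would have to be chosen among them, as described above.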
Sometimes, some branches of the tree are not of biological
interest. For example, some cell clusters and tree branches represent
contaminated cells. To help identify and remove such branches, one
can use TSCAN to visualize expression of marker genes. To dem-
onstrate, the following R commands show the expression of two
marker genes in the HSMM data: PDGFRA (Fig. 2b) and MYOG
(Fig. 2c).
Fig. 2 TSCAN analysis of the HSMM data. A. Scatterplot showing PC2, PC3, and cell clustering. Path of interest
is 3-1-4. B. Visualization of PDGFRA marker gene expression. C. Visualization of MYOG marker gene
expression. D. Scatterplot showing PC2, PC3, and cell clustering. Path of interest is 2-1-4
singlegeneplot(data["ENSG00000125414.13",],HSMMTSCANorder214)
[Figure 3a: scatterplot; x-axis: Pseudotime, from 0 to 150; y-axis: Expression, from 0 to 6; legend: State (1, 2, 4)]
Fig. 3 Differential gene detection by TSCAN. (a) Scatterplot showing expression of MYH2 gene along the
pseudotemporal path 2-1-4. (b) An example of TSCAN output that shows a few top differentially expressed
genes
3.4 Differential Gene Expression Analysis
Given cells’ pseudo-temporal ordering, one can detect genes with significant expression changes along pseudotime. To detect such genes, TSCAN fits a generalized additive model (GAM) for each gene to describe the functional relationship between gene expression and pseudotime. The fitted model is compared to a null model that assumes constant expression along pseudotime. The model fitting is performed using the mgcv package in R [16]. A likelihood ratio test is conducted to obtain a p-value. To account for multiple testing, p-values are converted to false discovery rates (FDR). By default, genes with FDR < 0.05 are reported as differential. To demonstrate, the following command performs differential analysis along the pseudo-temporal path 2-1-4:
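The testing logic described above can be sketched on simulated data as follows. This is a minimal illustration, assuming a smooth term with basis dimension k = 3 and an approximate chi-square reference distribution; the helper gam_pvalue and the toy genes are hypothetical, and the sketch is not the TSCAN function itself.

```r
library(mgcv)   # recommended package distributed with R

set.seed(7)
pseudotime <- seq_len(100)
# two toy genes: one changes smoothly along pseudotime, one is flat
expr <- rbind(changing = 2 * sin(pseudotime / 30) + rnorm(100, sd = 0.3),
              flat     = rnorm(100, sd = 0.3))

# approximate likelihood ratio test: smooth model versus constant model
gam_pvalue <- function(y, ptime) {
  full <- gam(y ~ s(ptime, k = 3))   # expression as a smooth function of pseudotime
  null <- gam(y ~ 1)                 # constant expression
  ll_full <- logLik(full)
  ll_null <- logLik(null)
  stat <- as.numeric(2 * (ll_full - ll_null))
  df <- attr(ll_full, "df") - attr(ll_null, "df")
  pchisq(max(stat, 0), df = max(df, 1e-8), lower.tail = FALSE)
}

pvals <- apply(expr, 1, gam_pvalue, ptime = pseudotime)
fdr <- p.adjust(pvals, method = "fdr")   # convert p-values to FDR
names(fdr)[fdr < 0.05]                   # genes reported as differential
```

With only two genes the FDR adjustment changes little, but on a full expression matrix it controls the expected fraction of false positives among the reported genes.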
4 Notes
After loading the cell clustering results, users can then proceed
with the remaining steps of pseudotime reconstruction.
5. For user’s convenience, TSCAN also provides a graphical user
interface (GUI) to perform pseudotime analysis. Most of the
functions and commands described above have corresponding
buttons in the GUI. Instead of typing the commands in R, one
can also use GUI to execute the same functions. This usually
only requires one to click a few buttons. The link to the online
version of TSCAN GUI can be found on TSCAN’s Github
page [13]. On that page, there is a video demonstrating how to
use the GUI. Since the video is quite straightforward, we will
not repeat the demonstration of GUI here. The TSCAN GUI
can also be invoked locally in R on user’s own computer using
the following command:
TSCANui()
References
1. Trapnell C, Cacchiarelli D, Grimsby J et al (2014) The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat Biotechnol 32(4):381–386
2. Zheng C, Zheng L, Yoo JK et al (2017) Landscape of infiltrating T cells in liver cancer revealed by single-cell sequencing. Cell 169(7):1342–1356
3. Shalek AK, Satija R, Shuga J et al (2014) Single-cell RNA-seq reveals dynamic paracrine control of cellular variation. Nature 510(7505):363–369
4. Marco E, Karp RL, Guo G et al (2014) Bifurcation analysis of single-cell gene expression data reveals epigenetic landscape. Proc Natl Acad Sci U S A 111(52):E5643–E5650
5. Shin J, Berg DA, Zhu Y et al (2015) Single-cell RNA-Seq with waterfall reveals molecular cascades underlying adult neurogenesis. Cell Stem Cell 17(3):360–372
6. Ji Z, Ji H (2016) TSCAN: pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis. Nucleic Acids Res 44(13):e117–e117
7. Haghverdi L, Buettner M, Wolf FA et al (2016) Diffusion pseudotime robustly reconstructs lineage branching. Nat Methods 13(10):845–848
8. Wouter S, Robrecht C, Helena T, et al (2018) A comparison of single-cell trajectory inference methods: towards more accurate and robust tools. bioRxiv. https://doi.org/10.1101/276907
9. Herring CA, Chen B, McKinley ET et al (2018) Single-cell computational strategies for lineage reconstruction in tissue systems. Cell Mol Gastroenterol Hepatol 5(4):539–548
10. Street K, Risso D, Fletcher RB et al (2018) Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genomics 19(1):477
11. R project. https://www.r-project.org/
12. TSCAN R package on Bioconductor. https://www.bioconductor.org/packages/release/bioc/html/TSCAN.html
13. TSCAN R package on Github. https://github.com/zji90/TSCAN
14. HSMM single-cell RNA-seq dataset. https://raw.githubusercontent.com/zji90/TSCANdata/master/HSMM_Y.txt
15. Fraley C, Raftery AE (2002) Model-based clustering, discriminant analysis, and density estimation. J Am Stat Assoc 97(458):611–631
16. Wood SN (2011) Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. J Royal Stat Soc Sec B 73(1):3–36
17. Maaten LVD, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9:2579–2605
Chapter 9
Abstract
The ability to measure molecular properties (e.g., mRNA expression) at the single-cell level is revolutioniz-
ing our understanding of cellular developmental processes and how these are altered in diseases like cancer.
The need for computational methods aimed at extracting biological knowledge from such single-cell data
has never been greater. Here, we present a detailed protocol for estimating differentiation potency of single
cells, based on our Single-Cell ENTropy (SCENT) algorithm. The estimation of differentiation potency is
based on an explicit biophysical model that integrates the RNA-Seq profile of a single cell with an
interaction network to approximate potency as the entropy of a diffusion process on the network. We
here focus on the implementation, providing a step-by-step introduction to the method and illustrating it
on a real scRNA-Seq dataset profiling human embryonic stem cells and multipotent progenitors represent-
ing the 3 main germ layers. SCENT is aimed particularly at single-cell studies trying to identify novel stem-
or-progenitor like phenotypes, and may be particularly valuable for the unbiased identification of cancer
stem cells. SCENT is implemented in R, licensed under the GNU General Public Licence v3, and freely
available from https://github.com/aet21/SCENT.
1 Introduction
Guo-Cheng Yuan (ed.), Computational Methods for Single-Cell Data Analysis, Methods in Molecular Biology, vol. 1935,
https://doi.org/10.1007/978-1-4939-9057-3_9, © Springer Science+Business Media, LLC, part of Springer Nature 2019
126 Weiyan Chen and Andrew E. Teschendorff
2 Materials
2.2 RNA-Seq Data
The first data input for the computation of differentiation potency using the SCENT package is a single-cell RNA-Seq dataset. The procedure is identical for bulk RNA-Seq data, the only difference being in the specific preprocessing and normalization of the data. To illustrate the method, we shall use a scRNA-Seq dataset from Chu et al. [19], generated with the Fluidigm C1 platform. This dataset can be downloaded from the GEO website under accession number GSE75748, i.e., via https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE75748, and the specific file to download is GSE75748_sc_cell_type_ec.csv.gz. It contains scRNA-Seq profiles for
Table 1
Distribution and number of single-cell types in Chu et al. dataset
2.3 User-Defined Functional Gene Network
The second required input for the SCENT algorithm is a user-defined functional gene network, for instance, a protein-protein interaction (PPI) network documenting the main interactions that take place in a cell. For justification as to why a PPI network is needed, see Notes 1 and 2. Although these networks are mere caricatures of the underlying signaling networks, ignoring time, spatial, and biological contexts, some features of the network may nevertheless be informative, and as we shall see below this is indeed the case. For instance, if protein-A is a hub (a node of very high connectivity) and protein-B has only a few connections, then it is likely that protein-A will have a higher connectivity than protein-B in any particular biological context (unless of course protein-A is absent from the cell altogether). As we shall see later, the scRNA-Seq data will provide us with the biological context in which to generate context or cell-type-specific networks. The specific PPI network we use here is derived from Pathway Commons (www.pathwaycommons.org) (downloaded in Jan. 2016), which is an integrated resource collating PPIs from several distinct sources. In particular, the network is constructed by integrating the following sources: the Human Protein Reference Database (HPRD), the National Cancer Institute Nature Pathway Interaction Database (NCI-PID) (http://pid.nci.nih.gov), the Interactome (Intact), and the Molecular Interaction Database (MINT). Protein interactions in this network include physical stable
Estimating Differentiation Potency of Single Cells Using Single-Cell. . . 129
3 Methods
3.1.1 Checking and Further Preprocessing of the scRNA-Seq Data
We assume that the scRNA-Seq data have been appropriately normalized. Given a count matrix of reads mapping to genes, we assume that the user has run this count matrix through a single-cell processing and quality control pipeline (see, e.g., [27]), such as that provided by the Bioconductor package scater [28] or R-package Seurat [29]. The end result of this is typically a log-normalized scRNA-Seq data matrix. The log-transformation provides a natural regularization of the data, stabilizing the variance of highly expressed genes, and is strongly recommended for any down-stream analyses [27], especially in the context of SCENT. In addition, we must also take care of the smallest values in the
Fig. 1 (a) The estimation of differentiation potency using the signaling entropy rate requires two inputs: a user-defined gene functional network (e.g., a PPI network) which does not depend on biological context, and a normalized scRNA-Seq profile of a cell or sample, which thus provides the biological context. The latter profile is overlaid onto the network to define a cell-specific stochastic matrix P with entries p_ij. From this matrix, we can derive the invariant measure (steady-state probability) π_i, which satisfies πP = π, and finally the signaling entropy rate SR is obtained as a weighted average over local signaling entropies. This allows cells to be ordered according to their SR values, i.e., according to potency. SR can be quantified on a scale between 0 and 1 (not shown). (b) Transforming the normalized SR to the logit-scale and fitting Gaussian Mixture models allows the identification of potency states. (c) The distribution of these potency states across cell-types can be analyzed to identify novel cellular phenotypes that differ in potency. For instance, this strategy could be used to identify cells primed for differentiation within a multipotent or pluripotent cell population
3.1.2 Integration with User-Defined Gene Functional Network
Integration is achieved with the DoIntegPPI function, and consists of two steps:
1. The function takes as input two arguments, the normalized scRNA-Seq data matrix exp.m with rows labeling genes and columns labeling cells/samples, and a user-defined network ppiA.m with rows and columns labeling genes. The same gene identifier must be used for both expression and network matrices. The function finds the overlap between the gene identifiers specifying the network with those specifying the scRNA-Seq matrix, and then extracts the maximally connected subnetwork (see Note 3), specified by the adjMC output argument.
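The overlap and subnetwork extraction can be illustrated on a toy example in base R. In this sketch we interpret the maximally connected subnetwork as the largest connected component; the helper component_labels and the toy matrices are hypothetical, and the DoIntegPPI function in the SCENT package may differ in detail.

```r
# Toy sketch of the integration step: restrict to shared genes, then keep the
# largest connected component. Hypothetical data and helper names.
exp.m <- matrix(1:10, nrow = 5,
                dimnames = list(c("g1", "g2", "g3", "g4", "g6"),
                                c("cellA", "cellB")))
genes <- paste0("g", 1:5)
ppiA.m <- matrix(0, 5, 5, dimnames = list(genes, genes))
ppiA.m["g1", "g2"] <- ppiA.m["g2", "g1"] <- 1
ppiA.m["g2", "g3"] <- ppiA.m["g3", "g2"] <- 1
ppiA.m["g4", "g5"] <- ppiA.m["g5", "g4"] <- 1

# genes present in both the expression matrix and the network
common <- intersect(rownames(exp.m), rownames(ppiA.m))
adj <- ppiA.m[common, common]

# label connected components by breadth-first search
component_labels <- function(adj) {
  lab <- rep(NA_integer_, nrow(adj))
  current <- 0L
  for (start in seq_len(nrow(adj))) {
    if (!is.na(lab[start])) next
    current <- current + 1L
    lab[start] <- current
    queue <- start
    while (length(queue) > 0) {
      node <- queue[1]
      queue <- queue[-1]
      nb <- which(adj[node, ] > 0 & is.na(lab))  # unvisited neighbors
      lab[nb] <- current
      queue <- c(queue, nb)
    }
  }
  lab
}

lab <- component_labels(adj)
keep <- lab == as.integer(names(which.max(table(lab))))
adjMC <- adj[keep, keep]                          # largest connected component
expMC <- exp.m[rownames(adjMC), , drop = FALSE]   # matching expression rows
```

In this toy case g6 is dropped because it is absent from the network, g5 because it has no expression data, and g4 because it is left isolated once g5 is removed.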
3.1.3 Computation of the Signaling Entropy Rate
The output from DoIntegPPI is then used as input to the functions that compute the signaling entropy rate, denoted as SR. For a given fixed network, i.e., for the given adjMC matrix, there is a maximum possible SR value (denoted as maxSR), which is obtained for a particular edge-weight configuration [22]. It is thus very convenient to report the SR value of any given cell, normalized against this maximum possible value, which means that SR is then bounded between 0 and 1. The maximum entropy rate value can be calculated using the CompMaxSR function. This function takes as input the adjacency matrix, i.e., the adjMC output from the DoIntegPPI function, and returns the maxSR value as output.
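To give a feel for these quantities, the sketch below computes a signaling entropy rate on a toy four-gene network in base R. It rests on two assumptions that are not spelled out in this section: that the edge weights p_ij are proportional to A_ij x_j, where x is the expression vector of the cell, and that the maximum entropy rate equals the logarithm of the largest eigenvalue of the adjacency matrix. The code is illustrative only and is not the SCENT implementation.

```r
# Toy sketch of a normalized signaling entropy rate on a 4-gene network.
# Assumptions (ours): p_ij proportional to A_ij * x_j, and maxSR = log of the
# largest eigenvalue of A. This is not the SCENT package code.
A <- matrix(c(0, 1, 1, 1,
              1, 0, 1, 0,
              1, 1, 0, 0,
              1, 0, 0, 0), nrow = 4, byrow = TRUE)    # symmetric adjacency
x <- c(5, 3, 2, 1)                                    # toy expression values

W <- A * matrix(x, nrow = 4, ncol = 4, byrow = TRUE)  # W_ij = A_ij * x_j
P <- W / rowSums(W)                                   # row-stochastic matrix

# invariant measure: for symmetric A, pi_i is proportional to x_i * (A x)_i
pi_v <- x * as.vector(A %*% x)
pi_v <- pi_v / sum(pi_v)

# local entropies S_i = -sum_j p_ij log p_ij, with 0 log 0 taken as 0
local_S <- -rowSums(ifelse(P > 0, P * log(P), 0))
SR <- sum(pi_v * local_S)                             # weighted average

maxSR <- log(max(eigen(A, symmetric = TRUE)$values))  # assumed upper bound
SR_norm <- SR / maxSR                                 # lies between 0 and 1
```

The last gene has a single neighbor, so its local entropy is zero: all of its signaling flux is forced along one edge.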
Having obtained the normalization factor maxSR, we can then proceed to compute the SR value for any given cell/sample, using the CompSRana function. This function takes four objects as input:
1. The expression profile vector of the cell/sample (exp.v), which is a given column of the expMC output matrix from DoIntegPPI,
2. The network adjacency matrix (adj.m), i.e., the adjMC output from DoIntegPPI,
3. A logical parameter local to tell the function whether to report back the local, i.e., gene-centric, signaling entropies, and,
4. maxSR, the maximum entropy rate calculated earlier. Specifying maxSR = NULL will force the function to return non-normalized SR values.
Two notes on the above procedure: (a) the local gene-based entropies can be used in downstream analyses for ranking genes according to differential entropy, but only if appropriately normalized. For instance, they could be used to identify the main genes driving changes in the global signaling entropy rate of the network. However, if the user only wishes to estimate potency, specifying local = FALSE is fine, which will save some RAM on the output object; (b) by specifying the input object as an expression vector, the user can easily use the parallel R-package to compute the SR values for all cells/samples simultaneously. For this purpose, we also provide on the github website another function, CompSRanaPRL, which takes as input an index and the full scRNA-Seq data matrix in place of the expression profile of one cell/sample. One can then use the mclapply function in the parallel package to loop over the index values, which specify the columns (i.e., cells/samples) for which the SR values are desired. This will become clearer in the example given further below. The output of the
Estimating Differentiation Potency of Single Cells Using Single-Cell. . . 133
3.2 Real Dataset Test
To illustrate the procedure above, we now run in detail through the
example given in Table 1. As mentioned, we assume that the count
matrix has been normalized, say using a specific package (e.g.,
scater). The normalized count matrix we use can be downloaded
from http://github.com/aet21/SCENT/scChu.Rd and loaded into
your session using the load command:
load("sceChu.Rd");
load("hprdAsigH-13Jun12.Rd");
Fig. 2 Boxplot of signaling entropy rate values (SR, y-axis) against cell-type
(pluripotent hESCs vs non-pluripotent (NotPl), x-axis). P-value is from a
one-tailed Wilcoxon rank sum test. Number of cells in each cell-type category
is given in group labels
> scent.o$distPSPH
ordpotS.v
celltype 1 2
1 355 19
2 90 83
3 8 61
4 0 159
5 0 105
6 10 128
Fig. 3 Normalized mRNA expression values for 3 genes that mark hESCs primed for differentiation into
mesoderm (PECAM1) or ectoderm (SOX2 & HES1) lineages, against the predicted potency group of hESCs.
PS1 = pluripotent/higher potency state, PS2 = non-pluripotent/lower potency state. P-values are from a
one-tailed Wilcoxon rank sum test
4 Notes
5 Conclusions
References
5. Lang AH, Li H, Collins JJ, Mehta P (2014) Epigenetic landscapes explain partially reprogrammed cells and identify key reprogramming genes. PLoS Comput Biol 10:e1003734. https://doi.org/10.1371/journal.pcbi.1003734
6. Tirosh I, Venteicher AS, Hebert C et al (2016) Single-cell RNA-seq supports a developmental hierarchy in human oligodendroglioma. Nature 539:309–313. https://doi.org/10.1038/nature20123
7. Tirosh I, Izar B, Prakadan SM et al (2016) Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science 352:189–196. https://doi.org/10.1126/science.aad0501
8. Wang Z, Gerstein M, Snyder M (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 10:57–63. https://doi.org/10.1038/nrg2484
9. Grün D, van Oudenaarden A (2015) Design and analysis of single-cell sequencing experiments. Cell 163:799–810. https://doi.org/10.1016/j.cell.2015.10.039
10. Trapnell C, Cacchiarelli D, Grimsby J et al (2014) The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat Biotechnol 32:381–386. https://doi.org/10.1038/nbt.2859
11. Marco E, Karp RL, Guo G et al (2014) Bifurcation analysis of single-cell gene expression data reveals epigenetic landscape. Proc Natl Acad Sci U S A 111:E5643–E5650. https://doi.org/10.1073/pnas.1408993111
12. Setty M, Tadmor MD, Reich-Zeliger S et al (2016) Wishbone identifies bifurcating developmental trajectories from single-cell data. Nat Biotechnol 34:637–645. https://doi.org/10.1038/nbt.3569
13. Bendall SC, Davis KL, E-AD A et al (2014) Single-cell trajectory detection uncovers progression and regulatory coordination in human B cell development. Cell 157:714–725. https://doi.org/10.1016/j.cell.2014.04.005
14. Chen J, Schlitzer A, Chakarov S et al (2016) Mpath maps multi-branching single-cell trajectories revealing progenitor cell progression during development. Nat Commun 7:11988. https://doi.org/10.1038/ncomms11988
15. Qiu X, Mao Q, Tang Y et al (2017) Reversed graph embedding resolves complex single-cell trajectories. Nat Methods 14:979–982. https://doi.org/10.1038/nmeth.4402
16. Rizvi AH, Camara PG, Kandror EK et al (2017) Single-cell topological RNA-seq analysis reveals insights into cellular differentiation and development. Nat Biotechnol 35:551–560. https://doi.org/10.1038/nbt.3854
17. Haghverdi L, Büttner M, Wolf FA et al (2016) Diffusion pseudotime robustly reconstructs lineage branching. Nat Methods 13:845–848. https://doi.org/10.1038/nmeth.3971
18. Angerer P, Haghverdi L, Büttner M et al (2016) Destiny: diffusion maps for large-scale single-cell data in R. Bioinformatics 32:1241–1243. https://doi.org/10.1093/bioinformatics/btv715
19. Chu L-F, Leng N, Zhang J et al (2016) Single-cell RNA-seq reveals novel regulators of human embryonic stem cell differentiation to definitive endoderm. Genome Biol 17:2315. https://doi.org/10.1186/s13059-016-1033-x
20. Grün D, Muraro MJ, Boisset J-C et al (2016) De novo prediction of stem cell identity using single-cell Transcriptome data. Cell Stem Cell 19:266–277. https://doi.org/10.1016/j.stem.2016.05.010
21. Guo M, Bao EL, Wagner M et al (2017) SLICE: determining cell differentiation and lineage based on single cell entropy. Nucleic Acids Res 45:e54. https://doi.org/10.1093/nar/gkw1278
22. Teschendorff AE, Enver T (2017) Single-cell entropy for accurate estimation of differentiation potency from a cell’s transcriptome. Nat Commun 8:15599. https://doi.org/10.1038/ncomms15599
23. Gómez-Gardeñes J, Latora V (2008) Entropy rate of diffusion processes on complex networks. Phys Rev E Stat Nonlinear Soft Matter Phys 78:114. https://doi.org/10.1103/PhysRevE.78.065102
24. Banerji CRS, Miranda-Saavedra D, Severini S et al (2013) Cellular network entropy as the energy potential in Waddington’s differentiation landscape. Sci Rep 3:1129. https://doi.org/10.1038/srep03039
25. Teschendorff AE, Sollich P, Kuehn R (2014) Signalling entropy: a novel network-theoretical framework for systems analysis and interpretation of functional omic data. Methods 67:282–293. https://doi.org/10.1016/j.ymeth.2014.03.013
26. Banerji CRS, Severini S, Caldas C, Teschendorff AE (2015) Intra-tumour Signalling entropy determines clinical outcome in breast and lung cancer. PLoS Comput Biol 11:e1004115. https://doi.org/10.1371/journal.pcbi.1004115
27. Lun ATL, McCarthy DJ, Marioni JC (2016) A step-by-step workflow for low-level analysis of single-cell RNA-seq data with bioconductor. F1000Res 5:2122. https://doi.org/10.12688/f1000research.9501.2
28. McCarthy DJ, Campbell KR, Lun ATL, Wills QF (2017) Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R. Bioinformatics 247:btw777. https://doi.org/10.1093/bioinformatics/btw777
29. Butler A, Hoffman P, Smibert P et al (2018) Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat Biotechnol 36:411–420. https://doi.org/10.1038/nbt.4096
Chapter 10
Abstract
Single-cell RNA-Sequencing is a pioneering extension of bulk-based RNA-Sequencing technology. The
“guilt-by-association” heuristic has led to the use of gene co-expression networks to identify genes that are
believed to be associated with a common cellular function. Many methods that were developed for bulk-
based RNA-Sequencing data can continue to be applied to single-cell data, and several of the most widely
used methods are explored. Several methods for leveraging the novel time information contained in single-
cell data when constructing gene co-expression networks, which allows for the incorporation of directed
associations, are also discussed.
Key words Gene co-expression network, Gene regulatory network, Single-cell RNA-Seq, Correlation
coefficient, Count data, Directed network, Pseudotime
1 Introduction
Guo-Cheng Yuan (ed.), Computational Methods for Single-Cell Data Analysis, Methods in Molecular Biology, vol. 1935,
https://doi.org/10.1007/978-1-4939-9057-3_10, © Springer Science+Business Media, LLC, part of Springer Nature 2019
142 Alicia T. Lamere and Jun Li
[Figure 1: (A) network with nodes Gene A, Gene B, and Gene C; (B) expression of Gene A and Gene B plotted against pseudotime]
Fig. 1 (a) Example GCN. Here, because there exists an edge between the nodes representing Gene A and
Gene B, these two genes show evidence of co-expression. Meanwhile, no edge exists between Gene B and
Gene C, so there is no evidence of co-expression for this pair of genes. (b) Example of two gene expressions
that exhibit a regulatory relationship when ordered by pseudotime. If we simply considered their correlation,
the pair appear unrelated. However, if we correlate the lagged expression, looking at Gene A’s expression
from time 0 and Gene B’s expression from time l, then they exhibit a strong positive correlation
2 Materials
3 Methods
3.2 Identifying Highly Expressed Genes
After normalization and before constructing any GCNs, it is usually
important to filter your dataset to only those genes showing
reasonably high average expression and range of expression. This
step is particularly important for scRNA-Seq data, which have a
large number of drop-out events. By filtering to keep genes with
not only moderate/high expression but also a large range of
expression, we can be more confident that the edges identified in
our network are not simply the result of noise in the dataset. One
method for identifying these genes is to:
Find the average and inter-quartile range of the normalized
expression values for each gene. In R, this can easily be done using
base functions:
>iqr_vals = apply(norm_data, 1, IQR)
>mean_val = apply(norm_data, 1, mean)
>iqr_scale = iqr_vals/mean_val
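Once the per-gene mean and scaled inter-quartile range are available, the filter itself is a logical subset. The sketch below is only one possible way to apply it: the simulated norm_data matrix, the quantile-based cutoffs, and the name filtered_norm are assumptions for illustration, not values recommended by the text.

```r
set.seed(1)
# Hypothetical normalized expression matrix: 20 genes x 10 cells.
norm_data <- matrix(rexp(200, rate = 0.1), nrow = 20,
                    dimnames = list(paste0("gene", 1:20),
                                    paste0("cell", 1:10)))

iqr_vals  <- apply(norm_data, 1, IQR)    # per-gene inter-quartile range
mean_val  <- apply(norm_data, 1, mean)   # per-gene average expression
iqr_scale <- iqr_vals / mean_val         # IQR relative to the mean

# Assumed cutoffs: upper half by mean, upper three quarters by scaled IQR.
mean_cutoff <- quantile(mean_val, 0.50)
iqr_cutoff  <- quantile(iqr_scale, 0.25)

keep <- mean_val >= mean_cutoff & iqr_scale >= iqr_cutoff
filtered_norm <- norm_data[keep, , drop = FALSE]
dim(filtered_norm)
```

In practice the cutoffs would be chosen by inspecting the distributions of mean_val and iqr_scale for the dataset at hand.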
>enableWGCNAThreads()
3. Based on this plot, choose the lowest power for which the
signed R square curve flattens out upon reaching a high
value. Let us say that this value is 6, we can then construct
our network with the code:
>d = colMeans(filter_data)
>depth = exp(log(d) - mean(log(d)))
3.4.1 Directed GCNs Through Pseudotime for scRNA-Seq Data
The network construction methods explored above only describe
co-expression of genes. They were originally designed for use on
either microarray or bulk-based RNA-Seq data, and therefore do
not leverage any of the additional information available through
Inference of GCNs from scRNA-Seq 147
3.4.2 Estimating Pseudotime
Key to any time-based inference on gene expression data is the
method used to estimate the time information. For scRNA-Seq,
several algorithms have been developed for the estimation of these
time points, called “pseudotime.” In general, pseudotime analysis
of scRNA-Seq data involves dimension reduction methods to deal
with the high dimensionality that results from the thousands of
gene expression levels measured in each sample. There are many
pseudotime algorithms available now (see Note 6). We will use the
method Monocle because of its ease of use and implementation as an
R package. The authors of Monocle [8] recognize that by projecting
the data to a lower dimensional space, natural clustering of cells
can occur and this clustering can capture cells at different time
points. Their algorithm works by first representing each cell’s
expression as a point in a Euclidean space with one dimension for
each gene included in the sample. Then this high-dimensional space
is reduced using independent component analysis (ICA), which, as
its name implies, projects the gene expression profiles into a
lower-dimensional space that best distinguishes the independent
components (in our case, cells). The algorithm then constructs a
minimum spanning tree on the cells in this lower-dimensional space.
This tree is the set of edges that connects all cells with the
smallest total edge length and without forming any cycles. Finally,
cells’ positions in the minimum spanning tree are used to assign
“pseudotime” values (see Note 7). Monocle does not require the
scRNA-Seq data to be normalized beforehand. Instead, it handles all
normalization internally. As the first step in our network
construction, we will use Monocle to generate the pseudotime for
each cell in the scRNA-Seq dataset using genes that are known to be
associated with cell cycle.
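To make the minimum spanning tree step more concrete, the following base-R sketch builds such a tree with Prim's algorithm on four toy cells. It is not Monocle's code and it stops before any pseudotime assignment; the coordinates, the cell names, and the choice of Prim's algorithm are assumptions made for illustration.

```r
# Toy low-dimensional coordinates for four cells (hypothetical values).
coords <- matrix(c(0, 1, 3, 6), ncol = 1,
                 dimnames = list(paste0("cell", 1:4), "dim1"))
d <- as.matrix(dist(coords))   # pairwise Euclidean distances

# Prim's algorithm: start from one cell and repeatedly add the shortest
# edge linking a cell already in the tree to a cell not yet in it.
n <- nrow(d)
in_tree <- c(TRUE, rep(FALSE, n - 1))
names(in_tree) <- rownames(d)
from <- to <- character(n - 1)
len <- numeric(n - 1)
for (i in seq_len(n - 1)) {
  cand <- d[in_tree, !in_tree, drop = FALSE]          # candidate edges
  idx <- which(cand == min(cand), arr.ind = TRUE)[1, ]
  from[i] <- rownames(cand)[idx[1]]
  to[i] <- colnames(cand)[idx[2]]
  len[i] <- cand[idx[1], idx[2]]
  in_tree[to[i]] <- TRUE
}
mst <- data.frame(from = from, to = to, length = len)
mst
```

The tree always has one edge fewer than the number of cells, and no edge can be removed without disconnecting a cell.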
1. Let filter_data be data that have not been normalized, but
have been filtered to keep the most highly expressed genes.
Create a data set object for monocle to use:
>mon_data= newCellDataSet(as.matrix(filter_data))
3.4.3 LEAP Algorithm for Estimating Co-expression
LEAP is a method created for direct use on scRNA-Seq data to
construct directed GCNs. Borrowing from time-series analysis,
LEAP sorts cells according to the estimated cell-cycle-based
pseudotime, creating a “pseudotime-series,” and then computes the
maximum correlation over all possible time lags [23]. This
maximum correlation is used as the statistic to replace the traditional
Pearson’s correlation coefficient for constructing the gene
co-expression network, and the statistical significance of this statis-
tic is measured by the false discovery rate (FDR) calculated using
permutations. LEAP is implemented as an R package, and the
general steps for generating a GCN are:
1. Sort the cells in your scRNA-Seq dataset according to the
pseudotime you calculated using Monocle.
2. In order to apply LEAP, the scRNA-Seq expression counts
must be normalized using any of the methods described
above. Usually, TMM is recommended.
3. Apply the MAC_counter() function to your normalized dataset
to generate the correlation matrix that the GCN will be based
on. By maximizing over all possible time lags, the correlations
found by LEAP can often be larger than a traditional Pearson’s
correlation coefficient (see Note 8). The following is the R code
to estimate the directed GCN:
>MAC_results = MAC_counter(data=filter_data,
max_lag_prop=1/3, MAC_cutoff=0.2, lag_matrix=T)
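The idea of maximizing a correlation over time lags, as in step 3 above, can be sketched in base R without the LEAP package. The example below is not the MAC_counter implementation: the simulated genes, the helper lagged_cor, and the use of the absolute value of the correlation are assumptions made for illustration. Only the cap on the lag at one third of the cells mirrors the max_lag_prop=1/3 argument shown above.

```r
set.seed(42)
n_cells <- 120
time_idx <- seq_len(n_cells)          # cells already sorted by pseudotime
true_lag <- 15                        # hypothetical delay of gene B

geneA <- sin(2 * pi * time_idx / 60) + rnorm(n_cells, sd = 0.1)
geneB <- sin(2 * pi * (time_idx - true_lag) / 60) + rnorm(n_cells, sd = 0.1)

# Correlate gene A at time t with gene B at time t + lag.
lagged_cor <- function(x, y, lag) {
  n <- length(x)
  cor(x[1:(n - lag)], y[(1 + lag):n])
}

max_lag <- floor(n_cells / 3)
lags <- 0:max_lag
cors <- sapply(lags, function(l) lagged_cor(geneA, geneB, l))

cors[1]                               # unlagged Pearson correlation
best_lag <- lags[which.max(abs(cors))]
best_lag
max(abs(cors))
```

Here the unlagged correlation is close to zero even though gene B simply follows gene A with a delay, which is the situation in which a lag-based statistic is more informative than Pearson's coefficient.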
4 Notes
between pairs and groups of genes. The first three steps for the
ODE framework described above can be implemented to dis-
cern a computationally feasible subset of genes to work with.
Although not available as an R package, Julia code implement-
ing PIDC is available through the authors. A third method is
designed for experiments collected at specific time points.
These time points may also contain important information for
constructing GCNs. The algorithm SINCERITIES [34] uses
ridge regression and partial correlation analysis to directly
incorporate temporal changes in expression that are observed
through these time points. Though not implemented as an R
package, the R and Matlab code are both available through the
authors.
6. Wanderlust. Another popular pseudotime estimation method
is Wanderlust. Instead of using ICA, Wanderlust takes the
high-dimensional data and transforms it into a nearest-neighbor
graph, meaning that cells with similar expression profiles
will be connected [12]. It then repeatedly identifies the
shortest path and takes the average, using a cell’s placement
along this average path to determine its pseudotime.
7. Estimating Pseudotime with Monocle: Monocle’s demonstrated
effectiveness and ease of implementation make it a convenient
choice. However, when using Monocle it is important to
pay attention to the “states” in which cells are classified and
note that pseudotimes from one state do not correspond to the
times in other states. As a result, each state should be treated
separately. In practice, there usually is a state that most cells are
captured by and hence analysis may be restricted to those cells.
8. Choosing window size for LEAP: Note that the default for this
function is a maximum window size of two-thirds of the num-
ber of cells present in the dataset. Deviating significantly from
this size may result in correlations that are artifacts of noise in
the expression profiles rather than capturing true biological
effects.
References
Abstract
Allele-specific expression is traditionally studied by bulk RNA sequencing, which measures average
gene expression across cells. Single-cell RNA sequencing (scRNA-seq) allows the comparison of expression
distribution between the two alleles of a diploid organism, and characterization of allele-specific bursting.
Here we describe SCALE, a bioinformatic and statistical framework for allele-specific gene expression
analysis by scRNA-seq. SCALE estimates genome-wide bursting kinetics at the allelic level while accounting
for technical bias and other complicating factors such as cell size. SCALE detects genes with significantly
different bursting kinetics between the two alleles, as well as genes where the two alleles exhibit non-
independent bursting processes. Here, we illustrate SCALE on a mouse blastocyst single-cell dataset with
step-by-step demonstration from the upstream bioinformatic processing to the downstream biological
interpretation of SCALE’s output.
Key words Single-cell RNA sequencing, Allele-specific expression, Transcriptional bursting, Techni-
cal variability
1 Introduction
Guo-Cheng Yuan (ed.), Computational Methods for Single-Cell Data Analysis, Methods in Molecular Biology, vol. 1935,
https://doi.org/10.1007/978-1-4939-9057-3_11, © Springer Science+Business Media, LLC, part of Springer Nature 2019
156 Meichen Dong and Yuchao Jiang
2 Materials
2.1 Data Input
The data input for SCALE includes raw sequencing files from
scRNA-seq studies (as fastq), as well as the corresponding genome
assembly (as fasta). Based on the gene body coverage of
the sequenced reads, scRNA-seq can be classified into two
categories: full-transcript method (e.g., Smart-seq [11] and Smart-seq2
[12]) and tag method (e.g., Drop-seq [13] and the 10X Genomics
Chromium Single Cell 3′ Solution [14]). To study ASE at germline
heterozygous loci, full-transcript scRNA-seq protocol such as
Smart-seq2 is recommended due to its broad coverage. To account
for biases that are introduced during the library preparation and
sequencing step, SCALE by default relies on external spike-ins,
whose known concentration is used as ground truth for adjustment
[15]. When spike-ins are not readily available, imputation-based
methods such as SAVER [16] and scImpute [17] can be adopted to
recover true underlying expression distribution.
2.3 Software Packages
For read alignment, BWA [18] and STAR [19] are required for
DNA and RNA sequencing, respectively. Picard Tools
(http://broadinstitute.github.io/picard) and SAMtools [20] are
required for deduplication and quality controls. The Genome
Analysis Toolkit (GATK) [21] is adopted in our proposed pipeline by
default to identify germline heterozygous loci. The R package
SCALE, as well as its dependencies tsne
(https://cran.r-project.org/package=tsne) and rje
(https://cran.r-project.org/package=rje), is required for
performing ASE analysis in R.
3 Methods
3.1 Bioinformatics Pipeline
For endogenous RNAs, germline heterozygous loci are called by
bulk DNA- or RNA-seq of the same cells following the best
practices for GATK [21]. If bulk sequencing from the same tissue is
not available, scRNA-seq data can be aggregated to generate a
pseudo-bulk RNA-seq sample. STAR [19] and BWA [18] are used for read
alignment for RNA-seq and DNA-seq, respectively, followed by a
deduplication and quality control procedure. We then force call
single-cell allele-specific read counts using the mpileup command
by SAMtools [20]. These allele-specific read counts are further
used as input to SCALE.
For external spike-ins, SCALE takes as input the true concen-
trations and lengths of the spike-in molecules, as well as the depth
of coverage for each spike-in sequence across cells. The true con-
centration of each spike-in molecule is calculated according to the
known concentration (denoted as μ attomoles/μL) and the dilu-
tion factor (e.g., 40,000):
[Figure 1 diagram labels: GATK; Germline heterozygous loci (vcf); mpileup; depth; Bursty; Constitutive; Allele-specific transcriptional kinetics (Allele A, Allele B); Mean; Allelic imbalance; Hypothesis testing; Differential allele-specific bursting frequency; Coordinated bursting]
Fig. 1 Overview of analysis pipeline of SCALE. SCALE takes as input allele-specific read counts at germline
heterozygous loci and carries out three major steps: gene classification, estimation of allele-specific bursting
kinetics, and hypothesis testing of differential and nonindependent allelic bursting
#################################################################
# 2. Profile allele-specific read counts by scRNA-seq
#################################################################
# 2.1. Get splice junction database:
wget http://labshare.cshl.edu/shares/gingeraslab/www-data/dobin/STAR/STARgenomes/GENCODE/Old/gencode.v14.annotation.gtf.sjdb
# 2.2. Generate the genome using STAR, 100bp paired-end sequencing
genomeDir=directory_to_genome
STAR --runMode genomeGenerate --genomeDir $genomeDir \
  --genomeFastaFiles hg19.fa \
  --sjdbFileChrStartEnd gencode.v14.annotation.gtf.sjdb \
  --sjdbOverhang 99 --runThreadN 4
# 2.3. Align reads
genomeDir=directory_to_genome
STAR --genomeDir $genomeDir --readFilesIn samp_1.fastq samp_2.fastq \
  --outFilterIntronMotifs RemoveNoncanonicalUnannotated \
  --outFileNamePrefix samp_ --runThreadN 4
# 2.4. Convert sam to bam, filter, and sort
samtools view -bS samp_Aligned.out.sam > samp_Aligned.out.bam
perl filter_sam_v2.pl samp_Aligned.out.bam \
  samp_Aligned.out.filtered.sam
samtools view -bS samp_Aligned.out.filtered.sam > \
  samp_Aligned.out.filtered.bam
java -Xmx30G -jar SortSam.jar INPUT=samp_Aligned.out.filtered.bam \
  OUTPUT=samp_Aligned.out.filtered.sorted.bam SORT_ORDER=coordinate
# 2.5. Add read group and index
java -Xmx30G -jar AddOrReplaceReadGroups.jar \
  INPUT=samp_Aligned.out.filtered.sorted.bam \
  OUTPUT=samp_Aligned.out.filtered.sorted.rg.bam RGID=LANE2 \
  RGLB=LIB2 RGPL=ILLUMINA RGPU=UNIT2 RGSM=samp
samtools index samp_Aligned.out.filtered.sorted.rg.bam
# 2.6. Parse file: position.txt contains all the heterozygous
# loci (chr + coordinate) returned by GATK HaplotypeCaller using
# bulk DNA-seq
samtools mpileup -E -f hg19.fa -d 1000000 --position position.txt \
  samp_Aligned.out.filtered.sorted.rg.bam > \
  samp_Aligned.out.filtered.sorted.rg.mpileup
perl pileup2base_no_strand.pl \
  samp_Aligned.out.filtered.sorted.rg.mpileup 30 \
  samp_Aligned.out.filtered.sorted.rg.parse30.txt
#################################################################
# 3. Generate input for exogenous spike-ins
#################################################################
# 3.1. Concatenate ERCC with genome (hg19) and index
cat ERCC92.fa hg19.fa > hg19_ERCC.fa
java -jar CreateSequenceDictionary.jar R=hg19_ERCC.fa O=hg19_ERCC.dict
samtools faidx hg19_ERCC.fa
# 3.2. Generate the genome using STAR, 50bp paired-end sequencing
genomeDir=directory_to_ERCC_genome
STAR --runMode genomeGenerate --genomeDir $genomeDir \
  --genomeFastaFiles hg19_ERCC.fa \
  --sjdbFileChrStartEnd gencode.v14.annotation.gtf.sjdb \
  --sjdbOverhang 99 --runThreadN 4
# 3.3. Align reads
genomeDir=directory_to_ERCC_genome
STAR --genomeDir $genomeDir --readFilesIn samp_1.fastq samp_2.fastq \
  --outFilterIntronMotifs RemoveNoncanonicalUnannotated \
  --outFileNamePrefix samp_ --runThreadN 4
# 3.4. Convert sam to bam, filter, and sort
samtools view -bS samp_Aligned.out.sam > samp_Aligned.out.bam
perl filter_sam_v2.pl samp_Aligned.out.bam \
  samp_Aligned.out.filtered.sam
samtools view -bS samp_Aligned.out.filtered.sam > \
  samp_Aligned.out.filtered.bam
java -Xmx30G -jar SortSam.jar INPUT=samp_Aligned.out.filtered.bam \
  OUTPUT=samp_Aligned.out.filtered.sorted.bam SORT_ORDER=coordinate
# 3.5. Add read group and index
java -Xmx30G -jar AddOrReplaceReadGroups.jar \
  INPUT=samp_Aligned.out.filtered.sorted.bam \
  OUTPUT=samp_Aligned.out.filtered.sorted.rg.bam RGID=LANE2 \
  RGLB=LIB2 RGPL=ILLUMINA RGPU=UNIT2 RGSM=samp
samtools index samp_Aligned.out.filtered.sorted.rg.bam
# 3.6. Get total read counts as well as read counts for ERCC
samtools view -c samp_Aligned.out.filtered.sorted.rg.bam
while read ercc; do
  echo $ercc
  while read bam; do samtools view -c $bam $ercc; done < rg.bam.list | cat > $ercc.txt
done < ercc.id
3.2 Installation and Data Input
The R package for SCALE can be installed directly from GitHub.
The analysis by SCALE requires scRNA-seq data of cells from a
homogeneous cell population (i.e., from the same cell type and the
same tissue). Each allele at heterozygous loci should have an
expression matrix with rows as genes and columns as cells. In
addition to the read count matrices for the endogenous RNAs, an
input matrix for the spike-ins is needed for capturing technical
variability. This matrix should have rows as spike-ins, the first
column as the true number of molecules, the second column as the
lengths of the molecules, and the third column onward as the
observed read counts across cells. Note that spike-ins are not
required for each individual cell (see Note 1). Here, we
demonstrate the analysis
Single-Cell Allele-Specific Gene Expression Analysis 163
3.3 Quality Control
Quality control (QC) procedures are recommended to filter out
both poor-quality cells and extreme genes before applying SCALE.
Cell QC metrics may include library size factor, which can be
calculated by the following definition:
$$\eta_c = \operatorname*{median}_{g} \frac{Q^{A}_{cg} + Q^{B}_{cg}}{\left[\prod_{c^{*}=1}^{C}\left(Q^{A}_{c^{*}g} + Q^{B}_{c^{*}g}\right)\right]^{1/C}},$$
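The library size factor defined above (the per-cell median, over genes, of the total allelic count divided by its geometric mean across cells) can be computed in a few lines of base R. This is a minimal sketch rather than SCALE's own code: the function name library_size_factor, the toy matrices, and the decision to use only genes with non-zero totals in every cell (so that the geometric mean is positive) are assumptions.

```r
# alleleA, alleleB: genes x cells matrices of allele-specific read counts.
library_size_factor <- function(alleleA, alleleB) {
  total <- alleleA + alleleB
  # Assumption: keep genes observed in every cell, otherwise the
  # geometric mean across cells is zero and the ratio is undefined.
  keep <- apply(total > 0, 1, all)
  log_total <- log(total[keep, , drop = FALSE])
  log_geo_mean <- rowMeans(log_total)        # log geometric mean per gene
  # Median over genes of the log-ratio, back-transformed, for each cell.
  apply(log_total - log_geo_mean, 2, function(x) exp(median(x)))
}

# Toy example: the second cell has twice the depth of the first.
alleleA <- matrix(c(1, 2, 3,
                    2, 4, 6), nrow = 3)
alleleB <- alleleA
eta <- library_size_factor(alleleA, alleleB)
eta
```

Because the second toy cell has exactly twice the counts of the first, the two factors differ by a factor of two and their product is one.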
[Figure 2: two panels. x-axes: Log(true number of molecules) and Log(true expression). Panel annotations: log(α) = −3.089 and κ = −6.304]
Fig. 2 Modeling of technical variability and parameter estimation. Amplification and sequencing bias are
modeled and captured by parameter α and β. Estimation is carried out by log-linear regression. Probability of
dropout is modeled by κ and τ and depends on the logarithm of the true expression. Estimation is carried out
by the Nelder-Mead simplex algorithm
3.5 Gene Classification
SCALE adopts an empirical Bayes method that categorizes each
gene as silent, monoallelically expressed, or biallelically
expressed based on its ASE across cells. An expectation
maximization algorithm is implemented for fast estimation of the
corresponding parameters. The result returned by the gene_classify
function is a list of four elements: gene category, proportion of cells
expressing allele A, proportion of cells expressing allele B, and
posterior assignment of cells for each gene. For the posterior
assignment for each gene in each cell, “A” corresponds to cells
expressing the A allele only, “B” corresponds to cells expressing
the B allele only, “AB” corresponds to cells expressing both
alleles, and “Off” corresponds to cells that are silent for the
gene of interest.
# Gene classification
gene.class.obj <- gene_classify(alleleA = as.matrix(alleleA),
alleleB = as.matrix(alleleB))
gene.category <- gene.class.obj$gene.category
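To show what the four per-cell labels mean, the toy example below assigns them by simply checking which allele has a non-zero count. This is deliberately not SCALE's empirical Bayes classification, which also accounts for technical dropout; the thresholding rule, the toy counts, and the variable names toyA, toyB, status, prop_A, and prop_B are assumptions for illustration only.

```r
# Toy allele-specific counts for one gene in four cells (hypothetical).
cells <- paste0("cell", 1:4)
toyA <- matrix(c(5, 0, 3, 0), nrow = 1, dimnames = list("gene1", cells))
toyB <- matrix(c(4, 0, 0, 7), nrow = 1, dimnames = list("gene1", cells))

# Naive labelling: an allele counts as expressed if it has any reads.
status <- ifelse(toyA > 0 & toyB > 0, "AB",
          ifelse(toyA > 0, "A",
          ifelse(toyB > 0, "B", "Off")))
status

# Proportion of cells in which each allele is detected.
prop_A <- rowMeans(toyA > 0)
prop_B <- rowMeans(toyB > 0)
```

A zero count in real data may reflect dropout rather than a silent allele, which is why a model-based posterior assignment is used instead of a hard threshold.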
3.6 Allele-Specific Bursting Kinetics
When studying ASE in single cells, it is critical to consider
transcriptional bursting due to its pervasiveness in various organisms
[27–30]. A two-state kinetic model has been proposed for gene
transcription, where genes switch between ON and OFF states with
activation rate kon and deactivation rate koff. When a gene is in the
ON state, DNA is transcribed to RNA at rate s, while RNA decays at
rate d. A Poisson–Beta stochastic model for transcriptional bursting
was proposed by Kepler and Elston [31]:
$$Y \sim \mathrm{Poisson}(sp), \qquad p \sim \mathrm{Beta}(k_{on}, k_{off}),$$
where Y is the number of RNA molecules and p is the fraction of
time that the gene spends in the active state. Note that the decay
rate d is set to 1 since only the stationary distribution is observed
[32]. This Poisson–Beta model is easy to fit mathematically, with its
parameters corresponding to biologically meaningful quantities:
burst size as s/koff and burst frequency as kon.
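The Poisson–Beta model can be simulated directly in base R, which gives a feel for how the three parameters shape the count distribution. The parameter values and the number of simulated cells below are arbitrary assumptions chosen for illustration.

```r
set.seed(123)
kon  <- 2      # activation rate (burst frequency), hypothetical value
koff <- 3      # deactivation rate, hypothetical value
s    <- 50     # transcription rate in the ON state, hypothetical value
n_cells <- 100000

# p: fraction of time the gene spends in the active state, per cell.
p <- rbeta(n_cells, shape1 = kon, shape2 = koff)
# Y: number of RNA molecules, Poisson with mean s * p.
Y <- rpois(n_cells, lambda = s * p)

mean(Y)                        # empirical mean of the simulated counts
s * kon / (kon + koff)         # theoretical mean, s * E[p]

burst_size <- s / koff
burst_freq <- kon
```

The empirical mean should be close to s * kon / (kon + koff), since the mean of a Beta(kon, koff) variable is kon / (kon + koff).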
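After gene classification, SCALE proceeds to infer allele-specific
bursting parameters for biallelic bursty genes (see Note 4) using a
hierarchical model:
$$Y^{A}_{cg} \sim \mathrm{Poisson}\left(\phi_c s^{A}_{g} p^{A}_{cg}\right), \qquad Y^{B}_{cg} \sim \mathrm{Poisson}\left(\phi_c s^{B}_{g} p^{B}_{cg}\right),$$
$$p^{A}_{cg} \sim \mathrm{Beta}\left(k^{A}_{on,g}, k^{A}_{off,g}\right), \qquad p^{B}_{cg} \sim \mathrm{Beta}\left(k^{B}_{on,g}, k^{B}_{off,g}\right),$$
where $Y^{A}_{cg}$ and $Y^{B}_{cg}$ are the true ASE values for gene g in cell c. Note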
that the two Poisson–Beta distributions have gene- and allele-
specific bursting parameters and share the same cell-size factor,
which has been shown to affect burst size [33]. When spike-ins
are available, cell size can be estimated by the ratio of the total
number of endogenous RNA reads over the total number of spike-
in reads [34]. Moreover, users can input the cell size factors ϕc if
they are experimentally measured (see Note 5 for details).
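The spike-in based estimate of cell size mentioned above reduces to a ratio of column sums. The sketch below uses toy matrices and hypothetical names (endo, spike, cell_size); any further rescaling of the resulting factors is left out because the text does not specify one.

```r
cells <- c("cell1", "cell2")
# Toy read counts (hypothetical): rows are genes or spike-ins, columns cells.
endo <- matrix(c(100, 200, 300,
                  50, 100, 150), nrow = 3,
               dimnames = list(paste0("gene", 1:3), cells))
spike <- matrix(c(10, 20,
                  10, 20), nrow = 2,
                dimnames = list(paste0("spike", 1:2), cells))

# Ratio of total endogenous reads to total spike-in reads, per cell.
cell_size <- colSums(endo) / colSums(spike)
cell_size
```

With equal spike-in totals in both toy cells, the first cell, which has twice the endogenous reads, receives twice the size estimate.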
Since $Y^{A}_{cg}$ and $Y^{B}_{cg}$ are not directly observable while the observed
ASE levels $Q^{A}_{cg}$ and $Q^{B}_{cg}$ are confounded with technical bias, we use a
novel “histogram-repiling” method to derive the distribution of $Y_{cg}$
from the observed distribution of $Q_{cg}$ for each gene. The allele-specific
parameters $\{s^{A}, s^{B}, k^{A}_{on}, k^{B}_{on}, k^{A}_{off}, k^{B}_{off}\}$ are then estimated
using the moment estimator methods.
For real dataset analysis, SCALE’s function allelic_kinetics
returns an object allelic.kinetics.obj, which contains the estimated
bursting parameters. A pdf plot is generated by default and is shown
in Fig. 3, where each dot corresponds to a gene and the genes off the
diagonal indicate differential bursting kinetics between the two
alleles.
[Figure 3: two scatter plots, log(konB) against log(konA) and log(sB/koffB) against log(sA/koffA); panel annotations, in the order printed: r = 0.864, r = 0.801]
3.7 Hypothesis Testing
Nonparametric hypothesis testing is carried out with the null
hypothesis that the two alleles share the same burst frequency and
burst size ($k^{A}_{on} = k^{B}_{on}$, $s^{A}/k^{A}_{off} = s^{B}/k^{B}_{off}$).
SCALE’s function diff_allelic_bursting has two “modes”: the “raw”
mode, where the bootstrap samples from the raw observed read counts,
and the “corrected” mode, where the bootstrap samples from the
adjusted allelic read counts. Two vectors of p-values will be
obtained for the tests of burst frequency and burst size, respectively.
3.8 Plot and Output
For each gene, a pdf plot can be generated with the estimated
parameters and testing results. As an example, Fig. 4 shows the
output by SCALE for gene Hvcn1, whose two alleles share similar
burst size and frequency but burst in a coordinated fashion with
nominal p-value less than 0.05.
# Generate a pdf output for a selected gene
i = which(genename == 'Hvcn1')
allelic_plot(alleleA = alleleA, alleleB = alleleB,
gene.class.obj = gene.class.obj,
allelic.kinetics.obj = allelic.kinetics.obj,
diff.allelic.obj = diff.allelic.obj,
non.ind.obj = non.ind.obj, i = i)
[Figure 4: bar plot for gene Hvcn1 (gene category: Biallelic.bursty); x-axis: Cell; y-axis: Adjusted reads, with A allele and B allele plotted in opposite directions]
Fig. 4 SCALE output for a bursty gene Hvcn1. Bar plot shows the adjusted allelic
coverage across cells. Number of cells expressing A allele, B allele, both alleles,
and neither allele is reported, together with the inferred allelic bursting kinetics.
P-values from testing of shared bursting kinetics and nonindependent bursting
are also returned
frequency but not burst size. Our findings provide evidence that
allelic differences in the expression of bursty genes are achieved
through differential modulation of burst frequency rather than
burst size. Previous studies have shown that the kinetic parameter
that varies the most (along the cell cycle [33], between different
genes [35], between different growth conditions [36], or under
regulation by a transcription factor [37]) is the probabilistic rate
of switching to the active state, kon, while the rates of gene
inactivation koff and of transcription s vary much less.
4 Notes
After this, one can try again using the in silico estimated cell size factors. The correlation between the bursting kinetics of the two alleles can serve as a sanity check for good data quality and accurate cell size estimation: on the genome-wide scale, the kinetic parameters of the two alleles should be clearly positively correlated.
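As an illustration of this sanity check, the Python sketch below computes the genome-wide Pearson correlation between the log-transformed estimates of one kinetic parameter for the two alleles. The function name and the choice to work on the log scale and to drop genes with non-positive or non-finite estimates are assumptions made for this example; they are not part of SCALE.

```python
import numpy as np

def allelic_kinetics_correlation(param_a, param_b):
    """Pearson correlation of log-scale kinetic estimates across genes.

    param_a, param_b: per-gene estimates (e.g., kon) for allele A and allele B.
    Genes with missing, non-finite, or non-positive estimates are skipped.
    """
    a = np.asarray(param_a, dtype=float)
    b = np.asarray(param_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("both alleles need one estimate per gene")
    keep = np.isfinite(a) & np.isfinite(b) & (a > 0) & (b > 0)
    if keep.sum() < 3:
        raise ValueError("too few genes with usable estimates")
    return float(np.corrcoef(np.log(a[keep]), np.log(b[keep]))[0, 1])
```

A coefficient close to zero or negative on the genome-wide scale would suggest revisiting the data quality or the cell size factors before interpreting gene-level results.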
Acknowledgment
References
1. Buckland PR (2004) Allele-specific gene expression differences in humans. Hum Mol Genet 13(2):R255–R260. https://doi.org/10.1093/hmg/ddh227
2. Deng Q, Ramskold D, Reinius B, Sandberg R (2014) Single-cell RNA-seq reveals dynamic, random monoallelic gene expression in mammalian cells. Science 343:193–196. https://doi.org/10.1126/science.1245316
3. Reinius B, Sandberg R (2015) Random monoallelic expression of autosomal genes: stochastic transcription and allele-level regulation. Nat Rev Genet 16:653–664. https://doi.org/10.1038/nrg3888
4. Reinius B, Mold JE, Ramskold D, Deng Q, Johnsson P, Michaelsson J, Frisen J, Sandberg R (2016) Analysis of allelic expression patterns in clonal somatic cells by single-cell RNA-seq. Nat Genet 48:1430–1435. https://doi.org/10.1038/ng.3678
5. Skelly DA, Johansson M, Madeoy J, Wakefield J, Akey JM (2011) A powerful and flexible statistical framework for testing hypotheses of allele-specific gene expression from RNA-seq data. Genome Res 21:1728–1737. https://doi.org/10.1101/gr.119784.110
6. Leon-Novelo LG, McIntyre LM, Fear JM, Graze RM (2014) A flexible Bayesian method for detecting allelic imbalance in RNA-seq data. BMC Genomics 15:920. https://doi.org/10.1186/1471-2164-15-920
7. Jiang Y, Zhang NR, Li M (2017) SCALE: modeling allele-specific gene expression by single-cell RNA sequencing. Genome Biol 18(1):74. https://doi.org/10.1186/s13059-017-1200-8
8. Benitez JA, Cheng S, Deng Q (2017) Revealing allele-specific gene expression by single-cell transcriptomics. Int J Biochem Cell Biol 90:155–160. https://doi.org/10.1016/j.biocel.2017.05.029
9. Kim JK, Marioni JC (2013) Inferring the kinetics of stochastic gene expression from single-cell RNA-sequencing data. Genome Biol 14:R7. https://doi.org/10.1186/gb-2013-14-1-r7
10. Levesque MJ, Ginart P, Wei Y, Raj A (2013) Visualizing SNVs to quantify allele-specific expression in single cells. Nat Methods 10:865–867. https://doi.org/10.1038/nmeth.2589
11. Goetz JJ, Trimarchi JM (2012) Transcriptome sequencing of single cells with Smart-Seq. Nat Biotechnol 30(8):763–765. https://doi.org/10.1038/nbt.2325
12. Picelli S, Faridani OR, Bjorklund AK, Winberg G, Sagasser S, Sandberg R (2014) Full-length RNA-seq from single cells using Smart-seq2. Nat Protoc 9(1):171–181. https://doi.org/10.1038/nprot.2014.006
13. Macosko EZ, Basu A, Satija R, Nemesh J, Shekhar K, Goldman M, Tirosh I, Bialas AR, Kamitaki N, Martersteck EM, Trombetta JJ, Weitz DA, Sanes JR, Shalek AK, Regev A, McCarroll SA (2015) Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell 161(5):1202–1214. https://doi.org/10.1016/j.cell.2015.05.002
14. Zheng GX, Terry JM, Belgrader P, Ryvkin P, Bent ZW, Wilson R, Ziraldo SB, Wheeler TD, McDermott GP, Zhu J, Gregory MT, Shuga J, Montesclaros L, Underwood JG, Masquelier DA, Nishimura SY, Schnall-Levin M, Wyatt PW, Hindson CM, Bharadwaj R, Wong A, Ness KD, Beppu LW, Deeg HJ, McFarland C, Loeb KR, Valente WJ, Ericson NG, Stevens EA, Radich JP, Mikkelsen TS, Hindson BJ, Bielas JH (2017) Massively parallel digital transcriptional profiling of single cells. Nat Commun 8:14049. https://doi.org/10.1038/ncomms14049
15. Jia C, Hu Y, Kelly D, Kim J, Li M, Zhang NR (2017) Accounting for technical noise in differential expression analysis of single-cell RNA sequencing data. Nucleic Acids Res 45(19):10978–10988. https://doi.org/10.1093/nar/gkx754
16. Huang M, Wang J, Torre E, Dueck H, Shaffer S, Bonasio R, Murray JI, Raj A, Li M, Zhang NR (2018) SAVER: gene expression recovery for single-cell RNA sequencing. Nat Methods 15(7):539–542. https://doi.org/10.1038/s41592-018-0033-z
17. Li WV, Li JJ (2018) An accurate and robust imputation method scImpute for single-cell RNA-seq data. Nat Commun 9(1):997. https://doi.org/10.1038/s41467-018-03405-7
18. Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25(14):1754–1760. https://doi.org/10.1093/bioinformatics/btp324
19. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR (2013) STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29(1):15–21. https://doi.org/10.1093/bioinformatics/bts635
20. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Genome Project Data Processing Subgroup (2009) The sequence alignment/map format and SAMtools. Bioinformatics 25(16):2078–2079. https://doi.org/10.1093/bioinformatics/btp352
21. DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell TJ, Kernytsky AM, Sivachenko AY, Cibulskis K, Gabriel SB, Altshuler D, Daly MJ (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43(5):491–498. https://doi.org/10.1038/ng.806
22. van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9:2579–2605
23. Pierson E, Yau C (2015) ZIFA: dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome Biol 16:241. https://doi.org/10.1186/s13059-015-0805-z
24. Kiselev VY, Kirschner K, Schaub MT, Andrews T, Yiu A, Chandra T, Natarajan KN, Reik W, Barahona M, Green AR, Hemberg M (2017) SC3: consensus clustering of single-cell RNA-seq data. Nat Methods 14(5):483–486. https://doi.org/10.1038/nmeth.4236
25. Wang B, Zhu J, Pierson E, Ramazzotti D, Batzoglou S (2017) Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning. Nat Methods 14(4):414–416. https://doi.org/10.1038/nmeth.4207
26. Tsoucas D, Yuan GC (2018) GiniClust2: a cluster-aware, weighted ensemble clustering method for cell-type detection. Genome Biol 19(1):58. https://doi.org/10.1186/s13059-018-1431-3
27. Chong S, Chen C, Ge H, Xie XS (2014) Mechanism of transcriptional bursting in bacteria. Cell 158:314–326. https://doi.org/10.1016/j.cell.2014.05.038
28. Blake WJ, Balazsi G, Kohanski MA, Isaacs FJ, Murphy KF, Kuang Y, Cantor CR, Walt DR, Collins JJ (2006) Phenotypic consequences of promoter-mediated transcriptional noise. Mol Cell 24:853–865. https://doi.org/10.1016/j.molcel.2006.11.003
29. Fukaya T, Lim B, Levine M (2016) Enhancer control of transcriptional bursting. Cell 166:358–368. https://doi.org/10.1016/j.cell.2016.05.025
30. Suter DM, Molina N, Gatfield D, Schneider K, Schibler U, Naef F (2011) Mammalian genes are transcribed with widely different bursting kinetics. Science 332:472–474. https://doi.org/10.1126/science.1198817
31. Kepler TB, Elston TC (2001) Stochasticity in transcriptional regulation: origins, consequences, and mathematical representations. Biophys J 81:3116–3136. https://doi.org/10.1016/s0006-3495(01)75949-8
32. Stegle O, Teichmann SA, Marioni JC (2015) Computational and analytical challenges in single-cell transcriptomics. Nat Rev Genet 16:133–145. https://doi.org/10.1038/nrg3833
33. Padovan-Merhar O, Nair GP, Biaesch AG, Mayer A, Scarfone S, Foley SW, Wu AR, Churchman LS, Singh A, Raj A (2015) Single mammalian cells compensate for differences in cellular volume and DNA copy number through independent global transcriptional mechanisms. Mol Cell 58:339–352. https://doi.org/10.1016/j.molcel.2015.03.005
34. Vallejos CA, Marioni JC, Richardson S (2015) BASiCS: Bayesian analysis of single-cell sequencing data. PLoS Comput Biol 11:e1004333. https://doi.org/10.1371/journal.pcbi.1004333
35. Skinner SO, Xu H, Nagarkar-Jaiswal S, Freire PR, Zwaka TP, Golding I (2016) Single-cell analysis of transcription kinetics across the cell cycle. Elife 5:e12175. https://doi.org/10.7554/eLife.12175
36. Ochiai H, Sugawara T, Sakuma T, Yamamoto T (2014) Stochastic promoter activation affects Nanog expression variability in mouse embryonic stem cells. Sci Rep 4:7125. https://doi.org/10.1038/srep07125
37. Xu H, Sepulveda LA, Figard L, Sokac AM, Golding I (2015) Combining protein and mRNA quantification to decipher transcriptional regulation. Nat Methods 12:739–742. https://doi.org/10.1038/nmeth.3446
Chapter 12
Abstract
Single-cell RNA-seq (scRNA-seq) provides a comprehensive measurement of stochasticity in transcription,
but the limitations of the technology have prevented its application to dissect variability in RNA processing
events such as splicing. In this chapter, we review the challenges in splicing isoform quantification in
scRNA-seq data and discuss BRIE (Bayesian regression for isoform estimation), a recently proposed
Bayesian hierarchical model which resolves these problems by learning an informative prior distribution
from sequence features. We illustrate the usage of BRIE with a case study on 130 mouse cells during
gastrulation.
Key words Alternative splicing, Isoform quantification, Single-cell RNA-seq, Bayesian model
1 Introduction
Guo-Cheng Yuan (ed.), Computational Methods for Single-Cell Data Analysis, Methods in Molecular Biology, vol. 1935,
https://doi.org/10.1007/978-1-4939-9057-3_12, © Springer Science+Business Media, LLC, part of Springer Nature 2019
176 Yuanhua Huang and Guido Sanguinetti
2 Materials
3 Methods
3.1 High-Level Model Description
Low coverage is a common problem in scRNA-seq data and poses a particular statistical challenge for splicing quantification: short RNA-seq reads often align to multiple isoforms and are therefore not immediately informative about splicing status, so accurate estimates normally require high coverage. Recent work has shown that improved predictions at lower coverage can be achieved with Bayesian methods by incorporating informative prior distributions within the probabilistic splicing quantification algorithms, leveraging either aspects of the experimental design, such as time series [21], or auxiliary data sets, such as measurements of PolII localization [22]. In addition, it has also been demonstrated that splicing (in bulk cells) can be accurately predicted from sequence-derived features [23]. This suggests that overall patterns of read distribution may be associated with specific sequence words, so that one may be able to construct sequence-based informative prior distributions that may be learned directly from data. This is the idea at the core of BRIE (Bayesian Regression for Isoform Estimation), a statistical model that achieves extremely high sensitivity at low coverage by the use of informative priors learned directly from data via a (latent) regression model. The regression model couples the task of splicing quantification across different genes, allowing a statistical transfer of information from well-covered genes to genes with lower coverage and thereby achieving considerable robustness to noise at low coverage.
Figure 1 presents a schematic illustration of BRIE (see Methods in the original paper for precise definitions and details of the estimation procedure). The bottom part of the figure represents the standard mixture model approach to isoform estimation introduced in MISO [24] and Cufflinks [25] and also used in many recent methods, e.g., Kallisto [26], where reads are associated with a latent, multinomially distributed isoform identity variable. This module takes as input the scRNA-seq data (aligned reads) and forms the likelihood of our Bayesian model. The multinomial identity variables are then assigned an informative prior in the form of a regression model (top half of Fig. 1), where the prior probability of inclusion ratios is regressed against sequence-derived features. Crucially, the regression parameters are shared across all genes and can be learned across multiple single cells, thus regularizing the task and enabling robust predictions in the face of very low coverage. While the class of regression models we employ is different from the neural networks of [23], they still provide a highly accurate supervised learning predictor of splicing on bulk RNA-seq data sets.
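The interplay between the regression prior and the mixture-model likelihood can be illustrated with a small grid-based Python sketch for a single exon-skipping event. It is a simplification under explicit assumptions and not BRIE's implementation: the prior is taken to be a normal distribution on the logit of the inclusion ratio with mean given by a linear function of the features, isoform length effects are ignored, the regression weights are treated as known rather than learned, and the names psi_posterior_grid, features, weights, and sigma are hypothetical.

```python
import numpy as np

def psi_posterior_grid(read_lik, features, weights, sigma=1.5, n_grid=999):
    """Grid posterior of the inclusion ratio psi for one splicing event.

    read_lik: array of shape (n_reads, 2); column 0 is P(read | inclusion isoform),
              column 1 is P(read | exclusion isoform). May have zero rows.
              Each read is assumed to be compatible with at least one isoform.
    features, weights: sequence features and regression weights (hypothetical inputs).
    Returns the grid of psi values and posterior weights that sum to 1.
    """
    psi = np.linspace(0.001, 0.999, n_grid)
    # Prior: logit(psi) ~ Normal(features . weights, sigma^2).
    # The density over psi includes the Jacobian 1 / (psi * (1 - psi)).
    mu = float(np.dot(features, weights))
    z = np.log(psi) - np.log1p(-psi)
    log_prior = -0.5 * ((z - mu) / sigma) ** 2 - np.log(psi) - np.log1p(-psi)
    # Likelihood: each read comes from a two-component isoform mixture.
    lik = np.asarray(read_lik, dtype=float).reshape(-1, 2)
    mix = psi[:, None] * lik[:, 0][None, :] + (1.0 - psi[:, None]) * lik[:, 1][None, :]
    log_lik = np.log(mix).sum(axis=1)      # equals 0 for every psi when there are no reads
    log_post = log_prior + log_lik
    post = np.exp(log_post - log_post.max())
    return psi, post / post.sum()
```

With no reads the posterior equals the feature-based prior, and as informative reads accumulate the likelihood dominates, which is the trade-off the model is designed to make.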
This architecture effectively enables BRIE to simultaneously
trade-off two tasks: in the absence of data (drop-out genes), the
Single-Cell RNA Splicing 179
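[Figure 1 schematic: exons C1, A, and C2; Bayesian regression on sequence features gives the prior P(ψ|W, X); mixture modeling of the RNA-seq reads gives the likelihood P(R|ψ); the two are combined into the posterior P(ψ|R, W, X)]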
Fig. 1 A cartoon of the BRIE method for isoform estimation. BRIE combines a likelihood computed from
RNA-seq data (bottom part) and an informative prior distribution learned from 735 sequence-derived features
(top)
3.2 Data Preparation with BRIE-Kit
To quantify exon-skipping splicing with BRIE, we need a set of good-quality annotated splicing events together with their corresponding sequence features. Besides using our preprocessed annotations for human and mouse, BRIE-kit, a separate Python package developed under the Python 2 environment, can prepare the annotation for more species through three functions: (1) briekit-event for extracting the exon-skipping events from the full gene annotation, (2) briekit-event-filter for filtering out poor-quality splicing events, and (3) briekit-factor for defining and extracting the sequence features. To perform these preprocessing steps on the mouse gastrulation data (stored in the DATA_DIR directory), the following command lines would be used:
1. Generate exon-skipping events from gene annotation.
$ briekit-event -a gencode.vM17.annotation.gtf.gz -o $DATA_DIR/AS_events
$ briekit-factor -a AS_events/SE.filtered.gff3.gz -r GRCm38.p6.genome.fa \
    -c mm10.60way.phastCons.bw -o mouse_features.csv -p 10 \
    --bigWigSummary ./bigWigSummary
3.3 Splicing Quantification Using BRIE
Once we have obtained a set of exon-skipping events and their corresponding sequence features, we can use BRIE to quantify their inclusion probabilities from scRNA-seq data. First, we need to download the raw scRNA-seq reads in fastq format and align the reads to the genome with HISAT [27], STAR [28], or another splice-aware aligner. Then, for each cell, there will be a sorted and indexed alignment file in bam/sam format, e.g., cell_n.sorted.bam.
Now, we can run BRIE to quantify the annotated exon-skipping events on the aligned reads files with the following command line,
between any pair of cells (or cell clusters), and the command line to run it is as follows,
Then brie-diff will output two tsv files. The first file, named in the format xxx.diff.tsv, contains all pairs of cells (or cell groups) that passed the Bayes factor threshold. The other one, named in the format xxx.diff.rank.tsv, ranks the splicing events by the number of differentially spliced cell pairs, which can be used to select splicing markers for cell type identity. The use of brie-diff to detect differential splicing events between two groups of cells is discussed in Note 1.
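The ranking logic described for the second output file can be sketched in a few lines of Python: count, for each splicing event, how many cell pairs pass a Bayes factor threshold, then sort events by that count. The record layout (event, cell A, cell B, Bayes factor), the function name, and the threshold of 10 are assumptions for illustration only and do not describe the actual column layout or defaults of brie-diff.

```python
from collections import Counter

def rank_events_by_pairs(records, bf_threshold=10.0):
    """Rank splicing events by the number of cell pairs passing a Bayes factor threshold.

    records: iterable of (event_id, cell_a, cell_b, bayes_factor) tuples
             (a hypothetical layout, not the brie-diff file format).
    Returns a list of (event_id, n_pairs), highest count first.
    """
    counts = Counter()
    for event_id, _cell_a, _cell_b, bayes_factor in records:
        if bayes_factor >= bf_threshold:
            counts[event_id] += 1
    # Sort by descending count, then by event id so ties are reproducible.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
```

Events near the top of such a ranking differ between many cell pairs and are therefore natural candidates for splicing markers.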
3.5 Plotting Results and Extracting Statistics from the BRIE Output
Once a set of highly variable splicing events has been detected, it is very useful to visualize the raw reads and the quantification results. Sashimi plots [29], originally developed by Yarden Katz and colleagues, are a visually effective way to display read densities and junction reads. In BRIE, we adapted the sashimi plot to visualize the results, including the read density and the prior and posterior distributions of the splicing fraction. Sashimi_plot is included in the BRIE-kit GitHub repository (https://github.com/huangyh09/briekit/tree/master/sashimi_plot) as a self-contained folder, which can be executed as follows,
SASHIMI=~/MyGit/briekit/sashimi_plot/sashimi_plot.py
python $SASHIMI --plot-event ENSMUSG00000027478.AS2 $GFF_DIR sashimi_setting.txt \
    --output-dir $PLOT_DIR --plot-label DNMT3B-exon2.pdf \
    --plot-title DNMT3B-exon2
4 Notes
Fig. 2 Visualization of splicing quantification with sashimi plot and histogram. An example exon-skipping event in DNMT3B in 3 mouse cells at 6.5 days and 3 cells at 7.75 days. The left panel is a sashimi plot of the read density and the number of junction reads. The right panel shows the prior distribution as a blue curve and a histogram of the posterior distribution in black, both learned by BRIE. For the histogram, the red line is the mean and the dashed lines mark the 95% confidence interval
References
1. Grün D, van Oudenaarden A (2015) Design and analysis of single-cell sequencing experiments. Cell 163:799–810
2. Grün D, Lyubimova A, Kester L et al (2015) Single-cell messenger RNA sequencing reveals rare intestinal cell types. Nature 525:251–255
3. Gaublomme JT, Yosef N, Lee Y et al (2015) Single-cell genomics unveils critical regulators of Th17 cell pathogenicity. Cell 163:1400–1412
4. Papalexi E, Satija R (2018) Single-cell RNA sequencing to explore immune cell heterogeneity. Nat Rev Immunol 18:35
5. Scialdone A, Tanaka Y, Jawaid W et al (2016) Resolving early mesoderm diversification through single-cell expression profiling. Nature 535:289–293. https://doi.org/10.1038/nature18633
6. Wagner DE, Weinreb C, Collins ZM et al (2018) Single-cell mapping of gene expression landscapes and lineage in the zebrafish embryo. Science 360:981–987
7. Stubbington MJT, Lönnberg T, Proserpio V et al (2016) T cell fate and clonality inference from single-cell transcriptomes. Nat Methods 13:329
8. Lönnberg T, Svensson V, James KR et al (2017) Single-cell RNA-seq and computational analysis using temporal mixture modelling resolves Th1/Tfh fate bifurcation in malaria. Sci Immunol 2(9):eaal2192
9. Patel AP, Tirosh I, Trombetta JJ et al (2014) Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science 344:1396–1401
10. Tirosh I, Izar B, Prakadan SM et al (2016) Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science 352:189–196. https://doi.org/10.1126/science.aad0501
11. Wang ET, Sandberg R, Luo S et al (2008) Alternative isoform regulation in human tissue transcriptomes. Nature 456:470–476
12. Baralle FE, Giudice J (2017) Alternative splicing as a regulator of development and tissue identity. Nat Rev Mol Cell Biol 18:437
13. Dillman AA, Hauser DN, Gibbs JR et al (2013) mRNA expression, splicing and editing in the embryonic and adult mouse cerebral cortex. Nat Neurosci 16:499
14. Scotti MM, Swanson MS (2016) RNA mis-splicing in disease. Nat Rev Genet 17:19
15. Ziegenhain C, Vieth B, Parekh S et al (2017) Comparative analysis of single-cell RNA sequencing methods. Mol Cell 65:631–643
16. Faigenbloom L, Rubinstein ND, Kloog Y et al (2015) Regulation of alternative splicing at the single-cell level. Mol Syst Biol 11:845
17. Song Y, Botvinnik OB, Lovci MT et al (2017) Single-cell alternative splicing analysis with Expedition reveals splicing dynamics during neuron differentiation. Mol Cell 67:148–161
18. La Manno G, Soldatov R, Hochgerner H et al (2018) RNA velocity of single cells. Nature 560(7719):494
19. Linker SM, Urban L, Clark S et al (2018) Combined single cell profiling of expression and DNA methylation reveals splicing regulation and heterogeneity. bioRxiv:328138
20. Huang Y, Sanguinetti G (2017) BRIE: transcriptome-wide splicing quantification in single cells. Genome Biol 18:123. https://doi.org/10.1101/098517
21. Huang Y, Sanguinetti G (2016) Statistical modeling of isoform splicing dynamics from RNA-seq time series data. Bioinformatics 32:2965–2972
22. Liu P, Sanalkumar R, Bresnick EH et al (2016) Integrative analysis with ChIP-seq advances the limits of transcript quantification from RNA-seq. Genome Res 26:1124–1133
23. Xiong HY, Alipanahi B, Lee LJ et al (2015) The human splicing code reveals new insights into the genetic determinants of disease. Science 347:1254806
24. Katz Y, Wang ET, Airoldi EM, Burge CB (2010) Analysis and design of RNA sequencing experiments for identifying isoform regulation. Nat Methods 7:1009–1015
25. Trapnell C, Williams BA, Pertea G et al (2010) Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol 28:511–515
26. Bray NL, Pimentel H, Melsted P, Pachter L (2016) Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol 34:525
27. Kim D, Langmead B, Salzberg SL (2015) HISAT: a fast spliced aligner with low memory requirements. Nat Methods 12:357
28. Dobin A, Davis CA, Schlesinger F et al (2013) STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29:15–21
29. Katz Y, Wang ET, Silterra J et al (2015) Quantitative visualization of alternative exon expression from RNA-seq data. Bioinformatics 31:2400–2402
Chapter 13
Abstract
Recent technological developments have enabled the characterization of the epigenetic landscape of single
cells across a range of tissues in normal and diseased states and under various biological and chemical
perturbations. While analysis of these profiles resembles methods from single-cell transcriptomic studies,
unique challenges are associated with bioinformatics processing of single-cell epigenetic data, including a
much larger (10–1,000×) feature set and significantly greater sparsity, requiring customized solutions.
Here, we discuss the essentials of the computational methodology required for analyzing common single-
cell epigenomic measurements for DNA methylation using bisulfite sequencing and open chromatin using
ATAC-Seq.
1 Introduction
Guo-Cheng Yuan (ed.), Computational Methods for Single-Cell Data Analysis, Methods in Molecular Biology, vol. 1935,
https://doi.org/10.1007/978-1-4939-9057-3_13, © Springer Science+Business Media, LLC, part of Springer Nature 2019
188 Caleb Lareau et al.
Fig. 1 Overview of single-cell DNA methylation data. For a heterogeneous group of cells (middle), variability in
CpG methylation can occur both between classically defined cell types (left) and within cell types (right)
[Figure 2: genome browser view of chr19:36.10-36.19 Mb with tracks for genes, bulk ATAC-seq, the sum of scATAC-seq, fragments from hundreds of single cells (per-cell counts of 0 to 2), and chromatin accessibility peaks]
Fig. 2 Overview of single-cell ATAC data. The sum of single cells’ chromatin accessibility profiles resembles that of a bulk experiment (top), though each cell has a varied open chromatin epigenome. For a diploid organism, the number of accessible chromatin counts does not generally exceed 2
Fig. 4 Overview of scATAC-seq data processing. Steps associated with data processing are shown on the left, while a representative data allocation for a given cell is shown on the right. Only about 10% of the raw data for a given scATAC-seq cell is used in downstream analyses, such as cell clustering
[Figure 6: motif-by-peak annotation for the sum of single cells (X = predicted binding), multiplied with the peak-by-cell count matrix: X' * M = Z]
        Peak 1   Peak 2   Peak 3
TF 1    O        X        O
TF 2    X        O        X
TF 3    O        O        X
TF 4    X        O        X
Fig. 6 Biologically-motivated scATAC-seq dimensionality reduction. A binary matrix of motifs by peaks is multiplied by an integer-valued matrix of peaks by cells to yield a reduced set of features (real-valued motif deviation scores) per cell. These can be utilized for downstream analyses, including visualization with t-SNE and cell type identification
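The matrix product in Fig. 6 can be sketched with NumPy as follows. The example computes, for each motif and cell, the observed fragment count in motif-containing peaks and compares it with the count expected from the overall accessibility of each peak and the sequencing depth of each cell. This is a simplified observed-versus-expected deviation written for illustration; the function name is hypothetical, and the background-peak bias correction and normalization performed by tools such as chromVAR are deliberately left out.

```python
import numpy as np

def raw_motif_deviations(motif_by_peak, peak_by_cell):
    """Simplified per-cell motif deviation scores (no bias correction).

    motif_by_peak: binary array, shape (n_motifs, n_peaks); 1 = predicted binding.
    peak_by_cell:  integer count array, shape (n_peaks, n_cells).
    Returns an array of shape (n_motifs, n_cells) with (observed - expected) / expected.
    Assumes every motif overlaps at least one peak that has fragments.
    """
    M = np.asarray(motif_by_peak, dtype=float)
    X = np.asarray(peak_by_cell, dtype=float)
    observed = M @ X                                      # fragments in motif peaks, per cell
    peak_fraction = X.sum(axis=1, keepdims=True) / X.sum()   # share of all fragments per peak
    depth = X.sum(axis=0, keepdims=True)                  # total fragments per cell
    expected = M @ (peak_fraction * depth)                # expectation if cells were alike
    return (observed - expected) / expected
```

Each cell is thereby described by one score per motif instead of one count per peak, which is the dimensionality reduction the figure depicts.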
2 Materials
7. R (https://cran.r-project.org).
8. Bioconductor (https://www.bioconductor.org) [4].
9. bedtools (http://bedtools.readthedocs.io/en/latest/index.html) [5].
10. bedGraphToBigWig (http://hgdownload.soe.ucsc.edu/admin/exe/).
3 Methods
The protocols described below share the first step (Subheading 3.1)
and are then described separately for single-cell DNA ATAC-Seq
(Subheading 3.2) and single-cell DNA methylation (Subheading
3.3). Graphical overviews of these steps for each method are shown
in Figs. 4 and 5.
3.2 scATAC-Seq
1. Trim adapter sequences from fastq files using tools such as TrimGalore (see Note 2).
3.3 scMethylation
1. Trim sequencing adapter sequences from fastq files. It is also advisable to trim off poor-quality bases at the ends of reads, which can lead to alignment errors and/or incorrect methylation calls. For example, to perform both quality and adapter trimming in a single step, one can use TrimGalore (see Note 12 for RRBS libraries):
trim_galore --paired read1.fastq.gz read2.fastq.gz
4 Notes
library(chromVAR)
library(motifmatchr)
library(BSgenome.Hsapiens.UCSC.hg19) # change based on reference genome
library(SummarizedExperiment)
library(chromVARmotifs)
shiny = FALSE)
tsne_plots <- plotDeviationsTsne(dev, tsne_results, annotation = "CTCF",
                                 sample_column = "source",
                                 shiny = FALSE)
bismark_genome_preparation /path/to/genomes/grch38/
library(Biostrings)
library(bsseq)
library(BSgenome.Hsapiens.NCBI.GRCh38)
library(readr)
library(HDF5Array)
Acknowledgments
References
1. Schep AN, Wu B, Buenrostro JD, Greenleaf WJ (2017) chromVAR: inferring transcription-factor-associated accessibility from single-cell epigenomic data. Nat Methods 14(10):975–978. https://doi.org/10.1038/nmeth.4401
2. Krueger F, Andrews SR (2011) Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications. Bioinformatics 27(11):1571–1572. https://doi.org/10.1093/bioinformatics/btr167
3. Langmead B, Salzberg SL (2012) Fast gapped-read alignment with Bowtie 2. Nat Methods 9(4):357–359. https://doi.org/10.1038/nmeth.1923
4. Huber W, Carey VJ, Gentleman R, Anders S, Carlson M, Carvalho BS, Bravo HC, Davis S, Gatto L, Girke T, Gottardo R, Hahne F, Hansen KD, Irizarry RA, Lawrence M, Love MI, MacDonald J, Obenchain V, Oles AK, Pages H, Reyes A, Shannon P, Smyth GK, Tenenbaum D, Waldron L, Morgan M (2015) Orchestrating high-throughput genomic analysis with Bioconductor. Nat Methods 12(2):115–121. https://doi.org/10.1038/nmeth.3252
5. Quinlan AR, Hall IM (2010) BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26(6):841–842. https://doi.org/10.1093/bioinformatics/btq033
6. Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nusbaum C, Myers RM, Brown M, Li W, Liu XS (2008) Model-based analysis of ChIP-Seq (MACS). Genome Biol 9(9):R137. https://doi.org/10.1186/gb-2008-9-9-r137
7. Buenrostro JD, Wu B, Litzenburger UM, Ruff D, Gonzales ML, Snyder MP, Chang HY, Greenleaf WJ (2015) Single-cell chromatin accessibility reveals principles of regulatory variation. Nature 523(7561):486–490. https://doi.org/10.1038/nature14590
8. Cusanovich DA, Daza R, Adey A, Pliner HA, Christiansen L, Gunderson KL, Steemers FJ, Trapnell C, Shendure J (2015) Multiplex single cell profiling of chromatin accessibility by combinatorial cellular indexing. Science 348(6237):910–914. https://doi.org/10.1126/science.aab1601
9. Buenrostro JD, Giresi PG, Zaba LC, Chang HY, Greenleaf WJ (2013) Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat Methods 10(12):1213–1218. https://doi.org/10.1038/nmeth.2688
10. Lareau CA, Ulirsch JC, Bao EL, Ludwig LS, Guo MH, Benner C, Satpathy AT, Salem R, Hirschhorn JN, Finucane HK, Aryee MJ, Buenrostro JD, Sankaran VG (2018) Interrogation of human hematopoiesis at single-cell and single-variant resolution. bioRxiv. https://doi.org/10.1101/255224
Chapter 14
Abstract
Transcriptional enhancers drive cell-type-specific gene expression patterns, and thus play key roles in development and disease. Large-scale consortia have extensively cataloged more than one million putative enhancers encoded in the human genome, but few enhancers have been endogenously tested for function. For almost all enhancers, it remains unknown what genes they target and how much they contribute to target gene expression. We have previously developed a method called Mosaic-seq, which enables the high-throughput interrogation of enhancer activity by performing pooled CRISPRi-based epigenetic suppression of enhancers with a single-cell transcriptomic readout. Here, we describe an optimized version of this method, Mosaic-seq2. We have made several key improvements that have significantly simplified the library preparation process and increased the overall sensitivity and throughput of the method.
1 Introduction
Guo-Cheng Yuan (ed.), Computational Methods for Single-Cell Data Analysis, Methods in Molecular Biology, vol. 1935,
https://doi.org/10.1007/978-1-4939-9057-3_14, © Springer Science+Business Media, LLC, part of Springer Nature 2019
204 Shiqi Xie and Gary C. Hon
2 Materials
2.1 Cell Culture
1. K562 cells and HEK293T cells (both from ATCC).
2. Phosphate-buffered saline (PBS), pH 7.4.
3. 0.25% Trypsin-EDTA.
4. Complete cell culture medium: Iscove’s Modified Dulbecco’s
Medium (IMDM, for K562) or Dulbecco’s Modified Eagle’s
Medium (DMEM, for HEK293T) with 10% FBS, 100 U/mL
Penicillin-Streptomycin (Thermo Fisher Scientific).
5. 10 cm cell culture dishes.
6. Hemocytometer.
Basic Procedures for Mosaic-Seq 205
[Figure 1 schematic, panel labels: K562 dCas9-KRAB cells infected by sgRNA library targeting enhancers; Cells; Oil; Beads; sgRNA; Library Prep]
Fig. 1 Overview of Mosaic-seq2. Preparation of single-cell RNA-seq libraries generally follows the 10x library preparation procedures, except that an sgRNA enrichment library is amplified from full-length cDNA
Table 1
List of oligos (see Note 1)
Name                Sequence
Oligo Amp Fwd       TAACTTGAAAGTATTTCGATTTCTTGGCTTTATATATCTTGTGGAAAGGACGAAACACCG
sgRNA_amp           TTGGCCTAGCTCTAAAAC
10-sgRNA i7-N723    CAAGCAGAAGACGGCATACGAGATGAGCGCTAGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTTGGAAAGGACGAAACACC
5. Puromycin dihydrochloride.
6. Blasticidin hydrochloride.
7. Centrifuge with swinging bucket rotor and heating function.
8. BsmBI (10 U/μL).
9. Gibson assembly master mix (NEB).
10. NEBNext High-Fidelity 2X PCR Master Mix (NEB).
11. Endura ElectroCompetent cells (Lucigen).
12. MicroPulser Electroporator (Bio-Rad).
13. Gene Pulser Electroporation Cuvette: 0.1 cm gap (Bio-Rad).
14. Standard lysogeny broth (LB) medium with 100 μg/mL
Ampicillin.
15. MinElute gel extraction kit (Qiagen).
16. Buffer EB (Qiagen).
17. ZymoPURE plasmid maxiprep kit (Zymo Research).
2.3 Virus Packaging, Titration, and Infection
1. OPTI-MEM reduced serum media (Thermo Fisher Scientific).
2. Linear polyethylenimine (PEI, Polysciences).
3. Hexadimethrine bromide (Polybrene, Sigma).
4. 24-well cell culture plates.
5. 0.45 μm filter and syringe.
6. Lenti-X lentivirus concentrator (Clontech).
7. CellTiter-Glo Luminescent Cell Viability Assay Kit (Promega).
8. 96-well plates with white wall and clear bottom.
9. Luminescence plate reader.
3 Methods
3.1 Construction of sgRNA Library
To maximize the suppression efficiency and minimize the off-target effects of the CRISPR system, we use the following steps to design our sgRNA library.
3.1.1 Design and Synthesis of sgRNA Oligos
1. Select enhancers of interest based on epigenetic features (such as histone marks, eRNA transcription, P300 binding, etc.) (Fig. 3). Overlap the enhancers of interest with the DNase-seq signals. The CRISPRi-target region is 200-bp from the summit of the DNase-seq peak.
[Figure 2 flowchart, panel labels: Lib 1 ... Lib N (two sets of libraries); Combined Read Count Matrix; saturation curve; sgRNA Matrix 1 ... sgRNA Matrix N; Combined sgRNA Matrix; Annotation; Data; Virtual FACS analysis]
Fig. 2 Overview of the analysis pipeline. Fastq files from 10x libraries and sgRNA enrichment libraries are processed separately, and then applied for differential gene expression analysis
[Figure 3: genome browser view of chr11 (HBB, HBD, HBBP1, HBG1, HBG2, HBE1, HS2; scale 10K) with DNase-seq, H3K4me1, H3K4me3, and H3K27ac tracks, above a Manhattan plot in which HBG2, HBG1, and HBD are labeled.]
Fig. 3 An example of Mosaic-seq2 analysis. Here, we use the beta-globin LCR in K562 cells as an example to show the expected results of the Virtual FACS analysis. The upper panel shows profiles of DNase I hypersensitivity and histone modifications from the ENCODE Project. The bottom panel is a Manhattan plot of FDR-corrected p-values for all genes. Genes are ordered based on their relative position on the chromosomes. Two different colors indicate different chromosomes. Note that HBG2 and HBG1 are the primary target genes of the HS2 enhancer in this region and have the most significant p-values in this unbiased test
3.1.2 PCR Amplification and Gibson Assembly
1. PCR amplify the sgRNA library oligo to make it double stranded (Table 1, see Note 3).
Reagents:
PCR Condition:
2. Run the PCR product on a 1% agarose gel. Cut out and purify the PCR product (~120 bp) by using a Qiagen MinElute kit. Elute the oligo in 2 × 8 μL of EB buffer.
3. Combine all the reactions into one tube and measure DNA
concentration with a Qubit fluorometer (Broad range DNA
kit).
4. Digest 30 μg CROP-Guide-puro-bar plasmid with BsmBI for 10 h at 55 °C. To maximize the cutting efficiency, 6 reactions are performed in parallel. Each reaction contains 5 μg of plasmid and 3 μL of enzyme.
5. Gel purify the linearized plasmid. Measure the concentration
with a Qubit fluorometer (Broad range DNA kit).
6. Ligate the barcode oligo to the linearized plasmid by Gibson Assembly (GA). Use 2 μg PCR product and 3 μg digested plasmid in a 400 μL GA reaction.
7. Incubate the mixture at 50 °C for 1 h.
8. Perform Qiagen MinElute cleanup and elute in 15 μL H2O.
9. Measure the concentration with a Qubit fluorometer (Broad range DNA kit). Adjust the concentration to 1 μg/μL. If the concentration is lower than 1 μg/μL, use a SpeedVac to reduce the total volume.
3.1.4 NGS Evaluation of Library Complexity
1. Amplify the sgRNA fragment from the plasmid library by using the following conditions (see Table 1 for the primer sequences):
Reagents:
PCR Condition:
PCR Condition:
3.2 Lentivirus Packaging, Titration, and Infection
3.2.1 Virus Packaging
1. One day before transfection, seed 8 × 10⁶ 293T cells in a 10 cm dish to reach about 80% confluence the next day. Each dish should contain 10 mL of complete medium.
2. One hour before transfection, change the medium to 10 mL of fresh complete medium.
3. Transfect 293T cells with lentivirus package plasmids in the
following ratio:
(a) 4.5 μg pMD2.G, 7 μg psPAX2 and 9 μg CROP-seq
sgRNA library per dish.
(b) For transfection, add all plasmids to 1 mL OPTI-MEM, and then add 61 μL PEI (1 mg/mL). Mix gently.
(c) Incubate at room temperature for 15 min.
(d) Add the mixture to the cells dropwise. Mix by gently shaking the dish.
4. Return the dish to the incubator.
5. Twelve hours after transfection, change the medium to 10 mL
complete medium.
6. About 72 h after transfection, collect the supernatant from the
plate.
7. Filter the supernatant by using 0.45 μm filter to remove cells
and debris.
8. To concentrate and purify the virus, combine the medium from the different dishes and add 1/3 volume of Lenti-X Concentrator reagent. Incubate at 4 °C overnight.
9. Centrifuge the sample at 1500 × g for 45 min at 4 °C. After centrifugation, an off-white pellet should be visible.
10. Carefully remove all the supernatant and resuspend the pellet in 1/10 or 1/100 of the original volume.
11. Aliquot the virus into 1.5 mL tubes and freeze at −80 °C.
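The per-dish amounts in step 3 scale linearly with the number of dishes. The following minimal sketch does that arithmetic and reports the PEI:DNA mass ratio implied by the recipe (61 μg of PEI for 20.5 μg of DNA, roughly 3:1). The function name and the returned dictionary layout are hypothetical and are not part of the protocol.

```python
# Per-dish amounts taken from step 3 of the packaging protocol (10 cm dish).
PER_DISH_UG = {"pMD2.G": 4.5, "psPAX2": 7.0, "CROP-seq sgRNA library": 9.0}
PEI_UL_PER_DISH = 61.0        # 1 mg/mL stock, so 1 uL carries 1 ug of PEI
OPTIMEM_ML_PER_DISH = 1.0


def transfection_mix(n_dishes):
    """Scale the per-dish recipe linearly (hypothetical helper)."""
    if n_dishes < 1:
        raise ValueError("n_dishes must be at least 1")
    plasmids = {name: ug * n_dishes for name, ug in PER_DISH_UG.items()}
    total_dna_ug = sum(plasmids.values())
    pei_ul = PEI_UL_PER_DISH * n_dishes
    return {
        "plasmids_ug": plasmids,
        "total_dna_ug": total_dna_ug,
        "pei_ul": pei_ul,
        "optimem_ml": OPTIMEM_ML_PER_DISH * n_dishes,
        # PEI mass (ug) equals its volume (uL) at 1 mg/mL.
        "pei_to_dna_ratio": pei_ul / total_dna_ug,
    }


if __name__ == "__main__":
    mix = transfection_mix(4)
    print(mix["plasmids_ug"], mix["pei_ul"], round(mix["pei_to_dna_ratio"], 2))
```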
3.2.2 Virus Titration
1. Seed 2 × 10⁵ K562 cells to a 96-well plate (see Note 7).
2. Thaw one virus aliquot. In a separate 96-well plate, serially dilute 20 μL of virus into 180 μL of complete medium containing 8 μg/mL polybrene, five times with a dilution factor of 10 (10¹–10⁵). Prepare three replicates of each dilution.
3. Briefly centrifuge the 96-well plate with cells, gently aspirate
the supernatant while keeping the K562 cells at the bottom.
4. Transfer the diluted virus to the plate containing cells. Also
keep several wells uninfected as the negative control.
5. Centrifuge at 1000 × g, 36 °C for 60 min.
6. Return the plate to the incubator.
7. The next day, change the medium in each well to the complete
medium with 1 μg/mL puromycin. Incubate for another
3–4 days.
8. Monitor the cell growth in the next 3–4 days until all the cells
in the uninfected wells are dead.
9. Dilute the CellTiter-Glo reagent with equal volume of PBS.
10. Aspirate the medium from each well while keeping most of the cells in the well. Add 50 μL of the diluted CellTiter-Glo to each well. Incubate at room temperature for 10 min with mild shaking.
11. Transfer the mixture to a new 96-well plate with white wall and
clear bottom for plate reading. Read the luminescence intensity
of the whole plate. Instrument settings depend on the manu-
facturer. An integration time of 0.25–1 s per well is
recommended.
12. Calculate the virus titer from the dilution with a cell survival rate between 0% and 20%. Based on the Poisson distribution, at survival rates in this range about 90% or more of the infected cells are infected by only one virus. Therefore, the cell survival rate can be used to estimate the virus titer.
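The Poisson argument in step 12 can be checked numerically. The sketch below assumes that the surviving fraction f equals the fraction of cells that received at least one virus, so the mean number of integrations per cell is m = −ln(1 − f). How f is derived from the luminescence readings, and the volume of undiluted virus that each well effectively received, are inputs that the user must supply; the function names are hypothetical.

```python
import math


def mean_integrations(survival_fraction):
    """Poisson mean m such that P(at least one virus) = survival_fraction."""
    if not 0.0 <= survival_fraction < 1.0:
        raise ValueError("survival_fraction must be in [0, 1)")
    return -math.log(1.0 - survival_fraction)


def single_virus_fraction(survival_fraction):
    """Fraction of infected cells that carry exactly one virus."""
    m = mean_integrations(survival_fraction)
    if m == 0.0:
        return 1.0
    # P(k = 1) / P(k >= 1) for a Poisson distribution with mean m.
    return m * math.exp(-m) / (1.0 - math.exp(-m))


def titer_tu_per_ml(survival_fraction, cells_per_well, stock_virus_ul):
    """Transducing units per mL of undiluted virus stock.

    stock_virus_ul is the volume of undiluted stock that the well
    effectively received (an assumption supplied by the user).
    """
    m = mean_integrations(survival_fraction)
    return m * cells_per_well / (stock_virus_ul / 1000.0)


if __name__ == "__main__":
    for f in (0.05, 0.10, 0.20):
        print(f, round(single_virus_fraction(f), 3))
```

Under these assumptions, the single-virus fraction is about 0.95 at 10% survival and about 0.89 at 20% survival, which is why the lower dilutions in the 0–20% range are preferable when they can be measured reliably.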
3.2.3 Virus Infection
1. Plate 2 × 10⁵ K562 cells into one well of a 24-well plate, which contains 500 μL of complete medium and 8 μg/mL polybrene (see Note 8).
2. Add desired amount of virus into each well (see Note 9).
3. Centrifuge the plate at 36 °C, 1000 × g for 1 h.
4. Return the plate to the incubator.
5. Change the medium to fresh complete medium with antibiotics. For K562 cells, we use 20 μg/mL blasticidin and 1 μg/mL puromycin. Combine and transfer the cells to a 75 cm² flask (see Note 10).
6. Keep the cells growing for the next 5–7 days. Split the cells once they reach confluency. The cells should grow robustly in the presence of antibiotics.
3.3 Construction of Single-Cell RNA-Seq Libraries
The library construction follows the standard protocol of the 10x Genomics Single Cell 3′ Library kit with several modifications. In the following steps, we focus primarily on the modified steps. The standard steps are labeled as “Std.” For more details, please check the 10x Genomics website (see Note 11): https://support.10xgenomics.com/single-cell-gene-expression/library-prep/.
1. Infect K562 cells with dCas9-KRAB virus, culture the cells for
2 weeks under the selection of blasticidin to generate a stable
cell line expressing dCas9-KRAB. The cells can be frozen for
future experiments.
2. Infect K562-dCas9-KRAB cells with the virus containing
sgRNA library and NC sgRNAs (sgNC), respectively. Culture
the cells for 5–7 days under selection of puromycin.
3. Transfer the infected K562-dCas9-KRAB cells to 50 mL conical tubes and centrifuge at 500 × g for 4 min. Aspirate the medium and wash once with PBS. Process the sgNC cells and library-infected cells separately (see Note 12).
4. Resuspend the cells in a small volume (see Note 13) of PBS + 0.01% BSA (PBS-BSA, non-acetylated). Gently resuspend the cells by using a wide-bore 1 mL tip.
5. Pass the cells through a 40 μm strainer to remove debris (see Note 14).
6. Measure the cell concentration by using Trypan Blue. Dilute the cells appropriately before counting since they may exceed 1000 cells/μL. Take both live and dead cells into account. The cells must be in single-cell suspension and the survival rate should be over 90% (this is very important if adherent cells are used).
7. (Std) For generating a library of 10,000 cells, dilute the cells to 1000 cells/μL (see Note 15) by using PBS-BSA.
8. Mix the sgNC cells with the library-infected cells at a ratio of 1:20 (about 5% sgNC cells). Keep the cells on ice while processing other reagents.
9. (Std) Load the cells to the Single Cell A Chip, as well as the
partition oil and beads.
10. (Std) Run the Chromium Controller following the instruction.
Transfer the GEMs to new strip PCR tubes.
11. (Std) Perform reverse transcription reactions for GEMs.
12. (Std) Clean up the cDNA by using the Silane DynaBeads.
13. (Std) Perform the cDNA amplification PCR.
14. (Std) Perform SPRI Beads cleanup for the cDNA product.
Elute in 40.5 μL of Buffer EB.
15. (Std) Run 1 μL of sample on Agilent TapeStation D5000
ScreenTape to determine the cDNA size distribution. Also
use 1 μL of sample in the Qubit Fluorometer (High Sensitivity
DNA kit) to accurately quantify concentration.
16. Save 15 ng of cDNA for the sgRNA enrichment PCR (about 5 μL). Add 5 μL of Buffer EB to the original cDNA sample and continue the standard 10x library prep.
17. (Std) Finish the standard 10x library preparation.
18. Use the 15 ng cDNA for the enrichment PCR of sgRNA.
Choose primers with different i7 indices for different libraries.
Reagents:
PCR Condition:
3.4 Data Analysis
3.4.1 Mapping of 10x Libraries
We use the Cell Ranger software from 10x Genomics for preprocessing of the single-cell sequencing data. For more information, please check the 10x website (https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger).
1. Use the “cellranger mkfastq” command to create FASTQ files from the raw base call (BCL) files with the flag “--ignore-dual-index.”
2. Use the “cellranger count” command to map and preprocess the data, with the flag “--expect-cells=10000.”
3. To combine multiple 10x libraries from different sequencing runs, use the “cellranger aggr” command with the flag “--normalize=mapped.”
4. The combined HDF5 file named “filtered_gene_bc_matrices_h5.h5,” which contains all the read count matrices, gene names, and cell barcodes, will be used for the following analysis.
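The three Cell Ranger calls above can be assembled programmatically when many libraries are processed. The sketch below only builds the argument lists and does not execute anything. The flags shown are the ones quoted in the steps plus the usual --run, --csv, --id, --fastqs, and --transcriptome options; their availability may depend on the installed Cell Ranger version. All paths, identifiers, and function names are placeholders.

```python
import shlex


def mkfastq_cmd(run_dir, samplesheet_csv):
    """Demultiplex a sequencing run into FASTQ files (step 1)."""
    return ["cellranger", "mkfastq", f"--run={run_dir}",
            f"--csv={samplesheet_csv}", "--ignore-dual-index"]


def count_cmd(sample_id, fastq_dir, transcriptome_dir, expect_cells=10000):
    """Map and preprocess one library (step 2)."""
    return ["cellranger", "count", f"--id={sample_id}",
            f"--fastqs={fastq_dir}", f"--transcriptome={transcriptome_dir}",
            f"--expect-cells={expect_cells}"]


def aggr_cmd(aggr_id, libraries_csv):
    """Combine several libraries (step 3)."""
    return ["cellranger", "aggr", f"--id={aggr_id}",
            f"--csv={libraries_csv}", "--normalize=mapped"]


if __name__ == "__main__":
    # Placeholder paths; print the commands instead of running them.
    commands = [
        mkfastq_cmd("/path/to/run", "samplesheet.csv"),
        count_cmd("lib1", "/path/to/fastqs", "/path/to/refdata"),
        aggr_cmd("combined", "libraries.csv"),
    ]
    for cmd in commands:
        print(shlex.join(cmd))
```

Each list can be passed to subprocess.run once the paths have been checked.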
3.4.3 Virtual FACS Analysis
1. We only consider the genes expressed in at least 5% of the cells.
2. Define M as the total number of cells in the population.
3. Define E_g as the median expression (in units of cpm) of a given gene g.
4. Define N as the number of cells with expression less than E_g. Note that because of the zero-inflated nature of scRNA-Seq data, N is often not M/2.
5. Define K_s as the number of cells expressing sgRNA s (see Note 17).
6. Define X_{g,s} as the number of cells expressing s that have expression of g less than E_g.
7. For each gene g and sgRNA s, calculate a p-value using a hypergeometric test with parameters M, N, X_{g,s}, and K_s.
8. Adjust the p-values of each sgRNA/enhancer region using the Benjamini-Hochberg procedure (FDR) (Fig. 3).
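Steps 7 and 8 can be sketched with the Python standard library alone. The example assumes an upper-tail test, that is, the probability of observing at least X low-expressing cells among the K cells that carry a given sgRNA, which is the natural direction for a repressive CRISPRi perturbation; the source does not state the tail explicitly. The function names are hypothetical and this is not the authors' implementation.

```python
from math import comb


def hypergeom_upper_tail(M, N, K, x):
    """P(X >= x) when K cells are drawn without replacement from M cells,
    of which N have expression below the median (the 'low' cells)."""
    total = comb(M, K)
    # comb() returns 0 outside the support, so no extra bounds are needed.
    favourable = sum(comb(N, i) * comb(M - N, K - i)
                     for i in range(x, min(N, K) + 1))
    return favourable / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down to enforce monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Toy numbers: 10,000 cells, 4,000 'low' cells, 50 cells with the sgRNA,
    # 35 of which are 'low' for the gene of interest.
    p = hypergeom_upper_tail(10000, 4000, 50, 35)
    print(p, benjamini_hochberg([p, 0.2, 0.5]))
```

In practice, one p-value is computed per gene for each sgRNA, and the correction is applied across all genes tested for that sgRNA or enhancer region.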
4 Notes
Acknowledgments
References
1. ENCODE Project Consortium (2012) An 8. Jaitin DA, Weiner A, Yofe I et al (2016) Dis-
integrated encyclopedia of DNA elements in secting immune circuits by linking CRISPR-
the human genome. Nature 489:57–74 pooled screens with single-cell RNA-seq. Cell
2. Kundaje A, Meuleman W, Roadmap Epige- 167:1883–1896.e15
nomics Consortium et al (2015) Integrative 9. Datlinger P, Rendeiro AF, Schmidl C et al
analysis of 111 reference human epigenomes. (2017) Pooled CRISPR screening with single-
Nature 518:317–330 cell transcriptome readout. Nat Methods
3. Long HK, Prescott SL, Wysocka J (2016) 14:297–301
Ever-changing landscapes: transcriptional 10. Hill AJ, McFaline-Figueroa JL, Starita LM et al
enhancers in development and evolution. Cell (2018) On the design of CRISPR-based single-
167:1170–1187 cell molecular screens. Nat Methods
4. Xie S, Duan J, Li B et al (2017) Multiplexed 15:271–274
engineering and analysis of combinatorial 11. Xie S, Cooley A, Armendariz D et al (2018)
enhancer activity in single cells. Mol Cell Frequent sgRNA-barcode recombination in
66:285–299.e5 single-cell perturbation assays. PLoS One 13:
5. Gilbert LA, Larson MH, Morsut L et al (2013) e0198635
CRISPR-mediated modular RNA-guided reg- 12. Xu H, Xiao T, Chen C-H et al (2015)
ulation of transcription in eukaryotes. Cell Sequence determinants of improved CRISPR
154:442–451 sgRNA design. Genome Res 25:1147–1157
6. Thakore PI, D’Ippolito AM, Song L et al 13. Smith T, Heger A, Sudbery I (2017)
(2015) Highly specific epigenome editing by UMI-tools: modeling sequencing errors in
CRISPR-Cas9 repressors for silencing of distal unique molecular identifiers to improve quan-
regulatory elements. Nat Methods tification accuracy. Genome Res 27:491–499
12:1143–1149 14. Macosko EZ, Basu A, Satija R et al (2015)
7. Dixit A, Parnas O, Li B et al (2016) Perturb- Highly parallel genome-wide expression
seq: dissecting molecular circuits with scalable profiling of individual cells using nanoliter dro-
single-cell RNA profiling of pooled genetic plets. Cell 161:1202–1214
screens. Cell 167:1853–1866.e17
Chapter 15
Abstract
In this chapter, we describe TraCeR and BraCeR, our computational tools for reconstruction of paired full-length antigen receptor sequences and clonality inference from single-cell RNA-seq (scRNA-seq) data. In brief, TraCeR extracts T-cell receptor (TCR)-derived reads by aligning the reads from each cell against synthetic TCR sequences, and then assembles these reads into full-length recombined TCR sequences. BraCeR builds on the TraCeR pipeline and accounts for somatic hypermutations (SHM) and isotype switching. Here we discuss experimental design, use of the tools, and interpretation of the results.
Key words TCR, BCR, Immunoglobulin, Single cell, RNA-seq, scRNA-seq, Antigen receptor reconstruction, Tracer, Bracer
1 Introduction
Guo-Cheng Yuan (ed.), Computational Methods for Single-Cell Data Analysis, Methods in Molecular Biology, vol. 1935,
https://doi.org/10.1007/978-1-4939-9057-3_15, © Springer Science+Business Media, LLC, part of Springer Nature 2019
223
224 Ida Lindeman and Michael J. T. Stubbington
2 Materials
2.1 Sequencing Data The following aspects should be taken into consideration when
choosing a library preparation protocol and sequencing platform
for the generation of data as input to [TB]raCeR.
1. Choose a library preparation protocol that generates sequenc-
ing reads from the full length of mRNA transcripts (see Note 1).
2. Paired-end (PE) reads provide the maximum reconstruction
rate and accuracy compared with single-end (SE) reads.
3. Sequence your library with a minimum read length of 50 bases
(see Note 2).
4. The read depth required to reconstruct TCRs or BCRs from a
single cell depends on the cell type and activation state (see
Note 3).
5. Make sure the reads are demultiplexed according to the cell of
origin after sequencing.
6. Perform basic quality control of the raw reads (see Note 4).
7. [TB]raCeR accepts FASTQ files (fastq or fastq.gz) as input.
BraCeR also accepts assembled BCR sequences in FASTA for-
mat for clonality inference (see Subheading 3.2.4).
2.3.2 Running [TB]raCeR as a Standalone Docker Image
Alternatively, [TB]raCeR can be run as a standalone Docker image from DockerHub, with all of the dependencies installed and configured appropriately (see Note 14).
1. Pull the Docker container from DockerHub with docker pull teichlab/[tb]racer.
2. Increase the memory limit for Docker to 6–8 GB (see Note 15).
3. Run the following command, followed by any appropriate arguments, from your input data directory: docker run -it --rm -v $PWD:/scratch -w /scratch teichlab/[tb]racer.
2.4 Testing [TB]raCeR
1. Run [tb]racer test with optional arguments (see Note 16).
2. Compare the output in test_data/results/filtered_[TB]CR_summary with the expected results in test_data/expected_summary (see Note 17).
3 Methods
3.1 [TB]raCeR Pipeline
The [TB]raCeR pipelines consist of two main steps (Fig. 1):
1. Reconstruction of TCR/BCR sequences from each cell (assem-
ble command).
2. Creation of clonal networks (summarise command).
3.2.2 Preparing Input for [TB]raCeR
[TB]raCeR takes as input FASTQ files containing sequencing reads generated from a single cell. Thus, data must be demultiplexed such that reads from each cell are identified and written to separate files.
Single-Cell Antigen Receptor Sequence Reconstruction 227
Fig. 1 Overview of the [TB]raCeR pipelines. Adapted from Stubbington et al. (ref. 9)
Fig. 2 Illustration of the two alignment steps for IgH when reads are 50 bases or shorter
3.2.3 Running [TB]raCeR in Assemble Mode with Default Settings
Run [TB]raCeR assemble with the main arguments described below. Note that the two tools have slightly different usage (see below), but many of the arguments are the same.
tracer assemble [options] <file_1> [<file_2>] <cell_name> <output_directory>
bracer assemble [options] <cell_name> <output_directory> [<file_1>] [<file_2>]
1. <file_1> is the FASTQ file providing #1 mates from PE
sequencing or all of the reads from SE sequencing. May be
left blank if running BraCeR with --assembled_file.
2. <file_2> is the FASTQ file providing #2 mates for PE reads.
3. <cell_name> is a name that will be used for references to
the cell.
4. <output_directory> is the directory for output, and should be
identical for cells to be summarized together.
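Because the positional arguments come in a different order for the two tools, it can help to generate the per-cell commands from one place. The following sketch follows the two usage lines above; the helper name is hypothetical, and the commands are returned as argument lists rather than executed.

```python
def assemble_command(tool, cell_name, output_dir, fastq_1=None, fastq_2=None,
                     options=()):
    """Build an argument list for 'tracer assemble' or 'bracer assemble'.

    tracer: [options] <file_1> [<file_2>] <cell_name> <output_directory>
    bracer: [options] <cell_name> <output_directory> [<file_1>] [<file_2>]
    """
    reads = [f for f in (fastq_1, fastq_2) if f]
    if tool == "tracer":
        if not fastq_1:
            raise ValueError("tracer assemble needs at least <file_1>")
        return ["tracer", "assemble", *options, *reads, cell_name, output_dir]
    if tool == "bracer":
        # <file_1> may be omitted when BraCeR is run with --assembled_file.
        return ["bracer", "assemble", *options, cell_name, output_dir, *reads]
    raise ValueError("tool must be 'tracer' or 'bracer'")


if __name__ == "__main__":
    # Placeholder cell names and FASTQ paths; all cells share one output dir.
    for cell in ("cell_A", "cell_B"):
        print(" ".join(assemble_command(
            "tracer", cell, "tracer_out",
            f"{cell}_1.fastq.gz", f"{cell}_2.fastq.gz")))
```

Using the same output directory for every cell keeps the results together for the later summarise step.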
3.2.4 Running [TB]raCeR in Assemble Mode with Optional Arguments
The following optional arguments can be passed to either TraCeR or BraCeR:
1. -s/--species: Species from which the cells were derived. The
default is mouse (Mmus) for TraCeR and human (Hsap) for
BraCeR.
Fig. 3 Identification of clonally related productive BCR sequences for each locus
3.3.2 Generation of Clonal Networks
We use custom scripts to assess the clonal groups and generate network graphs as follows.
1. Each single cell is represented by a node in the graph.
2. Reconstructed sequences are represented within nodes by hor-
izontal lines colored according to locus and productivity or by
the sequence identifier.
3. Edges between the nodes represent clonally related TCR/BCR
sequences, and are color coded according to locus. Edges
between B cells are only drawn if they share a clonally related
productive IgH and a clonally related productive Igκ or Igλ (see
Note 30).
4. Edge thickness is proportional to the number of shared
sequences for a locus.
5. Nonproductively rearranged BCR sequences are determined to
be shared within a clone group and included as edges in the
graph if they have overlapping V- and J-gene assignments. If
the cells only share a nonproductive chain for a specific locus
(Igκ or Igλ), this is shown with a dotted instead of a solid line in
the clonal network.
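The edge rule for B cells in step 3 can be illustrated with a small sketch. It assumes that clonally related productive sequences have already been grouped upstream, so each cell is described by a set of clone labels per locus (H for IgH, K for Igκ, L for Igλ). This data layout and the function name are hypothetical; nonproductive chains and graph drawing are not modeled.

```python
from itertools import combinations


def b_cell_edges(cells):
    """Return (cell_a, cell_b, locus, n_shared) tuples.

    cells maps a cell name to {locus: set of clone labels of its
    productive chains}. An edge is drawn only if the two cells share
    a productive IgH clone and a productive Igk or Igl clone.
    """
    edges = []
    for a, b in combinations(sorted(cells), 2):
        shared = {
            locus: cells[a].get(locus, set()) & cells[b].get(locus, set())
            for locus in ("H", "K", "L")
        }
        if shared["H"] and (shared["K"] or shared["L"]):
            for locus, clones in shared.items():
                if clones:
                    # n_shared would set the edge thickness for this locus.
                    edges.append((a, b, locus, len(clones)))
    return edges


if __name__ == "__main__":
    example = {
        "cell1": {"H": {"h1"}, "K": {"k1"}},
        "cell2": {"H": {"h1"}, "K": {"k1"}},
        "cell3": {"H": {"h1"}, "L": {"l9"}},  # shares only the heavy chain
    }
    print(b_cell_edges(example))
```

In the example, cell3 is not connected to the others because sharing IgH alone is not sufficient under this rule.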
3.3.3 Construction of Immunoglobulin Lineage Trees
BraCeR offers a complete pipeline based on both heavy and light chains for construction of lineage trees through Change-O, Alakazam [27], and PHYLIP [28], consisting of the following:
1. Build IgBLAST reference databases using IMGT-gapped
sequences (see Note 31).
2. Run IgBLAST on all sequences belonging to a clone group.
3. Parse IgBLAST output and create Change-O database.
4. Add clone number, isotype and cell name to the Change-O
database for each sequence.
5. Reconstruct the germline sequences (with masked junction) in
each clone group with Change-O CreateGermlines.
6. Concatenate productive heavy and light chain shared in each
clone group (see Note 32).
7. Run the appropriate Alakazam commands through our lineage.R script (see Note 33).
3.3.4 Running [TB]raCeR in Summarise Mode
Run the [TB]raCeR summarise command with options as described below. <input_dir> is the directory containing subdirectories of each cell you want to summarise (see Note 34).
[tb]racer summarise [options] <input_dir>
The following optional arguments can be passed to either TraCeR or BraCeR:
1. -c/--config_file: Path to the configuration file (see Note 13). Default = ~/.[tb]racerrc.
2. -u/--use_unfiltered: Set this option to run summarise with all
reconstructed recombinants without filtering cases where more
than two sequences are detected for a particular locus.
3. --resource_dir: Path to the directory containing resources required for alignment. Use this option if your resources are located somewhere other than the default resources directory.
4. -s/--species: Species of origin. Default = Mmus (mouse) for TraCeR and Hsap (human) for BraCeR (see Note 35).
5. --loci: Space-separated list of loci to summarize (see Note 36).
6. -g/--graph_format: Output format of clone networks
(see Note 37).
7. --no_networks: Do not draw clonotype network graphs (see
Note 38).
The following optional arguments are specific to TraCeR:
1. --receptor_name: Specify if other than “TCR” when using the
Build module.
2. -i/--keep_invariant: Set this option to keep invariant cells (see
Note 39).
3.3.5 Output of the TraCeR Summarise Step
The output of the TraCeR summarise step is written to filtered_TCR<loci>_summary or unfiltered_TCR<loci>_summary. The following output files are generated:
1. TCR_summary.txt: TCR reconstruction summary
statistics file.
2. recombinants.txt: File listing the identifier, lengths, and pro-
ductivity of each reconstructed TCR for each cell.
3. reconstructed_lengths_TCR[A|B].[pdf|txt]: Distribution plots
and underlying data displaying reconstructed VDJ region
lengths for each locus.
4. clonotype_sizes.[pdf|txt]: Distribution of clonotype sizes shown
as bar plots and underlying data.
5. clonotype_network_[with|without]_identifiers.<graph_format>:
Clonotype networks in graphical format with recombinant
identifiers or with lines representing the presence of recombi-
nants for a locus in a cell.
6. clonotype_network_[with|without]_identifiers.dot: Clonotype
networks described in the Graphviz DOT language.
3.3.6 Output of the BraCeR Summarise Step
The following output files and subdirectories may be generated (depending on options):
1. BCR_summary.txt: BCR reconstruction summary
statistics file.
2. changeodb.tab: Database file describing all reconstructed
sequences in single cells. Recombinants in suspected multiplets
are included if run with --include_multiplets.
3. filtered_multiplets_changeodb.tab: Database file with recon-
structed recombinants from suspected multiplets unless run
with --include_multiplets.
4. IMGT_gapped.tab: Database file for all reconstructed
sequences based on IMGT-gapped reference sequences.
3.4 Quality Control of Output
A current challenge of scRNA-seq is to detect and filter out reads that are not in fact derived from a single cell, but rather from unintentional cell multiplets or cross-contamination due to PCR chimeras or free RNA from lysed cells [29]. The number of reconstructed chains for a locus may be used to filter out multiple captures or potential contaminations because a single B- or T-cell should not have more than two recombined antigen receptor chains for a given locus. It is important to filter out such cells from the dataset as they otherwise could hinder correct clonotype inference. Furthermore, TraCeR and BraCeR are built on the assumption that each cell contains a maximum of two reconstructed sequences for each BCR/TCR locus, and BCR/TCR reconstruction from bulk samples or unintentional cell multiplets may therefore potentially give rise to some incorrectly reconstructed sequences. Filtering of suspected cell multiplets is done automatically for BraCeR, and can be employed manually for TraCeR.
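For TraCeR, the "no more than two chains per locus" criterion can be applied manually with a few lines of code. The sketch below assumes that the reconstructed recombinants have already been parsed into (cell name, locus) pairs, one pair per reconstructed sequence; this input format and the function name are hypothetical and do not mirror the layout of any TraCeR output file.

```python
from collections import Counter


def suspected_multiplets(recombinants, max_per_locus=2):
    """Return the sorted names of cells with more than max_per_locus
    reconstructed sequences for any single locus."""
    counts = Counter((cell, locus) for cell, locus in recombinants)
    flagged = {cell for (cell, _locus), n in counts.items()
               if n > max_per_locus}
    return sorted(flagged)


if __name__ == "__main__":
    # Toy input: cell2 has three TCRB sequences and is flagged.
    pairs = [
        ("cell1", "A"), ("cell1", "B"),
        ("cell2", "A"), ("cell2", "B"), ("cell2", "B"), ("cell2", "B"),
        ("cell3", "A"), ("cell3", "A"), ("cell3", "B"),
    ]
    print(suspected_multiplets(pairs))
```

Whether flagged cells are removed or only inspected should depend on the dataset and the biological question.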
3.4.2 Manual Inspection of Potential Cell Multiplets
The frequency of cell multiplets may vary from dataset to dataset, and the importance of removing potential cell multiplets versus the risk of filtering out false potential multiplets may also vary depending on the biological question and experimental setup. Filtering of potential multiplets could therefore be done with several degrees of strictness, and should be determined by the user for each individual dataset. Our general recommendations for manual inspection and removal of potential cell multiplets are (from more permissive to more restrictive filtering):
Fig. 4 Example of a clonotype network from a mouse 14 days after infection with Salmonella. Each node
represents a T cell (a). Horizontal bars represent reconstructed TCR sequences, with dark colors being
productive and light colors nonproductive. Edges between nodes indicate sharing of one or more TCRs for
each locus with edge thickness being proportional to the number of sequences shared between two nodes.
The clone groups have different node background colors for visualization purposes. An example of a network
with identifiers showing only one of the clone groups from the same mouse is shown in (b). Figures are
adapted from Stubbington [9]
3.6 Interpreting BraCeR Clonotype Output
3.6.1 Clonotype Networks
As for TraCeR, the clonal inference based on reconstructed BCRs is represented as graphical output with horizontal lines indicating whether a clonally related recombinant for each locus is present (Fig. 5a) or full recombinant identifiers (Fig. 5b). Here, we discuss a few patterns that could be observed in the BraCeR clonotype networks, exemplified by clone groups in Fig. 5a.
1. Unless BraCeR is run with --IGH_networks, all the cells in each
clone group are required to share at least one productive IgH
and one productive Igκ or Igλ. This requirement makes the
clonal assignments fairly certain.
2. If BraCeR is run with --IGH_networks, larger clone groups with cells sharing a clonally related IgH may consist of subclones sharing a specific light chain. This cannot be exemplified by Fig. 5a as, in this instance, BraCeR was not run with --IGH_networks.
3. Sharing of additional reconstructed BCR sequences, either
productive or nonproductive, within a clone group strengthens
the evidence of correct clonal assignments due to the extremely
small likelihood that two independent cells would undergo the
same complete set of recombination events during develop-
ment in the bone marrow.
4. Some cells have additional reconstructed chains that are not shared within the clone group (e.g., one cell in the largest clone group having two productive Igλ). This could be due to varying expression levels and hence differences in reconstruction sensitivity, to technical issues such as contamination or misassemblies, or to true biological variability (see Note 43).
5. Clone groups spanning different isotypes and subtypes of main
isotypes may be observed (e.g., two of the clone groups span-
ning IgA1 and IgG1), indicating that members of the clone
have undergone class-switching.
Fig. 5 Example of a clonotype network from a human donor (a). Each node represents a plasmablast.
Horizontal bars indicate reconstructed BCR sequences in each cell, with dark colors denoting productive and
light colors nonproductive chains. Sharing of one or more clonally related BCRs for each locus is visualized as
edges between the nodes, with edge thickness being proportional to the number of shared sequences. The
background color of each node indicates the isotype of the cell. An example of a network with identifiers for
one of the clone groups is shown in (b). Figure a is reproduced from Lindeman [10], and the figures are based
on raw scRNA-seq data published in [14]
3.6.2 Lineage Trees
The lineage trees resulting from running BraCeR with --infer_lineage are useful to acquire more information about the similarity of the clonally related sequences within a clone group and how they may have evolved through affinity maturation. These lineage trees are built using maximum parsimony (see Note 44) with the inferred combined heavy and light chain germline sequence as outgroup (black node) and inferred intermediate sequences not observed in the sample as white nodes (Fig. 6). The larger nodes in each lineage tree are labeled with the cell name(s) containing the sequence representing each node, and the background color of each node corresponds to the isotype(s) of the IgH. The size of each node is proportional to the number of cells in which the sequence was reconstructed.
3.6.3 Further Repertoire Analysis Using External Tools
The output of BraCeR may be further analyzed with other available tools for BCR repertoire analysis such as the Change-O suite for analysis of SHM, lineage reconstruction, and repertoire diversity. BraCeR aims to follow common data standards as they are being outlined by the Adaptive Immune Receptor Repertoire (AIRR) community [31] in order to facilitate use of external tools (see Note 45). Most current tools for BCR repertoire analysis are designed for high-throughput repertoire sequencing data of bulk samples, and do not automatically deal with paired heavy and light chain sequences. Some practical guidelines for BCR sequencing repertoire analysis have been reviewed in [32].
Fig. 6 Example of lineage trees constructed for the two largest clone groups in Fig. 5. Each node represents the combined productive IgH and light chain sequence of a plasmablast. Edges between nodes indicate the edit distance between the sequences. The background color of each node indicates the isotype of the IgH. The figure is adapted from Lindeman [10]
3.7 Building Resources with Build
3.7.1 Introduction
The Build mode of [TB]raCeR creates the required resources from user-specified reference sequences. It can be used to run [TB]raCeR for species other than human or mouse, or to use a particular set of reference sequences. The Build mode creates synthetic reference sequences called combinatorial recombinomes, consisting of every combination of V-alleles and J-alleles, with a masked junctional region and leader sequence to allow mapping of TCR- or BCR-derived reads to a reference. For TCRs, the first ~260 nucleotides of the constant region gene are then appended to each synthetic sequence for the locus to allow mapping of reads running into the C region (see Note 46 for BCRs). Build also creates databases compatible with IgBLAST and BLAST.
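The idea of a combinatorial recombinome can be illustrated with a toy sketch. It pairs every V allele with every J allele, masks the leader and junction with runs of N, and appends the start of the constant region. The mask lengths, the naming scheme, and the function itself are illustrative assumptions and not the Build implementation.

```python
from itertools import product


def combinatorial_recombinome(v_alleles, j_alleles, constant="",
                              leader_n=20, junction_n=7, constant_len=260):
    """Return {name: sequence} for every V/J allele combination.

    v_alleles and j_alleles map allele names to nucleotide sequences.
    leader_n and junction_n are hypothetical mask lengths (runs of N).
    """
    recombinome = {}
    for (v_name, v_seq), (j_name, j_seq) in product(v_alleles.items(),
                                                    j_alleles.items()):
        sequence = ("N" * leader_n + v_seq + "N" * junction_n + j_seq
                    + constant[:constant_len])
        recombinome[f"{v_name}_{j_name}"] = sequence
    return recombinome


if __name__ == "__main__":
    # Toy sequences only; real inputs would be full reference alleles.
    v = {"V1*01": "ACGT", "V2*01": "TTGA"}
    j = {"J1*01": "GGC", "J2*01": "CCA"}
    for name, seq in combinatorial_recombinome(v, j, constant="AAAT").items():
        print(name, seq)
```

The number of synthetic sequences grows as the product of the V and J allele counts, which is why the references are generated once by Build and then reused.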
4 Notes
References
1. Bashford-Rogers RJM, Palser AL, Huntly BJ, Genom 17:209–219. https://doi.org/10.
Rance R, Vassiliou GS, Follows GA, Kellam P 1093/bfgp/elx025
(2013) Network properties derived from deep 8. Stubbington MJT, Rozenblatt-Rosen O,
sequencing of human B-cell receptor reper- Regev A, Teichmann SA (2017) Single-cell
toires delineate B-cell populations. Genome transcriptomics to explore the immune system
Res 23(11):1874–1884. https://doi.org/10. in health and disease. Science 358
1101/gr.154815.113 (6359):58–63. https://doi.org/10.1126/sci
2. Weinstein JA, Jiang N, White RA 3rd, Fisher ence.aan6828
DS, Quake SR (2009) High-throughput 9. Stubbington MJT, Lonnberg T, Proserpio V,
sequencing of the zebrafish antibody reper- Clare S, Speak AO, Dougan G, Teichmann SA
toire. Science 324(5928):807–810. https:// (2016) T cell fate and clonality inference from
doi.org/10.1126/science.1170020 single-cell transcriptomes. Nat Methods 13
3. Wang C, Sanders CM, Yang Q, Schroeder HW (4):329–332. https://doi.org/10.1038/
Jr, Wang E, Babrzadeh F, Gharizadeh B, Myers nmeth.3800
RM, Hudson JR Jr, Davis RW, Han J (2010) 10. Lindeman I, Emerton G, Mamanova L, Snir O,
High throughput sequencing reveals a complex Polanski K, Qiao SW et al (2018) BraCeR: B-
pattern of dynamic interrelationships among cell-receptor reconstruction and clonality infer-
human T cell subsets. Proc Natl Acad Sci U S ence from single-cell RNA-seq. Nat Methods
A 107(4):1518–1523. https://doi.org/10. 15(8):563–565
1073/pnas.0913939107 11. Eltahla AA, Rizzetto S, Pirozyan MR, Betz-
4. Rosati E, Dowds CM, Liaskou E, Henriksen Stablein BD, Venturi V, Kedzierska K, Lloyd
EKK, Karlsen TH, Franke A (2017) Overview AR, Bull RA, Luciani F (2016) Linking the T
of methodologies for T-cell receptor repertoire cell receptor to the single cell transcriptome in
analysis. BMC Biotechnol 17:61. https://doi. antigen-specific human T cells. Immunol Cell
org/10.1186/s12896-017-0379-9 Biol 94(6):604–611
5. Bashford-Rogers RJM, Palser AL, Idris SF, 12. Rizzetto S, Koppstein DNP, Samir J, Singh M,
Carter L, Epstein M, Callard RE, Douek DC, Reed JH, Cai CH et al (2018) B-cell receptor
Vassiliou GS, Follows GA, Hubank M, Kellam reconstruction from single-cell RNA-seq with
P (2014) Capturing needles in haystacks: a VDJPuzzle. Bioinformatics 34
comparison of B-cell receptor sequencing (16):2846–2847
methods. BMC Immunol 15:29. https://doi. 13. Afik S, Yates KB, Bi K, Darko S, Godec J,
org/10.1186/s12865-014-0029-0 Gerdemann U, Swadling L, Douek DC,
6. Han A, Glanville J, Hansmann L, Davis MM Klenerman P, Barnes EJ, Sharpe AH, Haining
(2014) Linking T-cell receptor sequence to WN, Yosef N (2017) Targeted reconstruction
functional phenotype at the single-cell level. of T cell receptor sequence from single cell
Nat Biotechnol 32:684. https://doi.org/10. RNA-seq links CDR3 length to T cell differen-
1038/nbt.2938 tiation state. Nucleic Acids Res 45(16):e148.
7. Kolodziejczyk AA, Lönnberg T (2017) Global https://doi.org/10.1093/nar/gkx615
and targeted approaches to single-cell tran- 14. Canzar S, Neu KE, Tang Q, Wilson PC, Khan
scriptome characterization. Brief Funct AA (2017) BASIC: BCR assembly from single
248 Ida Lindeman and Michael J. T. Stubbington
cells. Bioinformatics 33(3):425–427. https:// 22. Bray NL, Pimentel H, Melsted P, Pachter L
doi.org/10.1093/bioinformatics/btw631 (2016) Near-optimal probabilistic RNA-seq
15. Upadhyay AA, Kauffman RC, Wolabaugh AN, quantification. Nat Biotechnol 34
Cho A, Patel NB, Reiss SM, Havenar- (5):525–527. https://doi.org/10.1038/nbt.
Daughton C, Dawoud RA, Tharp GK, Sanz I, 3519
Pulendran B, Crotty S, Lee FE-H, 23. Patro R, Duggal G, Love MI, Irizarry RA,
Wrammert J, Bosinger SE (2018) BALDR: a Kingsford C (2017) Salmon provides fast and
computational pipeline for paired heavy and bias-aware quantification of transcript expres-
light chain immunoglobulin reconstruction in sion. Nat Methods 14:417. https://doi.org/
single-cell RNA-seq data. Genome Med 10 10.1038/nmeth.4197
(1):20. https://doi.org/10.1186/s13073- 24. Camacho C, Coulouris G, Avagyan V, Ma N,
018-0528-3 Papadopoulos J, Bealer K, Madden TL (2009)
16. Lönnberg T, Svensson V, James KR, BLAST+: architecture and applications. BMC
Fernandez-Ruiz D, Sebina I, Montandon R, bioinformatics 10:421. https://doi.org/10.
Soon MSF, Fogg LG, Nair AS, Liligeto UN, 1186/1471-2105-10-421
Stubbington MJT, Ly L-H, Bagger FO, 25. Ilicic T, Kim JK, Kolodziejczyk AA, Bagger
Zwiessele M, Lawrence ND, Souza-Fonseca- FO, McCarthy DJ, Marioni JC, Teichmann
Guimaraes F, Bunn PT, Engwerda CR, Heath SA (2016) Classification of low quality cells
WR, Billker O, Stegle O, Haque A, Teichmann from single-cell RNA-seq data. Genome Biol
SA (2017) Single-cell RNA-seq and computa- 17(1):29. https://doi.org/10.1186/s13059-
tional analysis using temporal mixture model- 016-0888-1
ing resolves TH1/TFH fate bifurcation in 26. Gupta NT, Vander Heiden JA, Uduman M,
malaria. Sci Immunol 2(9). https://doi.org/ Gadala-Maria D, Yaari G, Kleinstein SH
10.1126/sciimmunol.aal2192 (2015) Change-O: a toolkit for analyzing
17. Patil VS, Madrigal A, Schmiedel BJ, Clarke J, large-scale B cell immunoglobulin repertoire
O’Rourke P, de Silva AD, Harris E, Peters B, sequencing data. Bioinformatics 31
Seumois G, Weiskopf D, Sette A, Vijayanand P (20):3356–3358. https://doi.org/10.1093/
(2018) Precursors of human CD4(+) cytotoxic bioinformatics/btv359
T lymphocytes identified by single-cell tran- 27. Stern JN, Yaari G, Vander Heiden JA,
scriptome analysis. Sci Immunol 3(19). Church G, Donahue WF, Hintzen RQ, Hutt-
https://doi.org/10.1126/sciimmunol. ner AJ, Laman JD, Nagra RM, Nylander A,
aan8664 Pitt D, Ramanan S, Siddiqui BA, Vigneault F,
18. Miragaia RJ, Gomes T, Chomka A, Jardine L, Kleinstein SH, Hafler DA, O’Connor KC
Riedel A, Hegazy AN, Lindeman I, (2014) B cells populating the multiple sclerosis
Emerton G, Krausgruber T, Shields J, brain mature in the draining cervical lymph
Haniffa M, Powrie F, Teichmann SA (2017) nodes. Sci Transl Med 6(248):248ra107.
Single cell transcriptomics of regulatory T https://doi.org/10.1126/scitranslmed.
cells reveals trajectories of tissue adaptation. 3008879
bioRxiv. https://doi.org/10.1101/217489 28. Felsenstein J (1989) PHYLIP - phylogeny
19. Langmead B, Salzberg SL (2012) Fast gapped- inference package (version 3.2). Cladistics
read alignment with bowtie 2. Nat Methods 9 5:164–166 doi:citeulike-article-id:2344765
(4):357–359. https://doi.org/10.1038/ 29. Goldstein LD, Chen Y-JJ, Dunne J, Mir A,
nmeth.1923 Hubschle H, Guillory J, Yuan W, Zhang J,
20. Grabherr MG, Haas BJ, Yassour M, Levin JZ, Stinson J, Jaiswal B, Pahuja KB, Mann I,
Thompson DA, Amit I, Adiconis X, Fan L, Schaal T, Chan L, Anandakrishnan S, Lin
Raychowdhury R, Zeng Q, Chen Z, C-w, Espinoza P, Husain S, Shapiro H,
Mauceli E, Hacohen N, Gnirke A, Rhind N, Swaminathan K, Wei S, Srinivasan M,
di Palma F, Birren BW, Nusbaum C, Lindblad- Seshagiri S, Modrusan Z (2017) Massively par-
Toh K, Friedman N, Regev A (2011) Full- allel nanowell-based single-cell gene expression
length transcriptome assembly from RNA-Seq profiling. BMC Genomics 18(1):519. https://
data without a reference genome. Nat Biotech- doi.org/10.1186/s12864-017-3893-1
nol 29(7):644–652. https://doi.org/10. 30. Brady BL, Steinel NC, Bassing CH (2010)
1038/nbt.1883 Antigen receptor allelic exclusion: an update
21. Ye J, Ma N, Madden TL, Ostell JM (2013) and reappraisal. J Immunol 185
IgBLAST: an immunoglobulin variable (7):3801–3808. https://doi.org/10.4049/
domain sequence analysis tool. Nucleic Acids jimmunol.1001158
Res 41(Web Server issue):W34–W40. https:// 31. Breden F, Luning Prak ET, Peters B, Rubelt F,
doi.org/10.1093/nar/gkt382 Schramm CA, Busse CE, Vander Heiden JA,
Single-Cell Antigen Receptor Sequence Reconstruction 249
Christley S, Bukhari SAC, Thorogood A, Mat- 39. Cui A, Di Niro R, Vander Heiden JA, Briggs
sen Iv FA, Wine Y, Laserson U, Klatzmann D, AW, Adams K, Gilbert T, O’Connor KC,
Douek DC, Lefranc MP, Collins AM, Vigneault F, Shlomchik MJ, Kleinstein SH
Bubela T, Kleinstein SH, Watson CT, Cowell (2016) A model of somatic hypermutation tar-
LG, Scott JK, Kepler TB (2017) Reproducibil- geting in mice based on high-throughput ig
ity and reuse of adaptive immune receptor rep- sequencing data. J Immunol 197
ertoire data. Front Immunol 8:1418. https:// (9):3566–3574. https://doi.org/10.4049/
doi.org/10.3389/fimmu.2017.01418 jimmunol.1502263
32. Yaari G, Kleinstein SH (2015) Practical guide- 40. Brennan PJ, Brigl M, Brenner MB (2013)
lines for B-cell receptor repertoire sequencing Invariant natural killer T cells: an innate activa-
analysis. Genome Med 7(1):121. https://doi. tion scheme linked to diverse effector func-
org/10.1186/s13073-015-0243-2 tions. Nat Rev Immunol 13(2):101–117.
33. Picelli S, Faridani OR, Bjorklund AK, https://doi.org/10.1038/nri3369
Winberg G, Sagasser S, Sandberg R (2014) 41. Dias J, Leeansyah E, Sandberg JK (2017) Mul-
Full-length RNA-seq from single cells using tiple layers of heterogeneity and subset diver-
smart-seq2. Nat Protoc 9(1):171–181. sity in human MAIT cell responses to distinct
https://doi.org/10.1038/nprot.2014.006 microorganisms and to innate cytokines. Proc
34. Svensson V, Natarajan KN, Ly L-H, Miragaia Natl Acad Sci 114(27):E5434–E5443.
RJ, Labalette C, Macaulay IC, Cvejic A, Teich- https://doi.org/10.1073/pnas.1705759114
mann SA (2017) Power analysis of single-cell 42. Li W, Cowley A, Uludag M, Gur T,
RNA-sequencing experiments. Nat Methods McWilliam H, Squizzato S, Park YM, Buso N,
14:381. https://doi.org/10.1038/nmeth. Lopez R (2015) The EMBL-EBI bioinformat-
4220 ics web and programmatic tools framework.
35. Rizzetto S, Eltahla AA, Lin P, Bull R, Lloyd Nucleic Acids Res 43(W1):W580–W584.
AR, Ho JWK, Venturi V, Luciani F (2017) https://doi.org/10.1093/nar/gkv279
Impact of sequencing depth and read length 43. Liu S, Velez M-G, Humann J, Rowland S,
on single cell RNA sequencing data of T cells. Conrad FJ, Halverson R, Torres RM, Pelanda
Sci Rep 7(1):12781. https://doi.org/10. R (2005) Receptor editing can lead to allelic
1038/s41598-017-12989-x inclusion and development of B cells that retain
36. Martin M (2011) Cutadapt removes adapter antibodies reacting with high avidity autoanti-
sequences from high-throughput sequencing gens. J Immunol 175(8):5067–5076. https://
reads. EMBnetjournal 17(1):10–12. https:// doi.org/10.4049/jimmunol.175.8.5067
doi.org/10.14806/ej.17.1.200 44. Lang J, Ota T, Kelly M, Strauch P, Freed BM,
37. Rock EP, Sibbald PR, Davis MM, Chien YH Torres RM, Nemazee D, Pelanda R (2016)
(1994) CDR3 length in antigen-specific Receptor editing and genetic variability in
immune receptors. J Exp Med 179 human autoreactive B cells. J Exp Med 213
(1):323–328. https://doi.org/10.1084/jem. (1):93–108. https://doi.org/10.1084/jem.
179.1.323 20151039
38. Yaari G, Vander Heiden JA, Uduman M, 45. Pelanda R (2014) Dual immunoglobulin light
Gadala-Maria D, Gupta N, Stern JNH, chain B cells: Trojan horses of autoimmunity?
O’Connor KC, Hafler DA, Laserson U, Curr Opin Immunol 27:53–59. https://doi.
Vigneault F, Kleinstein SH (2013) Models of org/10.1016/j.coi.2014.01.012
somatic hypermutation targeting and substitu- 46. Hoehn KB, Lunter G, Pybus OG (2017) A
tion based on synonymous mutations from phylogenetic codon substitution model for
high-throughput immunoglobulin sequencing antibody lineages. Genetics 206(1):417–427.
data. Front Immunol 4:358. https://doi.org/ https://doi.org/10.1534/genetics.116.
10.3389/fimmu.2013.00358 196303
Chapter 16
Abstract
Cells in complex tissues are organized by distinct microenvironments and anatomical structures. This spatial
environment of cells is thought to be important for division of labor and other specialized functions of
tissues. Recently developed spatial transcriptomic technologies enable the quantification of expression of
hundreds of genes while accounting for cells’ spatial coordinates, providing an opportunity to study
spatially organized structures. Here, we describe a computational pipeline for detecting the spatial organization
of cells based on a hidden Markov random field model. We illustrate this pipeline with data generated
from multiplexed smFISH from the adult mouse visual cortex.
Key words Hidden Markov random field, Spatial organization, Sequential fluorescence in situ hybridization,
Multiplexed fluorescence in situ hybridization
1 Introduction
Guo-Cheng Yuan (ed.), Computational Methods for Single-Cell Data Analysis, Methods in Molecular Biology, vol. 1935,
https://doi.org/10.1007/978-1-4939-9057-3_16, © Springer Science+Business Media, LLC, part of Springer Nature 2019
252 Qian Zhu
2 Materials
2.1 Prerequisites We require R version 3 and Python 2.7. The following Python
prerequisite packages need to be installed: seaborn (0.7.0 or later),
pandas, numpy, scipy, matplotlib. Use pip install --user <package
name> to install any missing packages. The following R packages
are also required: lattice, misc3d, oro.nifti, pracma, Matrix,
mvtnorm. We also require Java (version 7 or 8) and the GraphColoring
package.
2.3 Python smfishHmrf Package This package contains the wrapper and interface functions for interacting
with the R smfishHmrf package and is required for running HMRF
and downstream visualizations. Install by:
Fig. 1 Example of imaged sections in the mouse visual cortex. Each blue box shows a section, also called a
field. Sections are stitched together to form a global tissue view
2.4 Spatial Single-Cell Data Set Our pipeline is general to all types of spatial transcriptomic data.
For this chapter, we focus on a mouse visual cortex data set generated
by the seqFISH technology. This input data set is a mouse
coronal brain slice that has been imaged and is composed of various
sections of the hippocampus and visual cortex tissues (Fig. 1). Each
section, also called a field, measures 1020 × 1020 units (each unit is
equivalent to 220 nm). These sections are imaged and processed
together, and two pieces of information are provided: (1) cell coordinates
(two versions: relative to each field, and relative to a stitched
image which stitches adjacent fields in the cortex), and (2) cellular gene
expression for 125 genes.
In general, we require the cell coordinate file and the gene
expression file to carry out a spatial domain inference analysis.
The specifications of these two files are as follows.
The cell coordinate file is made up of four headerless columns,
separated by spaces: cell_index, field_ID, x-coord, y-coord. All fields
must be numerical. Cell index = 1...N, where N is the number of
cells. Field ID specifies the field of view in which the cell is located.
X-coord and y-coord specify the coordinates of the cell in the
respective field of view. Coordinates can be floating-point decimals
and can be negative. See a snippet of this file below:
1 0 675.080 -37.330
2 0 265.760 -231.140
3 0 753.460 -261.140
4 0 290.480 -261.520
5 0 991.430 -482.350
6 0 926.420 -675.880
7 0 414.500 -688.670
8 0 607.180 -773.680
9 0 715.720 -822.110
10 0 654.580 -896.760
11 0 472.450 -952.830
12 0 257.120 -133.350
13 0 700.010 -169.050
14 0 415.630 -252.450
1 1.08 0.60 0.95 0.51 -1.67 0.65 1.14 0.79 0.61 0.18 0.74 0.62
0.86 0.61 0.59 0.34 -2.15 0.20 0.87 0.68 0.00 0.21 -0.00 0.49
-0.43 1.07 0.55 -1.32 0.10 1.31 1.28 0.18 0.63 0.03 0.52 0.47 0.55
0.98 0.33 0.64 -0.18 1.21 1.67 -0.37 1.04 0.11 -0.63 0.89 -0.39
-2.28 -0.02 0.66 0.92 -0.81 -0.39 -1.02 0.61 -0.05 -0.01 0.61
-0.01 -0.38 -0.05 1.52 -1.25 0.13 0.38 0.70 -0.02 0.11 -0.49 -0.35
-1.47 0.30 0.13 0.39 0.45 -2.69 0.14 0.74 0.24 -0.60 -1.08 -0.65
0.70 0.03 -1.56 -0.01 0.30 0.28 1.55 -0.25 -0.33 -0.10 -1.55 0.06
-0.77 0.41 -0.75 0.17 -1.35 -0.65 -2.06 1.42 -0.73 -2.55 -1.04
-1.35 -0.71 -0.76 -1.01 -1.69 -1.70 3.26 0.44 -1.65 -0.72 -1.40
-1.27 0.83 -0.74 -0.70 -0.05 2.65 1.76
2 1.82 1.60 2.38 -0.02 -0.26 1.94 0.08 0.07 0.91 -0.20 1.57 -0.08
0.06 1.53 0.40 0.18 0.41 1.44 0.19 0.01 0.19 -0.17 -0.26 0.06
1.18 0.02 0.01 0.06 -0.31 -0.05 -0.64 -0.01 -0.17 -0.58 1.31 0.94
0.59 1.58 -0.01 -0.10 -0.13 0.05 -0.20 0.00 -0.07 -0.07 0.19
-0.23 -0.02 -0.35 -0.68 1.27 -0.21 -0.23 -0.49 0.87 -0.33 0.41
1.25 -0.15 -0.70 -0.29 -0.47 1.98 0.01 -0.19 0.06 0.96 -0.23
-0.21 -0.19 -0.39 -0.85 -0.25 0.06 -0.40 -0.63 -0.30 -0.27 -0.29
-0.61 -0.08 -0.61 0.28 0.30 0.03 1.02 0.22 1.27 1.13 2.08 0.18
-0.22 1.31 -0.17 0.82 -0.15 1.74 -0.40 -0.40 -0.23 -2.36 -1.55
0.06 -0.97 -1.79 -1.38 -0.95 -1.69 -1.91 -2.38 -2.60 -1.49 2.32
0.16 0.05 -1.41 -2.14 -1.38 0.23 -2.20 -1.60 -1.40 2.42 0.54
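The two input files can be loaded with a few lines of NumPy. The sketch below is only an illustration: the function names load_coordinates and load_expression are hypothetical and are not part of the smfishHmrf package. We also assume, based on the snippets above, that the expression file holds one row per cell, starting with the cell index and followed by one value per gene. The demo rows are the first two cells shown above, with the expression values truncated to the first three genes.

```python
import io
import numpy as np

def load_coordinates(source):
    """Parse headerless rows 'cell_index field_ID x y' (space separated)."""
    table = np.loadtxt(source, ndmin=2)
    if table.shape[1] != 4:
        raise ValueError("expected 4 columns, got %d" % table.shape[1])
    cell_index = table[:, 0].astype(int)
    field = table[:, 1].astype(int)
    xy = table[:, 2:4]
    return cell_index, field, xy

def load_expression(source):
    """Parse rows 'cell_index v1 v2 ...' (assumed layout, one row per cell)."""
    table = np.loadtxt(source, ndmin=2)
    return table[:, 0].astype(int), table[:, 1:]

# Demo input: file-like objects stand in for the real files on disk.
demo_coords = io.StringIO(u"1 0 675.080 -37.330\n2 0 265.760 -231.140\n")
demo_expr = io.StringIO(u"1 1.08 0.60 0.95\n2 1.82 1.60 2.38\n")

coord_ids, field_ids, xy = load_coordinates(demo_coords)
expr_ids, expr_values = load_expression(demo_expr)

# Both files must describe the same cells in the same order.
if not np.array_equal(coord_ids, expr_ids):
    raise ValueError("cell indices differ between the two files")
```

Checking that the cell indices agree before any analysis is a cheap guard against pairing coordinates with the wrong expression rows.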
3 Methods
3.1.1 Gene Selection Selection of spatially coherent genes can aid the HMRF modeling.
To help us find spatially coherent genes, we define a score as
follows. For each gene, we divide all the cells based on their bias-corrected
expression of that gene into two classes: 1, the expressed class,
which corresponds to the 90th percentile of the gene’s expression
distribution, and 0, the remaining cells. We use the silhouette metric
to measure how spatially coherent the 1-marked cells are. Here, we
use the rank-normalized, exponentially transformed distance to
emphasize the local physical distance between cells. For a pair of
cells, si and sj, this distance is defined as:
$$ r(s_i, s_j) = 1 - p^{\mathrm{rank}_d(s_i, s_j) - 1} \tag{3} $$
where rankd(si, sj) is the mutual rank [17] of si and sj in the vectors
of Euclidean distances {Euc(si, ∗)} and {Euc(sj, ∗)}, and p is a rank-weighting
constant set between 0.95 and 0.99. The spatial coherence
of the gene is calculated as the silhouette coefficient of the
spatial distance between the two cell sets:
$$ S_g = \frac{1}{|L_1|} \sum_{s_i \in L_1} \frac{m_i - n_i}{\max(m_i, n_i)} \tag{4} $$
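The score in Eqs. 3 and 4 can be sketched in NumPy as follows. This is a minimal illustration and not the implementation in the smfishHmrf package. It rests on several assumptions of ours: L1 denotes the 1-marked cells, taken here as the cells at or above the 90th percentile; the mutual rank is the geometric mean of the two directed ranks; and, following the usual silhouette convention, mi is the mean distance from cell si to the 0-marked cells and ni is the mean distance to the other 1-marked cells. The function names are hypothetical.

```python
import numpy as np

def rank_transformed_distance(coords, p=0.95):
    """Eq. 3: r = 1 - p**(mutual_rank - 1) from pairwise Euclidean distances."""
    diff = coords[:, None, :] - coords[None, :, :]
    euc = np.sqrt((diff ** 2).sum(axis=2))
    # rank[i, j] = position of cell j in cell i's sorted distance vector
    # (a cell is normally its own rank-1 entry because its distance is 0).
    order = np.argsort(euc, axis=1, kind="mergesort")
    rank = np.empty_like(order)
    rows = np.arange(euc.shape[0])[:, None]
    rank[rows, order] = np.arange(1, euc.shape[1] + 1)[None, :]
    # Assumption: mutual rank = geometric mean of the two directed ranks.
    mutual = np.sqrt(rank * rank.T)
    return 1.0 - p ** (mutual - 1.0)

def spatial_silhouette(coords, expression, p=0.95, percentile=90):
    """Eq. 4 for one gene; expression holds one value per cell."""
    r = rank_transformed_distance(coords, p)
    in_l1 = expression >= np.percentile(expression, percentile)
    l1 = np.flatnonzero(in_l1)
    scores = []
    for i in l1:
        others = l1[l1 != i]
        m_i = r[i, ~in_l1].mean()   # mean distance to the 0-marked cells
        n_i = r[i, others].mean()   # mean distance to the other 1-marked cells
        scores.append((m_i - n_i) / max(m_i, n_i))
    return float(np.mean(scores))

# Toy data: a 20 x 20 grid of cells and two artificial "genes".
side = np.arange(20, dtype=float)
grid = np.array([(x, y) for x in side for y in side])
clustered = -np.hypot(grid[:, 0], grid[:, 1])               # high near one corner only
scattered = np.random.RandomState(0).permutation(clustered)  # same values, random places

score_clustered = spatial_silhouette(grid, clustered)
score_scattered = spatial_silhouette(grid, scattered)
```

On this toy grid the gene expressed in one corner scores higher than the same values placed at random, which is the behavior the score is meant to capture. The sketch builds a full cell-by-cell distance matrix, so it is only practical for a few thousand cells.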
3.1.3 Number of Clusters We initialize HMRF according to the K-means clustering results.
The value of K is selected on the basis of the gap statistic.
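To make the idea concrete, the gap statistic compares the within-cluster dispersion of the data with that of reference data drawn uniformly from the same bounding box. The sketch below is a simplified illustration with our own hypothetical function names; it is not the routine used by the pipeline. It uses a small NumPy K-means with k-means++ style seeding, and it picks the K with the largest gap, which is a simplification of the published selection rules.

```python
import numpy as np

def seed_centroids(data, k, rng):
    """k-means++ style seeding: later seeds favor points far from earlier ones."""
    centroids = [data[rng.randint(len(data))]]
    for _ in range(1, k):
        d2 = np.min([((data - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(data[rng.choice(len(data), p=d2 / d2.sum())])
    return np.array(centroids)

def kmeans_dispersion(data, k, rng, n_iter=30):
    """Run Lloyd's algorithm and return the within-cluster sum of squares."""
    centroids = seed_centroids(data, k, rng)
    for _ in range(n_iter):
        d2 = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):      # an empty cluster keeps its old centroid
                centroids[c] = data[labels == c].mean(axis=0)
    d2 = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

def gap_statistic(data, k_values, n_ref=10, n_start=5, seed=0):
    """Gap(k) = mean log-dispersion of uniform reference data - log-dispersion of data."""
    rng = np.random.RandomState(seed)
    low, high = data.min(axis=0), data.max(axis=0)

    def log_w(x, k):
        # Best of n_start restarts, to reduce the effect of poor local optima.
        return np.log(min(kmeans_dispersion(x, k, rng) for _ in range(n_start)))

    gaps = []
    for k in k_values:
        ref = [log_w(rng.uniform(low, high, size=data.shape), k)
               for _ in range(n_ref)]
        gaps.append(np.mean(ref) - log_w(data, k))
    return np.array(gaps)

# Toy data: three well-separated groups of 50 points each.
demo_rng = np.random.RandomState(1)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
blobs = np.vstack([c + 0.5 * demo_rng.randn(50, 2) for c in centers])
k_values = [1, 2, 3, 4, 5]
gaps = gap_statistic(blobs, k_values)
best_k = k_values[int(np.argmax(gaps))]
```

In practice the input would be the expression matrix restricted to the selected genes rather than toy 2D points.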
We next model the bias pattern of all genes using PCA (do_pca
function). The contributions of the top principal components
are subtracted from the expression matrix (expr). The result is
contained in corrected_expr.
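The subtraction of the top principal components can be illustrated with a generic NumPy sketch. This is not the do_pca function itself, whose inputs and internals are not shown here. The function name remove_top_components is hypothetical, and we assume a matrix with cells as rows and genes as columns.

```python
import numpy as np

def remove_top_components(matrix, n_components):
    """Subtract the contribution of the top principal components."""
    mean = matrix.mean(axis=0)
    centered = matrix - mean
    # SVD of the centered matrix gives the principal components.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    top = (u[:, :n_components] * s[:n_components]).dot(vt[:n_components])
    # Add the column means back so that only the top components are removed.
    return centered - top + mean

# Toy data: random values plus a strong trend shared by all columns,
# standing in for a systematic bias.
toy_rng = np.random.RandomState(0)
noise = toy_rng.randn(200, 10)
bias = np.outer(np.linspace(-5, 5, 200), np.ones(10))
expr_demo = noise + bias
corrected_demo = remove_top_components(expr_demo, n_components=1)
```

After the correction, the leading singular value of the centered result equals the second singular value of the centered input, which confirms that exactly one component was removed.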
Fig. 2 Top four principal components associated with the illumination bias. Each panel is a 50 bins by 50 bins
heatmap showing the bias level associated with each region of the microscope field of view. Left to right: PC1,
2, 3, 4
# reader is assumed to be the reader module of the Python smfishHmrf
# package, imported in an earlier step.
directory = "workdir"
genes = reader.read_genes("%s/genes" % directory)
Xcen, field = reader.read_coordinates(
    "%s/fcortex.coordinates.txt" % directory)
expr = reader.read_expression_matrix(
    "%s/fcortex.expression.txt" % directory)
this_dset.calc_independent_region()
This gives more emphasis to the distances between cells that are
close to each other (Eq. 4).
and removed these genes from the spatial gene list from the
previous step. This results in a 69-gene list that is used for
HMRF (see file HMRF.genes).
Optionally, users may also select genes that are correlated
with the top principal components from PCA.
See the spatial.pc_genes() function, which computes the significant
genes of each component with the jackstraw algorithm
[19]. Users can combine this criterion and the spatial criterion
in defining suitable genes for HMRF.
9. Once we decide on the genes to use for HMRF, we load this list
(HMRF.genes file in the source data package), and we create a
data subset using these genes with the subset_genes function.
The init() and run() functions are wrappers for the R scripts where the
core of HMRF is implemented. init() determines the initial conditions
by running K-means to obtain cluster centroids and
covariance matrices, and initializes HMRF to these settings.
Within init(), nstart is the number of random starts
(a parameter of kmeans in R); seed allows the initial configuration
to be fixed (default is 1 or unset). run() performs the
HMRF modeling, including an expectation-maximization procedure
that iteratively estimates the parameters of the HMRF
model until the convergence criterion is met. run() iterates
over all values of K and beta. At the end, files are automatically
generated in the output directory (see Note 6), with a copy of
the results loaded in the class instance.
12. Next, we visualize the spatial clusters in 2D. Figure 4 shows the
result for K = 9, beta = 9.0 and indicates a resemblance of the
structure to the visual cortex layers.
Fig. 4 The spatial domain organization of the visual cortex, revealed by our HMRF modeling of the spatial data.
Each color indicates cells belonging to a spatial domain. There are nine spatial domains (K = 9) and beta is set
to 9.0
import numpy as np
import pandas as pd
import seaborn as sns

# this_hmrf and perturbed_hmrf are assumed to be the HMRF instances from the
# earlier steps, fitted on the observed and the shuffled data, respectively.
k = 9
betas = np.array(list(range(0, 90, 5)) + list(range(90, 150, 10))) / 10.0
lik_data, diff_data = [], []
for b in betas:
    lik_data.append((b, "observed", this_hmrf.likelihood[(k, b)]))
    lik_data.append((b, "random", perturbed_hmrf.likelihood[(k, b)]))
    diff_data.append((b, "obs - rand",
                      this_hmrf.likelihood[(k, b)] -
                      perturbed_hmrf.likelihood[(k, b)]))
# Long-format tables with one row per (beta, label) pair.
a_lik = pd.DataFrame(data={"label": [v[1] for v in lik_data],
                           "beta": [v[0] for v in lik_data],
                           "log-likelihood": [v[2] for v in lik_data]})
d_lik = pd.DataFrame(data={"label": [v[1] for v in diff_data],
                           "beta": [v[0] for v in diff_data],
                           "log-likelihood": [v[2] for v in diff_data]})
# Scatter plots without a regression line (fit_reg=False).
axn = sns.lmplot(x="beta", y="log-likelihood", hue="label",
                 data=a_lik, fit_reg=False)
axn = sns.lmplot(x="beta", y="log-likelihood", hue="label",
                 data=d_lik, fit_reg=False)
Fig. 5 Spatial perturbation analysis. We fully shuffled the spatial positions such that 100% of cells’ positions
are exchanged. (a) Log-likelihood of the HMRF model is calculated and plotted for each value of beta from 0 to
20.0. (b) Difference in log-likelihood between observed and randomly shuffled data is plotted
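A full shuffle in which every cell receives the position of a different cell can be sketched as follows. This is an illustration under our own assumptions rather than the perturbation routine used to produce Fig. 5; the function name shuffle_positions is hypothetical. We use Sattolo's algorithm, which produces a permutation with a single cycle and therefore never leaves an index in place.

```python
import numpy as np

def shuffle_positions(coords, seed=0):
    """Permute the rows of coords so that no cell keeps its own row."""
    rng = np.random.RandomState(seed)
    n = len(coords)
    if n < 2:
        raise ValueError("need at least two cells to exchange positions")
    perm = np.arange(n)
    # Sattolo's algorithm: swap position i with a strictly earlier position.
    for i in range(n - 1, 0, -1):
        j = rng.randint(0, i)          # j is drawn from 0 .. i-1, never i itself
        perm[i], perm[j] = perm[j], perm[i]
    return coords[perm], perm

# Toy coordinates for ten cells.
toy_coords = np.arange(20, dtype=float).reshape(10, 2)
shuffled_coords, perm = shuffle_positions(toy_coords)
```

The expression matrix is left untouched, so each cell keeps its expression profile but is assigned the coordinates of another cell. Note that two cells with identical coordinates would still appear unmoved after the exchange.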
4 Notes
1. Output:
4. Silhouette output:
gene sil.score p-value
amigo2 0.0555237 0.0
cldn5 0.0262171 0.0
calb1 0.0189745 0.0
kcnip 0.0187823 0.0
tbr1 0.0173751 0.0
pax6 0.0169212 0.0
nes 0.0156901 0.0
gda 0.0150629 0.0
col5a1 0.0148561 0.0
loxl1 0.0120843 0.0
sox2 0.011049 0.0
slc5a7 0.00993408 0.0
nov 0.00985005 0.0
itpr2 0.00915686 0.0
cpne5 0.00913211 0.0
Nell1 0.00875134 0.0
mrc1 0.00864791 0.0
rhob 0.00830748 0.0
acta2 0.00802404 0.0
...
Foxa1 0.000510314 0.28
Zfp715 0.000449031 0.34
Galnt3 0.000186029 0.49
Blzf1 -0.000113328 0.64
Laptm5 -0.000492621 0.81
Gm6377 -0.000751389 0.87
Zfp90 -0.00092005 0.91
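A table in this format can be filtered with a short script to obtain a candidate list of spatial genes. The sketch below is illustrative only: the function name select_spatial_genes is hypothetical, and the cutoff of 0.05 on the p-value and the requirement of a positive score are our assumptions, not thresholds stated in this chapter. The demo rows are copied from the output above.

```python
def select_spatial_genes(lines, max_p=0.05):
    """Return genes with a positive silhouette score and p-value <= max_p."""
    selected = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3 or parts[0] == "gene":
            continue                   # skip the header, "..." and blank lines
        name, score, p_value = parts[0], float(parts[1]), float(parts[2])
        if score > 0 and p_value <= max_p:
            selected.append(name)
    return selected

demo_lines = [
    "gene sil.score p-value",
    "amigo2 0.0555237 0.0",
    "cldn5 0.0262171 0.0",
    "...",
    "Foxa1 0.000510314 0.28",
    "Blzf1 -0.000113328 0.64",
]
spatial_genes = select_spatial_genes(demo_lines)
```

With these rows, only amigo2 and cldn5 pass the filter.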
References
Smart-seq..........................................................71, 72, 156
Somatic hypermutations (SHMs) ...............................224, X
238, 244, 245
10X Genomics............................................. 156, 204, 216
Spatial domains ...................................251–253, 261, 262
Spatial transcriptomics .......................................v, 49, 253,
257, 260, 262, 263, 266