Next-Generation Sequencing Data Analysis
For each NGS application, this book covers topics ranging from experimental design,
sample processing, and sequencing strategy formulation to sequencing read quality
control, data preprocessing, read mapping or assembly, and the more advanced
analytic stages specific to each application. Major applications covered include
transcriptomic profiling (bulk and single-cell RNA-seq), genetic mutation and
variation identification, de novo genome assembly, protein-DNA interaction
analysis (ChIP-seq), epigenomics and DNA methylation study (methyl-seq), and
metagenomics.
Before detailing the analytic steps for each of these applications, the book
presents introductory cellular and molecular biology as a refresher mostly
for data scientists, the ins and outs of widely used NGS platforms, and an
overview of computing needs for NGS data management and analysis. The
book concludes with a chapter on the changing landscape of NGS technolo-
gies and data analytics.
The second edition of this book builds on the well-received first edition
by providing updates to each chapter. Two brand new chapters have been
added to meet rising data analysis demands in single-cell RNA-seq and clinical
sequencing. The increasing use of long-read sequencing is also reflected across
all NGS applications. This book discusses concepts and principles
that underlie each analytic step, along with software tools for implementa-
tion. It highlights key features of the tools while omitting tedious details to
provide an easy-to-follow guide for practitioners in life sciences, bioinfor-
matics, biostatistics, and data science. Tools introduced in this book are open
source and freely available.
Next-Generation Sequencing Data Analysis
Second Edition
Xinkun Wang
Contents
4.2.1.2 Implementation.......................................................... 60
4.2.1.3 Error Rate, Read Length, Data Output,
and Cost....................................................................... 63
4.2.1.4 Sequence Data Generation........................................ 63
4.2.2 Pacific Biosciences Single-Molecule Real-Time
(SMRT) Long-Read Sequencing............................................... 64
4.2.2.1 Sequencing Principle.................................................. 64
4.2.2.2 Implementation.......................................................... 64
4.2.2.3 Error Rate, Read Length, Data Output,
and Cost....................................................................... 65
4.2.2.4 Sequence Data Generation........................................ 65
4.2.3 Oxford Nanopore Technologies (ONT) Long-Read
Sequencing.................................................................................. 67
4.2.3.1 Sequencing Principle.................................................. 67
4.2.3.2 Implementation.......................................................... 68
4.2.3.3 Error Rate, Read Length, Data Output,
and Cost....................................................................... 68
4.2.3.4 Sequence Data Generation........................................ 69
4.2.4 Ion Torrent Semiconductor Sequencing.................................. 69
4.2.4.1 Sequencing Principle.................................................. 69
4.2.4.2 Implementation.......................................................... 70
4.2.4.3 Error Rate, Read Length, Data Output,
and Cost....................................................................... 70
4.2.4.4 Sequence Data Generation........................................ 72
4.3 A Typical NGS Workflow...................................................................... 72
4.4 Biases and Other Adverse Factors That May Affect NGS Data
Accuracy.................................................................................................. 74
4.4.1 Biases in Library Construction................................................. 74
4.4.2 Biases and Other Factors in Sequencing................................. 75
4.5 Major Applications of NGS................................................................... 76
4.5.1 Transcriptomic Profiling (Bulk and Single-Cell
RNA-Seq)..................................................................................... 76
4.5.2 Genetic Mutation and Variation Identification...................... 77
4.5.3 De Novo Genome Assembly...................................................... 77
4.5.4 Protein-DNA Interaction Analysis (ChIP-Seq)....................... 77
4.5.5 Epigenomics and DNA Methylation Study
(Methyl-Seq)................................................................................ 77
4.5.6 Metagenomics............................................................................. 78
12. De Novo Genome Assembly with Long and/or Short Reads............... 271
12.1 Genomic Factors and Sequencing Strategies for
De Novo Assembly................................................................................ 272
Part I
1
The Cellular System and the Code of Life
1.3 Molecules in Cells
Different types of molecules are needed to carry out the various cellular
processes. In a typical cell, water is the most abundant molecule, representing 70% of
the total cell weight. Besides water, there are a large variety of small and large
molecules. The major categories of small molecules include inorganic ions
(Na+, K+, Ca2+, Cl-, Mg2+, etc.), monosaccharides, fatty acids, amino acids, and
nucleotides. Major varieties of large molecules are polysaccharides, lipids,
proteins, and nucleic acids (DNA and RNA). Among these components,
the inorganic ions are important for signaling (e.g., waves of Ca2+ serve as an
important intracellular signal), cellular energy storage (e.g., in the form of the
Na+/K+ cross-membrane gradient), or protein structure/function (e.g., Mg2+ is
an essential cofactor for many metalloproteins). Carbohydrates (including
monosaccharides and polysaccharides), fatty acids, and lipids are major
energy-providing molecules in the cell. Lipids are also the major component
of cell membrane. Proteins, which are assembled from 20 types of amino acids
in different order and length, underlie almost all cellular activities, including
metabolism, signal transduction, DNA replication, and cell division. They
are also the building blocks of many subcellular structures, such as cytoskel-
eton (see next section). Nucleic acids carry the code of life in their nearly
endless nucleotide permutations, which not only provide instructions for the
assembly of all proteins in cells but also control how such assembly is carried
out in response to environmental conditions.
1.4.1 Nucleus
Since DNA stores the code of life, it must be protected and properly maintained
to avoid possible damage and ensure accuracy and stability. As proper execu-
tion of the genetic information embedded in the DNA is critical to the normal
functioning of a cell, gene expression must also be tightly regulated under
FIGURE 1.1
The general structure of a typical eukaryotic cell. Shown here is an animal cell, with labeled structures including the nucleus (with nuclear envelope, nuclear pores, chromatin, and nucleolus), cell membrane, cytoplasm, cytoskeleton (microtubules, microfilaments, and intermediate filaments), centrosome, ribosomes, rough and smooth ER, Golgi apparatus, endosomes, lysosomes, peroxisomes, and mitochondria.
all conditions. The nucleus, located in the center of most cells in eukaryotes,
offers a well-protected environment for DNA storage, maintenance, and gene
expression. The nuclear space is enclosed by nuclear envelope consisting of
two concentric membranes. To allow movement of proteins and RNAs across
the nuclear envelope, which is essential for gene expression, there are pores
on the nuclear envelope that span the inner and outer membrane. The mech-
anical support of the nucleus is provided by the nucleoskeleton, a network
of structural proteins including lamins and actin among others. Inside the
nucleus, long strings of DNA molecules, through binding to certain proteins
called histones, are heavily packed to fit into the limited nuclear space. In
prokaryotic cells, a nucleus-like, irregularly shaped region called the nucleoid,
which has no membrane enclosure, provides a similar but less well-protected
space for DNA.
1.4.2 Cell Membrane
The cell membrane serves as a barrier to protect the internal structure of a
cell from the outside environment. Biochemically, the cell membrane, as well
as all other intracellular membranes such as the nuclear envelope, assumes
a lipid bilayer structure. While offering protection to the cell's internal structure,
the cell membrane is also where cells exchange materials, and concurrently
energy, with the outside environment. Since the membrane is made of lipids,
most water-soluble substances, including ions, carbohydrates, amino acids,
and nucleotides, cannot directly cross it. To overcome this barrier, there are
channels, transporters, and pumps, all of which are specialized proteins, on
the cell membrane. Channels and transporters facilitate passive movement,
that is, in the direction from high to low concentration, without consumption
of cellular energy. Pumps, on the other hand, provide active transportation of
the molecules, since they transport the molecules against the concentration
gradient and therefore consume energy.
The cell membrane is also where a cell receives most incoming signals from
the environment. After signal molecules bind to their specific receptors on the
cell membrane, the signal is relayed to the inside, usually eliciting a series of
intracellular reactions. The ultimate cellular response that the signal induces
is dependent on the nature of the signal, as well as the type and condition
of the cell. For example, upon detecting insulin in the blood via the insulin
receptor in their membrane, cells in the liver respond by taking up glucose
from the blood for storage.
1.4.3 Cytoplasm
Inside the cell membrane, cytoplasm is the thick solution that contains the
majority of cellular substances, including all organelles in eukaryotic cells
but excluding the nucleus in eukaryotic cells and the DNA in prokary-
otic cells. The general fluid component of the cytoplasm that excludes the
organelles is called the cytosol. The cytosol makes up more than half of the
cellular volume and is where many cellular activities take place, including
a large number of metabolic steps such as glycolysis and interconversion of
molecules, and most signal transduction steps. In prokaryotic cells, due to
the lack of the nucleus and other specialized organelles, the cytosol is almost
the entire intracellular space and where most cellular activities take place.
Besides water, the cytosol contains large amounts of small and large
molecules. Small molecules, such as inorganic ions, provide an overall bio-
chemical environment for cellular activities. In addition, ions such as Na+,
K+, and Ca2+ also have substantial concentration differences between the
cytosol and the extracellular space. Cells spend a lot of energy maintaining
these concentration differences, and use them for signaling and metabolic
purposes. For example, the concentration of Ca2+ in the cytosol is normally
kept very low at ~10−7 M whereas in the extracellular space it is ~10−3 M. The
rushing in of Ca2+ under certain conditions through ligand- or voltage-
gated channels serves as an important messenger, inducing responses in a
number of signaling pathways, some of which lead to altered gene expres-
sion. Besides small molecules, the cytosol also contains large numbers of
macromolecules. Far from being simply randomly diffusing in the cytosol,
these large molecules form molecular machines that collectively function as
1.4.5 Ribosome
The ribosome is the protein assembly factory of the cell, translating genetic infor-
mation carried in messenger RNAs (mRNAs) into proteins. There are vast
1.4.6 Endoplasmic Reticulum
As indicated by its name, the ER is a network of membrane-enclosed spaces
throughout the cytosol. These spaces interconnect and form a single internal
environment called the ER lumen. There are two types of ERs in cells: rough
ER and smooth ER. The rough ER is where all cell membrane proteins, such
as ion channels, transporters, pumps, and signal molecule receptors, as well
as secretory proteins, such as insulin, are produced and sorted. The charac-
teristic surface roughness of this type of ER comes from the ribosomes that
bind to its outer surface. Proteins destined for the cell membrane or secre-
tion, once emerging from these ribosomes, are threaded into the ER lumen.
This ER-targeting process is mediated by a signal sequence, or “address
tag,” located at the beginning part of these proteins. This signal sequence
is subsequently cleaved off inside ER before the protein synthesis process is
complete. Functionally different from the rough ER, the smooth ER plays an
important role in lipid synthesis for the replenishment of cellular membranes.
Besides membrane and secretory protein preparation and lipid synthesis,
one other important function of ER is to sequester Ca2+ from the cytosol. In
Ca2+-mediated cell signaling, shortly after entry of the calcium wave into the
cytosol, most of the incoming Ca2+ needs to be pumped out of the cell and/or
sequestered into specific organelles such as ER and mitochondria.
1.4.7 Golgi Apparatus
Besides ER, the Golgi apparatus also plays an indispensable role in sorting
as well as dispatching proteins to the cell membrane, extracellular space,
or other subcellular destinations. Many proteins synthesized in the ER are
sent to the Golgi apparatus via small vesicles for further processing before
being sent to their final destinations. Therefore the Golgi apparatus is
sometimes metaphorically described as the “post office” of the cell. The pro-
cessing carried out in this organelle includes chemical modification of some
of the proteins, such as adding oligosaccharide side chains, which serve as
“address labels.” Other important functions of the Golgi apparatus include
synthesizing carbohydrates and extracellular matrix materials, such as the
polysaccharide for the building of the plant cell wall.
1.4.8 Cytoskeleton
Cellular processes like the trafficking of proteins in vesicles from ER to the
Golgi apparatus, or the movement of a mitochondrion from one intracellular
location to another, are not simply based on diffusion. Rather, they follow
certain protein-made skeletal structure inside the cytosol, that is, the cyto-
skeleton, as tracks. Besides providing tracks for intracellular transport, the
cytoskeleton, like the skeleton in the human body, plays an equally important
role in maintaining cell shape, and protecting the cell framework from phys-
ical stresses as the lipid bilayer cell membrane is fragile and vulnerable to
such stresses. In eukaryotic cells, there are three major types of cytoskeletal
structures: microfilament, microtubule, and intermediate filament. Each type
is made of distinct proteins and has its own unique characteristics and
functions. For example, microfilament and microtubule are assembled from
actins and tubulins, respectively, and have different thickness (the diameter is
around 6 nm for microfilament and 23 nm for microtubule). While biochem-
ically and structurally different, both the microfilament and the microtubule
have been known to provide tracks for mRNA transport in the form of large
ribonucleoprotein complexes to specific intracellular sites, such as the distal
end of a neuronal dendrite, for targeted protein translation [4]. Besides its role
in intracellular transportation, the microtubule also plays a key role in cell
division through attaching to the duplicated chromosomes and moving them
equally into two daughter cells. In this process, all microtubules involved are
organized around a small organelle called a centrosome. Previously thought
to be only present in eukaryotic cells, cytoskeletal structure has also been
discovered in prokaryotic cells [5].
1.4.9 Mitochondrion
The mitochondrion is the “powerhouse” in eukaryotic cells. While some
energy is produced from the glycolytic pathway in the cytosol, most
energy is generated from the Krebs cycle and the oxidative phosphor-
ylation process that take place in the many mitochondria contained in a
cell. The number of mitochondria in a cell is ultimately dependent on its
energy demand. The more energy a cell needs, the more mitochondria
it has. Structurally, the mitochondrion is an organelle enclosed by two
membranes. The outer membrane is highly permeable to most cytosolic
molecules, and as a result the intermembrane space between the outer and
inner membranes is similar to the cytosol. Most of the energy-releasing
process occurs in the inner membrane and in the matrix, that is, the space
enclosed by the inner membrane. For the energy release, high-energy elec-
tron carriers generated from the Krebs cycle in the matrix are fed into an
electron transport chain embedded in the inner membrane. The energy
released from the transfer of high-energy electrons through the chain to
molecular oxygen (O2), the final electron acceptor, creates a proton gra-
dient across the inner membrane. This proton gradient serves as the
energy source for the synthesis of ATP, the universal energy currency in
cells. In prokaryotic cells, since they do not have this organelle, ATP syn-
thesis takes place on their cytoplasmic membrane instead.
According to the widely accepted endosymbiotic theory, the mitochondrion
originated from an ancient α-proteobacterium. Not surprisingly, then, the
mitochondrion carries its own DNA, but the genetic information contained
in the mitochondrial DNA (mtDNA) is extremely limited compared to the
nuclear DNA. The human mitochondrial DNA, for example, is 16,569 bp
in size coding for 37 genes, including 22 for transfer RNAs (tRNAs), 2
for rRNAs, and 13 for mitochondrial proteins. While it is much smaller
compared to the nuclear genome, there are multiple copies of mtDNA
molecules in each mitochondrion. Since cells usually contain hundreds
to thousands of mitochondria, there are a large number of mtDNA
molecules in each cell. In comparison, most cells only contain two copies
of the nuclear DNA. As a result, when sequencing cellular DNA samples,
sequences derived from mitochondrial DNA usually comprise a notable,
sometimes substantial, percentage of total generated reads. Although
small, the mitochondrial genomic system is fully functional and has the
entire set of protein factors for mtDNA transcription, translation, and
replication. As a result of its activity, when cellular RNA molecules are
sequenced, those transcribed from the mitochondrial genome also gen-
erate significant amounts of reads in the sequence output.
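To make this point concrete, the following back-of-envelope sketch in Python (an illustration, not material from the book) estimates what fraction of whole-genome sequencing reads might derive from mtDNA if reads are sampled in proportion to DNA content; the mitochondria-per-cell and mtDNA-copies-per-mitochondrion values are hypothetical assumptions chosen only for the example.

```python
# Illustrative estimate of the fraction of whole-genome sequencing reads
# expected to come from mtDNA, assuming reads are sampled in proportion
# to total DNA mass. Copy numbers below are assumptions, not measurements.

MT_GENOME_BP = 16_569          # human mtDNA size (from the text)
NUCLEAR_GENOME_BP = 3.1e9      # haploid nuclear genome, ~3.1 Gb (assumed round figure)
NUCLEAR_COPIES = 2             # diploid nucleus

def mtdna_read_fraction(mitochondria_per_cell: int, mtdna_copies_per_mito: int) -> float:
    """Fraction of reads expected to originate from mtDNA."""
    mt_mass = MT_GENOME_BP * mitochondria_per_cell * mtdna_copies_per_mito
    nuclear_mass = NUCLEAR_GENOME_BP * NUCLEAR_COPIES
    return mt_mass / (mt_mass + nuclear_mass)

# Hypothetical cell with 1,000 mitochondria and 5 mtDNA copies in each:
print(f"{mtdna_read_fraction(1_000, 5):.2%}")   # ~1.3% of all reads
```

Even with these modest assumptions, mtDNA-derived reads already reach a percent or so of the output, which is why they are routinely noticeable in whole-genome data.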
The many copies of mtDNA molecules in a cell may not all have the same
sequence due to mutations in individual molecules. Heteroplasmy occurs
when cells contain a heterogeneous set of mtDNA molecules. In general, mito-
chondrial DNA has a higher mutation rate than its nuclear counterpart. This
is because the transfer of high-energy electrons along the electron transport
chain can produce reactive oxygen species as byproducts, which can oxidize
and cause mutations in mtDNA. To make this situation even worse, the DNA
repair capability in mitochondria is rather limited. Increased heteroplasmy
has been associated with higher risk of developing aging-related diseases,
including Alzheimer’s disease, heart disease, and Parkinson’s disease [6].
Furthermore, mitochondrial DNA mutations have been known to underlie
aging and cancer development [7]. Certain hereditary mtDNA mutations
also underlie maternally inherited diseases that mostly affect the nervous
system and muscle, both of which are characterized by high energy demand.
1.4.10 Chloroplast
In animal cells, the mitochondrion is the only organelle that contains an
extranuclear genome. Plant and algae cells have another extranuclear genome
besides the mitochondrion, the plastid genome. Plastid is an organelle that can
differentiate into various forms, the most prominent of which is the chloroplast.
The chloroplast carries out photosynthesis, capturing energy from sunlight and
using it to fix carbon dioxide into carbohydrates while releasing oxygen in the
process. To capture this energy, the green pigment
called chlorophyll first absorbs energy from sunlight, which is then transferred
through an electron transport chain to build up a proton gradient to drive the
synthesis of ATP. Despite the different energy source, the buildup of a proton
gradient for ATP synthesis in the chloroplast is very similar to that in the
mitochondrion. The chloroplast ATP derived from the captured light energy is
then spent on CO2 fixation. Similar to the mitochondrion, the chloroplast also
has two membranes: a highly permeable outer membrane and a much less per-
meable inner membrane. The photosynthetic electron transport chain, however,
is not located in the inner membrane, but in the membrane of a series of sac-like
structures called thylakoids located in the chloroplast stroma (analogous to the
mitochondrial matrix).
The plastid is believed to have evolved from an endosymbiotic cyanobacterium,
which has gradually lost the majority of the genes in its genome over millions of
years. The current size of most plastid genomes is 120–200 kb, coding for rRNAs,
tRNAs, and proteins. In higher plants there are around 100 genes coding for
various proteins of the photosynthetic system [8]. The transmission of plastid
DNA (ptDNA) from parent to offspring is more complicated than the maternal
transmission of mtDNA usually observed in animals. Based on the transmis-
sion pattern, it can be classified into three types: 1) maternal, inheritance only
through the female parent; 2) paternal, inheritance only through the male parent;
or 3) biparental, inheritance through both parents [9]. Similar to the situation
in the mitochondrion, there exist multiple copies of ptDNA in each plastid, and as
a result there are large numbers of ptDNA molecules in each cell, with potential
heteroplasmy. Transcription from these ptDNA molecules also generates copious amounts
of RNAs in the organelle. Therefore, sequence reads from ptDNA or RNA com-
prise part of the data when sequencing plant and algae DNA or RNA samples,
along with those from mtDNA or RNA.
References
1. Vale RD. The molecular motor toolbox for intracellular transport. Cell 2003,
112(4):467–480.
2. de Duve C. Peroxisomes and related particles in historical perspective. Ann N
Y Acad Sci 1982, 386:1–4.
3. Gabaldon T. Evolution of the peroxisomal proteome. Subcell Biochem 2018,
89:221–233.
4. Das S, Vera M, Gandin V, Singer RH, Tutucci E. Intracellular mRNA transport
and localized translation. Nat Rev Mol Cell Biol 2021, 22(7):483–504.
5. Mayer F. Cytoskeletons in prokaryotes. Cell Biol Int 2003, 27(5):429–438.
6. Chocron ES, Munkacsy E, Pickering AM. Cause or casualty: the role of mito-
chondrial DNA in aging and age-associated disease. Biochim Biophys Acta Mol
Basis Dis 2019, 1865(2):285–297.
7. Smith ALM, Whitehall JC, Greaves LC. Mitochondrial DNA mutations in
ageing and cancer. Mol Oncol 2022, 16(18):3276–3294.
8. de Vries J, Archibald JM. Plastid genomes. Curr Biol 2018, 28(8):R336–R337.
9. Harris SA, Ingram R. Chloroplast DNA and biosystematics: the effects of
intraspecific diversity and plastid transmission. Taxon 1991:393–412.
10. Roy U, Grewal RK, Roy S. Complex Networks and Systems Biology. In:
Systems and Synthetic Biology. Springer; 2015: 129–150.
2
DNA Sequence: The Genome Base
5’ A P C P G P A 3’
3’ T P G P C P T P A P G P C P C 5’
Template
P P + H+ Extension
C P P P
5’ A P C P G P A P T 3’
3’ T P G P C P T P A P G P C P C 5’
Extension T P P P
P P + H+
5’ A P C P G P A P T P C 3’
3’ T P G P C P T P A P G P C P C 5’
P P + H+ Mis-incorporation
5’ A P C P G P A P T P C P T 3’
3’ T P G P C P T P A P G P C P C 5’
Error Correction
P T G P P P
5’ A P C P G P A P T P C 3’
3’ T P G P C P T P A P G P C P C 5’
Extension
P P + H+
5’ A P C P G P A P T P C P G 3’
3’ T P G P C P T P A P G P C P C 5’
FIGURE 2.1
The DNA replication process. To initiate the process, a primer, which is a short DNA sequence
complementary to the start region of the DNA template strand, is needed for DNA polymerase
to attach nucleotides and extend the new strand. The attachment of nucleotides is based on
complementary base-pairing with the template. If an error occurs due to mis-pairing, the DNA
polymerase removes the mis-paired nucleotide using its proofreading function. Due to the
biochemical structure of the DNA molecule, the direction of the new strand elongation is from
its 5’ end to 3’ end (the template strand is in the opposite direction; the naming of the two
ends of each DNA strand as 5’ and 3’ is from the numbering of carbon atoms in the nucleotide
sugar ring).
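As an illustration of the complementary base pairing described in this caption, here is a minimal Python sketch (not from the book) that extends a new strand along a template strand; it models only base pairing and the 5'-to-3' direction, not the primer, the polymerase, or the proofreading chemistry. The sequences are taken from the schematic above.

```python
# A minimal sketch of template-directed DNA synthesis: the new strand is
# built 5'->3' by adding the Watson-Crick complement of each template base.
# Proofreading and primer chemistry are deliberately not modeled here.

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def extend_new_strand(template_3to5: str) -> str:
    """Given a template strand read 3'->5', return the new strand 5'->3'."""
    return "".join(COMPLEMENT[base] for base in template_3to5)

template = "TGCTAGCC"                 # template strand, 3'->5'
print(extend_new_strand(template))    # ACGATCGG, the complementary new strand (5'->3')
```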
Promoter 3’-UTR
DNA (Gene)
RNA
(Primary Transcript)
mRNA
Exon
Protein
Intron
FIGURE 2.2
The central dogma.
2.4.2 Genome Sizes
For the least sophisticated organisms, such as Mycoplasma genitalium, a min-
imal genome is sufficient. For increased organismal complexity, more genetic
information and, therefore, a larger genome is needed. As a result, there
is a positive correlation between organismal complexity and genome size,
especially in prokaryotes. In eukaryotes, however, this correlation becomes
much weaker, largely due to the existence of non-coding DNA elements in
varying amounts in different eukaryotic genomes (for details on non-coding
TABLE 2.1
Genome Sizes and Total Gene Numbers in Major Model Organisms* (Ordered by
Genome Size)
Organism Genome Size (bp) Number of Coding Genes
(17,106 bp) among all currently known exons. In aggregate the total number
of currently known exons in the human genome is around 180,000. With
a combined size of 30 Mb, they constitute 1% of the human genome.
This collection of all exons in the human genome, or in other eukaryotic
genomes, is termed the exome. Different from the transcriptome, which
is composed of all actively transcribed mRNAs in a particular sample, the
exome includes all exons contained in a genome. While it covers only a very
small percentage of the genome, the exome represents the most important
and best annotated part of the genome. Sequencing of the exome has become
a popular alternative to whole-genome sequencing: although it covers less of
the genome, exome sequencing is more cost-effective, faster, and easier to
interpret.
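The figures quoted above can be checked with a couple of lines of arithmetic; in this short sketch the ~3.1 Gb genome size is an assumed round number, while the exon count and combined exon size come from the text.

```python
# Quick arithmetic with the approximate figures quoted above:
# ~180,000 known exons with a combined size of ~30 Mb in a ~3.1 Gb genome.

exon_count = 180_000
exome_bp = 30e6
genome_bp = 3.1e9    # assumed round value for the human genome

print(f"exome fraction of genome: {exome_bp / genome_bp:.1%}")      # ~1.0%
print(f"average exon length:      {exome_bp / exon_count:.0f} bp")  # ~167 bp
```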
FIGURE 2.3
The composition of the human genome: exons (1.5%), introns (20%), regulatory sequences (5%), repetitive sequences related to transposons (44.5%), repetitive sequences not related to transposons (14%), and other unique non-coding DNA sequences (15%).
Transposons (also known as transposable elements, or "jumping genes") are DNA sequences that move from one gen-
omic location to another. Repeat sequence units of this type are usually 100
bp to over 10 kb in length, and may appear in over 1 million loci dispersed
across the genome.
Many highly repetitive DNA sequences exist in inert parts of chromosomes,
such as the centromere and telomere. The centromere, the region where two
sister chromatids are linked together before cell division, contains tandem
repeat sequences. The telomere, existing at the ends of chromosomes, is
also composed of highly repetitive DNA sequences. The telomeric structure
protects chromosomal integrity and thereby maintains genomic stability.
Besides being essential in maintaining the chromosomal structure, repeat
sequences have other functions in the genome, e.g., they play an architec-
tonic role in higher order physical genome structuring [5]. Despite their
abundance and function, because sequences associated with repeat regions
are not unique, they create a major hurdle for assembling a genome de novo
from sequencing reads, or for mapping reads originating from these regions to a
pre-assembled genome.
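The mapping problem that repeats create can be illustrated with a toy Python sketch (not an actual aligner): a read drawn from a repeated element matches the reference at several positions, so its true origin cannot be determined from sequence alone. The sequences below are invented for illustration.

```python
# Toy illustration of repeat-induced mapping ambiguity: a read drawn from a
# repeated element matches the reference at multiple loci. Real aligners
# report such reads as multi-mapping or assign them low mapping quality.

def find_all(reference: str, read: str):
    """Return every start position where `read` matches `reference` exactly."""
    hits, start = [], reference.find(read)
    while start != -1:
        hits.append(start)
        start = reference.find(read, start + 1)
    return hits

repeat = "ACGTACGTTTGA"   # a made-up repeat unit
reference = "CCTA" + repeat + "GGGC" + repeat + "TTAC" + repeat + "AA"
read = "ACGTACGTTT"       # a short read drawn from inside the repeat

print(find_all(reference, read))   # [4, 20, 36]: three equally good positions
```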
2.5.2 Sequence Access
Since different DNA sequences in the genome are constantly being
transcribed, instead of being permanently locked into the compacted
form, DNA sequences at specific loci need to be dynamically exposed to
allow transcriptional access to protein factors such as transcription factors
and coactivators. Furthermore, DNA replication and repair also require
chromatin unpackaging. This unpackaging of the chromatin structure is
carried out through two principal mechanisms. One is through histone
modification, such as acetylation of lysine residues on histones by histone
acetyltransferases, which reduces the positive charge on histones and there-
fore decreases the electrostatic interactions between histones and DNA.
Deacetylation by histone deacetylases, on the other hand, restricts DNA
access and represses transcription. The other unpackaging mechanism is
through the actions of chromatin remodeling complexes. These large pro-
tein complexes consume ATP and use the released energy to expose DNA
2.5.3 DNA-Protein Interactions
While DNA is the carrier of the code of life, the DNA code cannot be executed
without DNA-interacting proteins. Nearly all of the processes mentioned
above, including DNA packaging/unpackaging, transcription, repair,
and replication, rely on such proteins. Besides histones, examples of these
proteins include transcription factors, RNA polymerases, DNA polymerases,
and nucleases (for DNA degradation). Many of these proteins, such as
histones and DNA/RNA polymerases, interact with DNA regardless of its
sequence or structure. Some DNA-interacting proteins bind to DNA of spe-
cial structure/conformation, e.g., high-mobility group (HMG) proteins that
have high affinity for bent or distorted DNA. Some other DNA-interacting
proteins bind only to regions of the genome that have certain characteristics
such as the presence of damage; examples include DNA repair proteins such
as BRCA1, BRCA2, RAD51, RAD52, and TDG.
The most widely studied DNA-interacting proteins are transcription
factors, which bind to specific DNA sequences. Through binding to their
specific recognition sequences in the genome, transcription factors regu-
late transcription of gene targets that contain such sequences in their pro-
moter regions. Since they bind to more than one gene location in the genome,
transcription factors regulate the transcription of a multitude of genes in
a coordinated fashion, usually as a response to certain internal or external
environmental change. For instance, NRF2 is a transcription factor that is
activated in response to oxidative stress. Upon activation, it binds to a short
segment of specific DNA sequence called the antioxidant response element
(ARE) located in the promoter region of those genes that are responsive to
oxidative stress. Through binding to this sequence element in many regions
of the genome, NRF2 regulates the transcription of its target genes and
thereby elicits coordinated responses to counteract the damaging effects of
oxidative stress.
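A minimal sketch of the kind of computational prediction mentioned above is to scan a sequence for matches to a consensus motif. The consensus string and promoter sequence used below are hypothetical placeholders, not the actual ARE or any real promoter, and, as the text stresses, such predictions still require experimental confirmation.

```python
# A toy motif scan: find matches to a consensus written with IUPAC "N"
# wildcards. The motif and promoter below are made-up placeholders, not the
# real ARE consensus or a real NRF2 target promoter.

import re

def scan_motif(sequence: str, consensus: str):
    """Yield (position, site) for every match of the consensus (N = any base)."""
    pattern = consensus.replace("N", "[ACGT]")
    for m in re.finditer(pattern, sequence):
        yield m.start(), m.group()

promoter = "GGCTGACTCAGCATTTGACGGAGCAGA"   # hypothetical promoter sequence
for pos, site in scan_motif(promoter, "TGACNNNGC"):
    print(pos, site)   # two candidate sites; binding must be confirmed in the lab
```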
Study of DNA-protein interactions provides insights into how the
genome responds to various conditions. For example, determination of
transcription factor binding sites, such as those of NRF2, across the genome
can unravel what genes might be responsive to the conditions that activate
the transcription factors. While such sites can be predicted computationally,
only wet-lab experiments can determine where a transcription factor actu-
ally binds in the genome under a certain condition. ChIP-seq, or chromatin
immunoprecipitation coupled with sequencing, is an NGS application
developed to study the genomic binding of transcription factors and
other DNA-interacting proteins. Chapter 13 will focus on ChIP-seq data
analysis.
genome, SNPs are often used as flagging markers to cover the entire genome
in high resolution when scanning for genomic region(s) that are associated
with a phenotype or disease of interest.
Besides single nucleotide substitutions, indels are another common type
of mutation. Most indels involve small numbers of nucleotides. In protein-
coding regions, small indels lead to a shift of the ORF (unless the number of
nucleotides involved is a multiple of three), resulting in the formation of a
vastly different protein product. Indels that involve large regions lead to
alterations of genomic structure and are usually considered as a form of SV.
Besides large indels, SVs, defined as changes encompassing at least 50 bp [7],
also include inversions, translocations, or duplications. Copy number vari-
ation (CNV) is a subcategory of SV, usually caused by large indels or seg-
mental duplications. Although they affect larger genomic region(s) and some
lead to observable phenotypic changes or diseases, many CNVs, or SVs in
general, have no detectable effects. The frequency of SVs in the genome was
underestimated previously due to technological limitations. The emergence
of NGS has greatly enabled SV detection, which has led to the realization of
their wide existence [8].
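The size-based distinctions described in this section can be summarized in a small illustrative classifier; the 50 bp structural-variant cutoff follows the definition cited above [7], while the function itself is only a sketch, not a production variant caller.

```python
# Rough, illustrative classification of variants by REF/ALT allele size.
# The >= 50 bp cutoff for structural variants follows the definition cited
# in the text [7].

def classify_variant(ref: str, alt: str) -> str:
    """Classify a variant from its REF and ALT alleles (VCF-style strings)."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) == len(alt):
        return "multi-nucleotide substitution"
    size_change = abs(len(alt) - len(ref))
    if size_change >= 50:
        return "SV (large indel)"
    # Frame effects apply only when the indel falls in a protein-coding region.
    if size_change % 3 == 0:
        return "small in-frame indel"
    return "small frameshift indel"

print(classify_variant("A", "G"))              # SNV
print(classify_variant("ATG", "A"))            # small frameshift indel (2 bp lost)
print(classify_variant("A", "A" + "T" * 60))   # SV (large indel)
```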
2.7 Genome Evolution
The spontaneous mutations that lead to sequence variation and poly-
morphism in a population are also the fundamental force behind the evo-
lution of genomes and eventually the Darwinian evolution of the host
organisms. Over billions of years, gradual sequence change and diversification
of early genomes have given rise to the extremely large number of genomes,
of varying complexity, that have existed in the past or are functioning today. In
this process, existing DNA sequences are constantly modified, duplicated,
and reshuffled. Most mutations in protein-coding or regulatory sequences
disrupt the protein’s normal function or alter its amount in cells, causing cel-
lular dysfunction and affecting organismal survival. Under rare conditions,
however, a mutation can improve existing protein function or lead to the
emergence of new functions. If such a mutation offers its host a competitive
advantage, it is more likely to be selected and passed on to future generations.
Gene duplication provides another major mechanism for genome evolu-
tion. If a genomic region containing one or multiple genes is duplicated,
resulting in the formation of an SV, the duplicated region is not under selec-
tion pressure and therefore becomes a substrate for sequence divergence and
new gene formation. Although there are other ways of adding new genetic
information to a genome such as inter-species gene transfer, DNA duplica-
tion is believed to be a major source of new genetic information generation.
Gene duplication often leads to the formation of gene families. Genes in the
same family are homologous, but each member has their specific function and
expression pattern. As an example, in the human genome there are 339 genes
in the olfactory receptor gene family. Odor perception starts with the binding
of odorant molecules to olfactory receptors located on olfactory neurons
inside the nose epithelium. To detect different odorants, combination of
different olfactory receptors that are coded by genes in this family is required.
Based on their sequence homology, members of this large family can be even
further grouped into different subfamilies [9]. The existence of pseudogenes
in the genome is another result of gene duplication. After duplication, some
genes may lose their function and become inactive from additional mutation.
Pseudogenes may also be formed in the absence of duplication by the disab-
ling of a functional gene through mutation. A pseudogene called GULO, mapped
to human chromosome 8p21, provides such an example. The functional
GULO gene in other organisms codes for an enzyme that catalyzes the last
step of ascorbic acid (vitamin C) biosynthesis. This gene has been inactivated in
primates, including humans, and has become a pseudogene. As a result, we have
to obtain this essential vitamin from food. The inactivation of this gene is pos-
sibly due to the insertion into the gene's coding sequence of a retrotransposon-
type repetitive sequence called the Alu element [10].
DNA recombination, or reshuffling of DNA sequences, also plays an
important role in genome evolution. Although it does not create new genetic
information, by breaking existing DNA sequences and re-joining them DNA
recombination changes the linkage relationships between different genes
and other important regulatory sequences. Without recombination, once a
harmful mutation is formed in a gene, the mutated gene will be permanently
linked to other nearby functional genes, making it impossible to regroup all the
functional genes back together into the same DNA molecule. Through this
regrouping, DNA recombination makes it possible to avoid gradual accumu-
lation of harmful gene mutations. Most DNA recombination events happen
during meiosis in the formation of gametes (sperm or eggs) as part of sexual
reproduction.
AD (Alzheimer's disease) involves a large number of genes [15]. In this type of complex disease,
the contribution of each gene is modest, and it is the combined effects of
mutations in these genes that predispose an individual to these diseases.
Besides genetic factors, lifestyle and environmental factors often also play
a role. For example, history of head trauma, lack of mentally stimulating
activities, and high cholesterol levels are all risk factors for developing AD.
Because of the number of genes involved and their interactions with non-
genetic factors, complex multi-gene diseases are more challenging to study
than single-gene diseases.
2.9.4 Epigenomic/Epigenetic Diseases
Besides gene mutations and genome instability, abnormal epigenomic/epi-
genetic pattern can also lead to diseases. Examples of diseases in this cat-
egory include fragile X syndrome, ICF (immunodeficiency, centromeric
instability and facial anomalies) syndrome, Rett syndrome, and Rubinstein-
Taybi syndrome. In ICF syndrome, for example, the gene DNMT3B is
References
1. Fraser CM, Gocayne JD, White O, Adams MD, Clayton RA, Fleischmann RD,
Bult CJ, Kerlavage AR, Sutton G, Kelley JM et al. The minimal gene comple-
ment of Mycoplasma genitalium. Science 1995, 270(5235):397–403.
2. Hutchison CA, 3rd, Chuang RY, Noskov VN, Assad-Garcia N, Deerinck TJ,
Ellisman MH, Gill J, Kannan K, Karas BJ, Ma L et al. Design and synthesis of a
minimal bacterial genome. Science 2016, 351(6280):aad6253.
3. Bennett GM, Moran NA. Small, smaller, smallest: the origins and evolution
of ancient dual symbioses in a Phloem-feeding insect. Genome Biol Evol 2013,
5(9):1675–1688.
4. Pellicer J, Fay MF, Leitch IJ. The largest eukaryotic genome of them all? Bot J
Linn Soc 2010, 164(1):10–15.
5. Shapiro JA, von Sternberg R. Why repetitive DNA is essential to genome
function. Biol Rev Camb Philos Soc 2005, 80(2):227–250.
6. Roach JC, Glusman G, Smit AF, Huff CD, Hubley R, Shannon PT, Rowen
L, Pant KP, Goodman N, Bamshad M et al. Analysis of genetic inher-
itance in a family quartet by whole-genome sequencing. Science 2010,
328(5978):636–639.
7. Mahmoud M, Gobet N, Cruz-Davalos DI, Mounier N, Dessimoz C, Sedlazeck
FJ. Structural variant calling: the long and the short of it. Genome Biol 2019,
20(1):246.
8. Ebert P, Audano PA, Zhu Q, Rodriguez-Martin B, Porubsky D, Bonder MJ,
Sulovari A, Ebler J, Zhou W, Serra Mari R et al. Haplotype-resolved diverse
human genomes and integrated analysis of structural variation. Science 2021,
372(6537):eabf7117.
9. Malnic B, Godfrey PA, Buck LB. The human olfactory receptor gene family.
Proc Natl Acad Sci U S A 2004, 101(8):2584–2589.
10. Inai Y, Ohta Y, Nishikimi M. The whole structure of the human nonfunctional
L-gulono-gamma-lactone oxidase gene—the gene responsible for scurvy—
and the evolution of repetitive sequences thereon. J Nutr Sci Vitaminol 2003,
49(5):315–319.
11. Law JA, Jacobsen SE. Establishing, maintaining and modifying DNA methyla-
tion patterns in plants and animals. Nat Rev Genet 2010, 11(3):204–220.
12. Cedar H, Bergman Y. Linking DNA methylation and histone modifica-
tion: patterns and paradigms. Nat Rev Genet 2009, 10(5):295–304.
13. Guo W, Chung WY, Qian M, Pellegrini M, Zhang MQ. Characterizing the
strand-specific distribution of non-CpG methylation in human pluripotent
cells. Nucleic Acids Res 2014, 42(5):3009–3016.
14. Wu H, Zhang Y. Reversing DNA methylation: mechanisms, genomics, and
biological functions. Cell 2014, 156(1–2):45–68.
15. Shademan B, Biray Avci C, Nikanfar M, Nourazarian A. Application of next-
generation sequencing in neurodegenerative diseases: opportunities and
challenges. Neuromolecular Med 2021, 23(2):225–235.
16. Nishiyama A, Nakanishi M. Navigating the DNA methylation landscape of
cancer. Trends Genet 2021, 37(11):1012–1027.
17. Pappalardo XG, Barra V. Losing DNA methylation at repetitive elements and
breaking bad. Epigenetics Chromatin 2021, 14(1):25.
3
RNA: The Transcribed Sequence
FIGURE 3.1
How the two strands of the DNA template match the transcribed mRNA in sequence, and how the
genetic code in the mRNA corresponds to the amino acid sequence of the peptide.
3.3.1 DNA Template
To initiate transcription, a gene’s DNA sequence is first exposed through
altering its packing state. In order to transcribe the DNA sequence, the two
DNA strands in the region are first unwound and only one strand is used as
the template strand for transcription. Since it is complementary to the RNA
transcript in base pairing (A, C, G, and T in the DNA template are transcribed
to U, G, C, and A, respectively, in the RNA transcript), this DNA template
strand is also called the antisense or negative (–) strand (Figure 3.1). The
other DNA strand has the same sequence as the mRNA (except with T’s in
DNA being replaced with U’s in RNA) and is called the coding strand, sense,
or positive (+) strand. It should be noted that either strand of the genomic
DNA can be potentially used as the template, and which strand is used as the
template for a gene depends on the orientation of the gene along the DNA. It
should also be noted that the triplet nucleotide genetic code that determines
how amino acids are assembled into proteins refers to the triplet sequence in
the mRNA.
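A short Python sketch (an illustration, not material from the book) of the strand relationships just described: transcribing the template strand produces an mRNA that matches the coding strand except for U in place of T. The example sequence is arbitrary.

```python
# Strand relationships: the mRNA transcribed from the template (antisense)
# strand has the same sequence as the coding (sense) strand, with U for T.

DNA_TO_RNA = {"A": "U", "C": "G", "G": "C", "T": "A"}

def transcribe(template_3to5: str) -> str:
    """Return the mRNA (5'->3') transcribed from a template strand read 3'->5'."""
    return "".join(DNA_TO_RNA[b] for b in template_3to5)

template = "TACGGTCAT"                                       # template strand, 3'->5'
coding = template.translate(str.maketrans("ACGT", "TGCA"))   # coding strand, 5'->3'
mrna = transcribe(template)

print(coding)   # ATGCCAGTA
print(mrna)     # AUGCCAGUA (identical to the coding strand, with U for T)
```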
the core enzyme. The core RNA polymerase, unlike DNA polymerase, does
not need a primer, but otherwise the enzyme catalyzes the attachment of
nucleotides to the nascent RNA molecule one at a time in the 5’→3’ direction.
At a speed of approximately 30 nucleotides/second, the RNA polymerase
slides through the DNA template carrying the elongating RNA molecule.
Although the attachment of new nucleotides to the elongating RNA is
based on base pairing with the DNA template, the new elongating RNA
does not remain associated with the template DNA via hydrogen bonding.
On the same template multiple copies of RNA transcripts can be simul-
taneously synthesized by multiple RNA polymerases one after another.
During transcript elongation, these polymerases hold onto the tem-
plate tightly and do not dissociate from the template until a stop signal
is transcribed. The stop signal is provided by a segment of palindromic
sequence located at the end of the transcribed sequence. Right after tran-
scription, the inherent self-complementarity in the palindromic sequence
leads to the spontaneous formation of a hairpin structure. An additional stop
signal is provided by a string of four or more uracil residues after the
hairpin structure, which forms weak associations with the complementary
A’s on the DNA template. The hairpin structure pauses further elongation
of the transcript, and the weak associations between the U’s on the RNA
and the A’s on the DNA dissociate the enzyme and the transcript from the
template.
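The two features of this intrinsic stop signal, a self-complementary stretch that can fold into a hairpin followed by a run of U residues, can be expressed as a toy check; the fixed stem and loop lengths below are simplifying assumptions, and real terminator prediction also evaluates stem stability and surrounding context.

```python
# Toy check for the intrinsic stop signal described above: a stem-loop
# (the downstream stem is the reverse complement of the upstream stem)
# followed by a run of four or more U residues.

RC = {"A": "U", "U": "A", "C": "G", "G": "C"}

def reverse_complement(rna: str) -> str:
    return "".join(RC[b] for b in reversed(rna))

def looks_like_terminator(rna_3prime: str, stem: int = 6, loop: int = 4) -> bool:
    """True if the segment starts with a stem-loop hairpin and ends in >= 4 U's."""
    left = rna_3prime[:stem]
    right = rna_3prime[stem + loop : stem + loop + stem]
    hairpin = right == reverse_complement(left)
    u_tail = rna_3prime[stem + loop + stem :].startswith("UUUU")
    return hairpin and u_tail

print(looks_like_terminator("GCCGCCAAAAGGCGGCUUUUUU"))   # True
print(looks_like_terminator("GCAUCGAAAAGGCGGCUUUUUU"))   # False (no hairpin)
```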
Regulation of prokaryotic transcription is conferred by promoters and
protein factors such as repressors and activators. Promoter strength, that
is, the number of transcription events initiated per unit time, varies widely
in different operons. For example, in E. coli, genes in operons with weak
promoters can be transcribed once in 10 minutes, while those with strong
promoters can be transcribed 300 times in the same amount of time. The
strength of an operon’s promoter is based on the host cell’s demand for its
protein products, and dictated by its sequence. Specific protein factors may
also regulate gene transcription. Repressors, the best known among these
factors, prevent RNA polymerase from initiating transcription by binding
to an intervening sequence between the promoter and the TSS called the oper-
ator. Activators exert an opposite effect and induce higher levels of transcrip-
tion. The sigma factor, being the initiation factor of the prokaryotic RNA
polymerase, provides another mechanism for regulation. There are different
forms of this factor in prokaryotic cells, each of which mediates sequence spe-
cific transcription. Differential use of these sigma factors, therefore, provides
another level of transcriptional regulation in prokaryotic cells.
3.3.4 Maturation of mRNA
In prokaryotic cells, there is no post-transcriptional RNA processing, and
transcripts are immediately ready for protein translation after transcription.
In fact, while mRNAs are still being transcribed, ribosomes already bind to
the transcribed portions of the elongating mRNAs synthesizing peptides. In
eukaryotic cells, however, primary transcripts undergo several steps of pro-
cessing in the nucleus to become mature mRNAs. These steps are (1) capping
at the 5’ end, (2) splicing of exons and introns, and (3) addition of a poly-A
tail at the 3’ end.
The first step, adding a methylated guanosine triphosphate cap to the 5’
end of nascent pre-mRNAs, takes place shortly after the initiation of tran-
scription when the RNA chains are still less than 30 nucleotides long. This
step is carried out by adding a guanine group to the 5’ end of the transcripts,
followed by methylation of the group. This cap structure marks the transcripts
for subsequent transport to the cytoplasm, protects them from degradation,
and promotes efficient initiation of protein translation. Once formed, the cap
is bound by a protein complex called cap-binding complex.
The second step, splicing of exons and introns, is the most complicated
among the three steps. As introns are non-coding intervening sequences,
they need to be spliced out while exons are retained to generate mature
mRNAs. The molecular machinery that carries out the splicing, called the
spliceosome, is assembled from as many as 300 proteins and 5 small nuclear
RNAs (snRNAs). The spliceosome identifies and removes introns from pri-
mary transcripts, using three positions within each intron: the 5’ end (starts
with the consensus sequence 5’-GU, serving as the splice donor), the 3’ end
(ends with the consensus sequence AG-3’, as the splice acceptor), and the
branch point, which starts around 30 nucleotides upstream of the splicing
acceptor and contains an AU-rich region. The actual excision of each intron
and the concomitant joining of the two neighboring exons are a three-step
process: (1) cleavage at the 5’ end splice donor site; (2) attachment of the
cleaved splice donor site to the branch point to form a lariat or loop structure;
and (3) cleavage at the 3’ end splicing acceptor site to release the intron and
join the two exons.
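A minimal sketch of the splice-site consensus just described is to scan a pre-mRNA for segments that begin with GU and end with AG. The example sequence below is made up, and, as the overlapping candidates in the output suggest, the consensus alone is far from sufficient; the spliceosome also relies on the branch point and much additional context.

```python
# Toy scan for candidate introns using only the GU...AG boundary consensus.
# The branch point and other contextual signals are deliberately ignored.

def candidate_introns(pre_mrna: str, min_len: int = 6):
    """Yield (start, end) for segments that begin with GU and end with AG."""
    for i in range(len(pre_mrna) - 1):
        if pre_mrna[i : i + 2] != "GU":
            continue
        for j in range(i + min_len, len(pre_mrna) - 1):
            if pre_mrna[j : j + 2] == "AG":
                yield i, j + 2   # end position is exclusive
                break            # report only the nearest acceptor

pre_mrna = "AUGGCAGUAAGUCCUUACAGGCUUAA"   # hypothetical pre-mRNA
for start, end in candidate_introns(pre_mrna):
    print(start, end, pre_mrna[start:end])   # overlapping candidates: ambiguity
```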
Beyond simply removing introns from primary transcripts, the splicing
process also employs differential use of exons, and sometimes even includes
some introns, to create multiple mature mRNA forms from the same primary
transcript. This differential splicing, also called alternative splicing
(Figure 3.2), provides an additional regulatory step in the production of
mRNA populations. When it was first reported in 1980, alternative splicing
was considered to be an exception rather than the norm. It is now well
established that primary transcripts from essentially all multi-exon genes are
alternatively spliced [5, 6]. The biological significance of alternative splicing
is obvious: by enabling production of multiple mRNAs and thereby proteins
from the same gene, it greatly augments protein and consequently functional
diversity in an organism without significantly increasing the number of genes
in the genome, and offers an explanation for why more complex organisms do
not contain many more genes in their genomes (Chapter 2, Table 2.1).
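The combinatorial effect of alternative splicing can be illustrated with a toy enumeration: if a hypothetical gene has exons that can be independently included or skipped ("cassette" exons), the number of possible exon combinations grows as 2 to the power of the number of such exons. Real splicing is tightly regulated and far from free combination, so this is only an upper bound, not a biological prediction.

```python
# Toy enumeration of exon combinations for a hypothetical gene with two
# constitutive exons and two independently skippable cassette exons.

from itertools import product

def isoforms(constitutive, cassette):
    """Enumerate exon combinations with cassette exons included or skipped."""
    for choices in product([True, False], repeat=len(cassette)):
        yield sorted(
            constitutive + [e for e, keep in zip(cassette, choices) if keep]
        )

gene_constitutive = [1, 4]    # always-included exons (hypothetical)
gene_cassette = [2, 3]        # optional cassette exons

for iso in isoforms(gene_constitutive, gene_cassette):
    print(iso)                # 2**2 = 4 possible exon combinations
```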
FIGURE 3.2
Varying forms of RNA transcript splicing, including exon skipping and intron retention.
In the third step, once the new primary transcript passes the termination
signal sequence, it is bound by several termination-related proteins. One of
the proteins cleaves the RNA at a short distance downstream of the termin-
ation signal to generate the 3' end. This is followed by a polyadenylation step,
in which an enzyme called poly-A polymerase adds 50–200 A's to the 3' end.
This poly-A tail, like the 5’ end cap, increases the stability of the resulting
mRNA. This tail is bound and protected by poly(A)-binding protein, which
also promotes its transport to the cytoplasm.
Besides these three major constitutive processing steps, some transcripts
may undergo additional processing steps. RNA editing, although considered
to be rare, is among the best known of these steps. RNA editing refers to
the change in RNA nucleotide sequence after it is transcribed. The most
common types of RNA editing are conversions from A to I (inosine, read as G
during translation), which are catalyzed by enzymes such as ADARs (adeno-
sine deaminases that act on RNA), or from C to U, catalyzed by cytidine
deaminases. As a result of these conversions, an edited RNA transcript no
longer fully matches the sequence on the template DNA. RNA editing has the
potential to change genetic codons, introduce new or remove existing stop
codons, or alter splicing sites [7]. Evidence shows that RNA editing and other
RNA processing events such as splicing can be coordinated [8].
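A small sketch of the editing conversions just described, applied to a hypothetical transcript: an A-to-I edit (inosine read as G) changes a CAG (Gln) codon to CGG (Arg). Positions and sequences below are invented for illustration.

```python
# Apply simple RNA edits (A>I, read as G; C>U) and show the codon change.
# The transcript and edit position are hypothetical.

CODON_EXAMPLES = {"CAG": "Gln", "CGG": "Arg"}  # only the two codons used here

def apply_editing(rna: str, edits: dict) -> str:
    """Apply {position: 'A>I' or 'C>U'} edits; inosine is written as G."""
    bases = list(rna)
    for pos, kind in edits.items():
        if kind == "A>I" and bases[pos] == "A":
            bases[pos] = "G"        # inosine pairs like G during translation
        elif kind == "C>U" and bases[pos] == "C":
            bases[pos] = "U"
    return "".join(bases)

transcript = "AUGCAGUAA"            # Met-Gln-Stop (hypothetical)
edited = apply_editing(transcript, {4: "A>I"})
print(edited)                       # AUGCGGUAA
print(CODON_EXAMPLES[transcript[3:6]], "->", CODON_EXAMPLES[edited[3:6]])  # Gln -> Arg
```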
FIGURE 3.3
The regulation of eukaryotic gene expression at multiple levels: in the nucleus, DNA is transcribed into pre-mRNA and processed into mRNA (1 – transcriptional control); the mRNA is then exported to the cytoplasm, where its translation into protein (4 – translational control) and its inactivation or degradation determine the final protein output.
3.4.1 Ribozyme
Similar to proteins, RNAs can form complicated three- dimensional
structures, and some RNA molecules carry out catalytic functions. These
catalytic RNAs are called ribozymes. A classic example of ribozyme is one
type of intron called group I intron, which splices itself out of the pre-mRNA
that contains it. This self-splicing process, involving two transesterification
steps, is not catalyzed by any protein. Group I intron is about 400 nucleotides
in length and mostly found in organelles, bacteria, and the nucleus of lower
eukaryotes. When a precursor RNA that contains group I intron is incubated
in a test tube, the intron splices itself out of the precursor RNA autonomously.
Despite variations in their internal sequences, all group I introns share a char-
acteristic spatial structure, which provides active sites for catalyzing the two
steps. Another example of ribozyme is the 23S rRNA contained in the large
subunit of the prokaryotic ribosome. This rRNA catalyzes the peptide bond
formation between an incoming amino acid and the existing peptide chain.
Although the large subunit contains over 30 proteins, rRNA is the catalytic
component while the proteins only provide structural support and stabiliza-
tion [18].
Also similar to protein catalysts, the kinetics of the reactions catalyzed
by ribozymes follow the same pattern as those of protein-enzyme-catalyzed
reactions, which are usually described by the Michaelis–Menten equation. A
further similarity of ribozymes to protein enzymes is that ribozyme activity
can also be regulated by ligands, usually small molecules, whose binding
leads to structural change in the ribozyme. For instance,
a ribozyme may contain a riboswitch, which as part of the ribozyme can bind
to a ligand to turn on or off the ribozyme activity.
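For reference, the Michaelis–Menten relationship mentioned above is v = Vmax[S]/(Km + [S]); the short sketch below evaluates it at a few substrate concentrations, with arbitrary parameter values, to show the saturation behavior that applies to ribozymes as well as protein enzymes.

```python
# Michaelis-Menten rate, v = Vmax*[S]/(Km + [S]), evaluated with arbitrary
# parameters to illustrate saturation toward Vmax at high [S].

def michaelis_menten(substrate_conc: float, vmax: float, km: float) -> float:
    """Initial reaction rate at a given substrate concentration."""
    return vmax * substrate_conc / (km + substrate_conc)

vmax, km = 10.0, 2.0   # arbitrary units
for s in (0.5, 2.0, 20.0, 200.0):
    print(f"[S] = {s:6.1f}  ->  v = {michaelis_menten(s, vmax, km):5.2f}")
# v equals Vmax/2 at [S] = Km and approaches Vmax as [S] rises far above Km.
```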
3.4.4.1 miRNA
Mature miRNA, at around 22 nucleotides in size, induces gene silencing
through mRNA translational repression or decay. The precursor of miRNA is
usually transcribed from non-protein-coding genes in the genome (Figure 3.4).
The primary transcript, called pri-miRNA, contains internal hairpin structure
and is much longer than mature miRNA. For initial processing, the pri-miRNA
is first trimmed in the nucleus by a ribonuclease called Drosha that exists
as part of a protein complex called the microprocessor, to an intermediate
molecule called pre-miRNA, about 70 nucleotides in size. Alternatively, some
miRNA precursors originate from introns spliced out from protein-coding
transcripts. These precursors, to be processed for the generation of mirtrons
(miRNAs derived from introns), bypass the microprocessor complex in the
nucleus. For further processing, the pre-miRNA and the mirtron precursor
are exported out of the nucleus into the cytoplasm, where they are cleaved
by the endoribonuclease Dicer to form double-stranded miRNA. The double-
stranded miRNA is subsequently loaded into RISC. Argonaute, the core
protein component of RISC, unwinds the two miRNA strands and discards
one of them [22]. The remaining strand is used by Argonaute as the guide
sequence to identify related mRNA targets through imperfect base pairing
FIGURE 3.4
The generation and functioning of miRNA and siRNA in suppressing target mRNA activity.
Genomic regions that code for miRNAs are first transcribed into pri-miRNAs, which are processed
into smaller pre-miRNAs in the nucleus by Drosha. The pre-miRNAs are then transported by
exportin 5 into the cytoplasm, where they are further reduced to miRNA:miRNA* duplex by Dicer.
While both strands of the duplex can be functional, only one strand is assembled into the RNA-
induced silencing complex (RISC), which induces translational repression or cleavage of target
mRNAs. Long double-stranded RNA can also be processed by Dicer to generate siRNA duplex,
which also uses RISC to break down target mRNA molecules. (Adapted by permission from
Macmillan Publishers Ltd: Nature Reviews Genetics, He L. and Hannon G.J. (2004) MicroRNAs: small
RNAs with a big role in gene regulation. Nature Reviews Genetics 5, 522–531, ©2004.)
between its seed sequence and target sites usually located in the 3'-UTR of mRNAs. Through this
miRNA-mRNA interaction, RISC induces silencing of target genes through
repressing translation of the mRNAs and/or their deadenylation and degrad-
ation. Because the base pairing is imperfect, one miRNA can target multiple
3.4.4.2 siRNA
While being similar in size and using basically the same system for gene
silencing (Figure 3.4), siRNA differs from miRNA in a number of aspects. On
origin, siRNA is usually exogenously introduced, such as from viral inva-
sion or artificial injection. But they can also be generated endogenously, e.g.,
from repeat-sequence-generated transcripts (such as those from telomeres
or transposons), or RNAs synthesized from convergent transcription (in
which both strands of a DNA sequence are transcribed from the two opposite
orientations with corresponding promoters), or other naturally occurring
sense-antisense transcript pairs [23]. To generate mature siRNA, exogenously
introduced double-stranded RNA, or endogenously transcribed precursor
that is transported from the nucleus to the cytoplasm, is cleaved by Dicer.
The mature siRNA is then loaded into RISC for silencing target mRNAs by
Argonaute. On target mRNA identification, siRNA differs from miRNA in that
it has perfect or nearly perfect sequence complementarity with its target.
On the mechanism of gene silencing, siRNA usually leads to endonucleolytic
cleavage, also called slicing, of the mRNAs.
3.4.4.3 piRNA
As a relatively newer class of small non-coding RNA, piRNAs are between
23 and 31 nucleotides in length, and function mostly in animal
germline tissues as a defense mechanism against transposons (or transpos-
able elements, “selfish” DNA elements that have the capability to move
around in the genome). While functioning with a similar basic RNAi mech-
anism, piRNA is different from miRNA and siRNA in two major aspects.
One is that its biogenesis does not involve Dicer, and the other is that, for
target gene silencing, it specifically interacts with PIWI proteins, a different
clade in the Argonaute protein family. The biogenesis of piRNA, which is independent
of Dicer activity, starts with transcription of long RNAs from specific loci of the
genome called piRNA clusters. With regard to these clusters, it has been found
that while their genomic locations do not change much between related
species, their sequences are not conserved even in closely related species,
indicating that they are derived from invading transposable elements and serve
as an adaptive genome immunity mechanism. After transcription, the long
piRNA precursor is transported out of the nucleus to the cytoplasm for pro-
cessing into mature piRNA. To induce target gene silencing, mature piRNA
is loaded into RISC that contains PIWI, which uses the piRNA sequence as
guide to silence target mRNAs by slicing. Besides this post-transcriptional
silencing, piRNA-loaded mature RISC can also be transported into the
nucleus, where it finds and silences target mRNAs that are still in the process of being transcribed.
Unlike the linear RNA species introduced so far, circRNAs have their 5’ and 3’ ends joined,
forming a loop structure. This structure makes them less vulnerable to attack by
RNases and, as expected, more stable. Because their widespread existence was
only unveiled with the use of RNA-seq, the functions of most circRNAs are
still being investigated. Among currently established functions are the roles
they play in sequestering miRNA and RNA-binding proteins from their
targets, and regulating transcription, splicing, and translation events [33].
Besides the major non-coding RNAs introduced in this chapter, there are also
other classes of non-coding RNA species in cells that perform a remarkable
array of functions [34]. It is highly possible that new classes of non-coding
RNAs will continue to be discovered through RNA sequencing.
References
1. Bedard AV, Hien EDM, Lafontaine DA. Riboswitch regulation
mechanisms: RNA, metabolites and regulatory proteins. Biochim Biophys Acta
Gene Regul Mech 2020, 1863(3):194501.
22. Kawamata T, Tomari Y. Making RISC. Trends Biochem Sci 2010, 35(7):368–376.
23. Carthew RW, Sontheimer EJ. Origins and mechanisms of miRNAs and
siRNAs. Cell 2009, 136(4):642–655.
24. Liu X, Hao L, Li D, Zhu L, Hu S. Long non-coding RNAs and their biological
roles in plants. Genomics Proteomics Bioinformatics 2015, 13(3):137–147.
25. Derrien T, Johnson R, Bussotti G, Tanzer A, Djebali S, Tilgner H, Guernec G,
Martin D, Merkel A, Knowles DG et al. The GENCODE v7 catalog of human
long noncoding RNAs: analysis of their gene structure, evolution, and expres-
sion. Genome Res 2012, 22(9):1775–1789.
26. Gupta RA, Shah N, Wang KC, Kim J, Horlings HM, Wong DJ, Tsai MC, Hung
T, Argani P, Rinn JL et al. Long non-coding RNA HOTAIR reprograms chro-
matin state to promote cancer metastasis. Nature 2010, 464(7291):1071–1076.
27. Zhao J, Sun BK, Erwin JA, Song JJ, Lee JT. Polycomb proteins targeted by a short
repeat RNA to the mouse X chromosome. Science 2008, 322(5902):750–756.
28. Li W, Notani D, Ma Q, Tanasa B, Nunez E, Chen AY, Merkurjev D, Zhang
J, Ohgi K, Song X et al. Functional roles of enhancer RNAs for oestrogen-
dependent transcriptional activation. Nature 2013, 498(7455):516–520.
29. Yoon JH, Abdelmohsen K, Srikantan S, Yang X, Martindale JL, De S, Huarte
M, Zhan M, Becker KG, Gorospe M. LincRNA-p21 suppresses target mRNA
translation. Mol Cell 2012, 47(4):648–655.
30. Gong C, Maquat LE. lncRNAs transactivate STAU1-mediated mRNA decay
by duplexing with 3’ UTRs via Alu elements. Nature 2011, 470(7333):284–288.
31. Yarmishyn AA, Kurochkin IV. Long noncoding RNAs: a potential novel class
of cancer biomarkers. Front Genet 2015, 6:145.
32. Ni YQ, Xu H, Liu YS. Roles of long non-coding RNAs in the development
of aging-related neurodegenerative diseases. Front Mol Neurosci 2022,
15:844193.
33. Nisar S, Bhat AA, Singh M, Karedath T, Rizwan A, Hashem S, Bagga P, Reddy
R, Jamal F, Uddin S et al. Insights into the role of CircRNAs: biogenesis,
characterization, functional, and clinical impact in human malignancies. Front
Cell Dev Biol 2021, 9:617281.
34. Cech TR, Steitz JA. The noncoding RNA revolution—trashing old rules to
forge new ones. Cell 2014, 157(1):77–94.
35. Carninci P, Kasukawa T, Katayama S, Gough J, Frith MC, Maeda N, Oyama R,
Ravasi T, Lenhard B, Wells C et al. The transcriptional landscape of the mam-
malian genome. Science 2005, 309(5740):1559–1563.
36. Djebali S, Davis CA, Merkel A, Dobin A, Lassmann T, Mortazavi A, Tanzer
A, Lagarde J, Lin W, Schlesinger F et al. Landscape of transcription in human
cells. Nature 2012, 489(7414):101–108.
Part II
Introduction to
Next-Generation
Sequencing (NGS) and
NGS Data Analysis
4
Next-Generation Sequencing (NGS)
Technologies: Ins and Outs
FIGURE 4.1
The Sanger sequencing method as originally proposed. This method involves a step for new DNA
strand synthesis using the sequencing target DNA as template, followed by sequence deduction
through resolution of the newly synthesized DNA strands. In the first step (A), the new strand
synthesis reaction mixture contains denatured DNA template, primer, DNA polymerase, and
dNTPs. Besides the dNTPs, the Sanger method is characterized by the use of dideoxynucleotides
(ddG, ddA, ddT, and ddC; the inset illustrates the structural difference between ddATP and
dATP) that are labeled with different fluorochromes. The DNA polymerase in the reaction
mixture incorporates dideoxynucleotides into the elongating DNA strand along with regular
nucleotides, but once a dideoxynucleotide is incorporated, the strand elongation terminates.
In this sequencing scheme, the ratio of these dideoxynucleotides to their regular counterparts
is controlled so that the polymerization can randomly terminate at each base position. The
end product is a population of DNA fragments with different lengths, with the length of each
fragment dependent on where the dideoxynucleotide is incorporated. These fragments are then
separated using capillary electrophoresis, in which smaller fragments migrate faster than larger
ones and as a result pass through the laser detector sooner. The fluorochrome labels they carry
enable computational deduction of the specific sequence of the original DNA. (Image adapted
from https://commons.wikimedia.org/w/index.php?curid=23264166 by Estevezj. Used under
the Creative Commons Attribution-Share Alike 3.0 Unported (CC BY-SA 3.0) license (https://
creativecommons.org/licenses/by-sa/3.0/deed.en).)
sequencing cost, largely due to the segregation of its DNA synthesis pro-
cess and the subsequent DNA chain separation/detection process. Its principle
of sequencing-by-synthesis, however, became the basis of many NGS
technologies, including Illumina’s reversible terminator sequencing, Pacific
Biosciences’ single-molecule real-time (SMRT) sequencing, ThermoFisher’s
Ion Torrent semiconductor sequencing, and the discontinued 454/Roche’s
pyrosequencing. Different from the first-generation method, these technologies
use nucleotides with reversible terminators or other cleavable chemical
modifications, or regular unmodified nucleotides, so that new DNA strand
synthesis is not permanently terminated and can therefore be monitored as
or after each base is incorporated.
Not all NGS technologies are based on the principle of sequencing-by-
synthesis. For example, Oxford Nanopore sequencing and the discontinued
SOLiD sequencing from Life Technologies use nanopore sensing and
sequencing-by-ligation, respectively. Despite the differences in how different
NGS technologies work in principle, there is one common denominator
among them that separate them from first-generation sequencing, which is
their massive data throughput by sequencing millions to billions of DNA
molecules simultaneously. Besides the ingenuity in the development of new
sequencing chemistries or detection schemes to be detailed next, the success
of NGS technologies in achieving extremely high throughput is also due to
modern engineering and computing feats. Advancements in microfluidics
and microfabrication make signal detection from micro-volumes of sequencing
reactions possible. Developments in modern optics and imaging technology
enable tracking of sequencing reactions at high resolution, high
fidelity, and high speed. Some NGS platforms also rely on the decades of
progress in the semiconductor industry or the more recent but rapid development
of nanopore technology (the Ion Torrent and ONT
platforms, respectively). High-performance computing makes it possible
to process and deconvolve the torrent of signals recorded from millions of
these parallel reactions.
As different NGS technologies employ different mechanisms and implementation
strategies, the next sections detail the specifics of some of the most
widely adopted NGS platforms at the time of writing (early 2022). As
NGS technologies continue to evolve, new platforms will appear while some
current technologies become obsolete. While an overview of NGS platforms
tends to become outdated fairly soon, the guiding principles for the analysis
of NGS data introduced in this book will remain.
4.2.1.2 Implementation
The sequencing reaction in an Illumina NGS system takes place in a flow
cell (Figure 4.2). The fluidic channels in the flow cell, often called lanes, are
FIGURE 4.2
An Illumina sequencing flow cell. It is a special glass slide that contains fluidic channels inside
(called lanes). Sequencing libraries are loaded into the lanes for massively parallel sequencing
after template immobilization and cluster generation. In each step of the sequencing process,
DNA synthesis mixture, including DNA polymerase and modified dNTPs, is pumped into and
out of each of the lanes through their inlet and outlet ports located at the two ends.
where sequencing reactions take place and sequencing signals are collected
through scanning. The top and bottom surfaces of each lane are covered with
a lawn of oligonucleotide sequences that are complementary to the anchor
sequences in Illumina adapters. When sequencing libraries, prepared from
DNA through fragmentation and adapter ligation, are loaded into each of the
lanes, DNA templates in the libraries bind to these oligonucleotide sequences
and become immobilized onto the lane surface (Figure 4.3). After immobil-
ization, each template molecule is clonally amplified through an isothermal
process called “bridge amplification,” through which up to 1,000 identical
copies of the template are generated in close proximity (<1 micron in diam-
eter), forming a cluster. During sequencing, these clusters are the basic detection
units, each generating enough signal intensity for basecalling.
FIGURE 4.3
Illumina sequencing process overview. (Used under license from Illumina, Inc. All Rights
Reserved.)
becomes worse. This is why platforms that are based on clonal amplification
(which also include the Ion Torrent platform to be detailed later) have
declining basecall quality scores toward the end. Eventually the decrease in
basecall quality reaches a threshold beyond which the quality scores become
simply unacceptable. The gradual loss of synchronicity is a major deter-
minant of read length for these platforms.
FASTQ files, which contain raw reads. Since multiple samples are typically
sequenced together in a multiplex fashion, demultiplexing of the sequence
data is also performed in the third step. This is typically done using
Illumina’s bcl2fastq tool, but other tools such as IlluminaBasecallsToFastq [2]
can also be used. The demultiplexed FASTQ files in a compressed format are
what an end user typically receives from an NGS facility after the completion
of a run.
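Demultiplexing is normally handled by bcl2fastq itself, but the underlying idea can be illustrated with a minimal Python sketch that reads a combined FASTQ file and splits reads by the index sequence carried in each read header; the barcode-to-sample mapping and file names below are hypothetical, and real pipelines additionally tolerate index mismatches and dual indices.

```python
import gzip
from contextlib import ExitStack

# Hypothetical barcode-to-sample mapping; real runs use a sample sheet.
BARCODES = {"CGAGGCTGCTCTCTAT": "sampleA", "TAAGGCGAGCGTAAGA": "sampleB"}

def demultiplex(fastq_gz):
    """Split reads in a combined FASTQ file by the index sequence that
    is appended to the end of each Illumina read header."""
    with ExitStack() as stack:
        fin = stack.enter_context(gzip.open(fastq_gz, "rt"))
        outs = {bc: stack.enter_context(open(f"{name}.fastq", "w"))
                for bc, name in BARCODES.items()}
        while True:
            record = [fin.readline() for _ in range(4)]  # @header, seq, +, quals
            if not record[0]:
                break
            barcode = record[0].strip().split(":")[-1]   # index is the last field
            if barcode in outs:
                outs[barcode].writelines(record)

demultiplex("undetermined_reads.fastq.gz")  # hypothetical input file
```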
4.2.2.2 Implementation
At the core of PacBio sequencing is the SMRT cell, which carries millions of
wells, technically called zero-mode waveguides (or ZMWs), for simultaneous
sequencing of millions of DNA templates (the current version, as of early
2022, has 8 million ZMWs). ZMWs are essentially holes tens of nanometers
in diameter microfabricated in a metal film of 100 nm thickness, which is
in turn deposited onto a glass substrate. Because the diameter of a ZMW is
smaller than the wavelength of visible light, light entering from the glass
bottom cannot propagate through such a small opening, and only
the bottom 30 nm of the ZMW is illuminated. With a detection volume
of only 20 zeptoliters (10−21 L), this detection scheme greatly reduces back-
ground noise and enables detection of the light of different wavelengths
emitted from nucleotide incorporation into a new DNA strand.
While the SMRT platform performs single-molecule sequencing, the
standard library prep protocol still requires DNA samples at the µg level
to start (for lower DNA input, amplification is needed). The library prep
process includes fragmentation of DNA to the desired length, end repair/A-
tailing, and ligation of a hairpin loop adapter. This leads to the formation
of a circular structure called SMRTbell (Figure 4.4). To prepare for sequen-
cing, a SMRTbell template is annealed to a sequencing primer, and a DNA
polymerase enzyme molecule is subsequently bound to the template/primer
structure. The template-primer-polymerase complex is then immobilized to
the bottom of a SMRT cell prior to sequencing.
The currently available PacBio SMRT sequencers (Sequel II/IIe) have two
sequencing modes called continuous long reads (CLR) and circular con-
sensus sequencing (CCS). With CLR, the DNA polymerase continues to
advance along a template until it stops, thereby producing long reads in one
pass. With CCS, the DNA polymerase goes through the SMRTbell structure
multiple times and traverses both strands of the template in order to generate
a consensus read (Figure 4.4).
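The consensus idea behind CCS can be illustrated with a deliberately simplified sketch: given several error-prone subreads that are assumed to be already aligned to the same length, a per-position majority vote yields a more accurate consensus. Real CCS software aligns the subreads and uses a statistical model rather than a simple vote.

```python
from collections import Counter

def simple_consensus(aligned_subreads):
    """Per-position majority vote over subreads of equal (pre-aligned) length.
    A toy stand-in for the statistical consensus used by real CCS software."""
    consensus = []
    for column in zip(*aligned_subreads):
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)

# Three subreads of the same template, each carrying a different random error.
subreads = ["ACGTTAGC", "ACGTAAGC", "ACCTTAGC"]
print(simple_consensus(subreads))  # ACGTTAGC
```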
FIGURE 4.4
PacBio sequencing library preparation and sequencing. The library prep process mostly involves
ligation of hairpin loop adapters to create the SMRTbell structure. A SMRTbell template can be
sequenced using either circular consensus sequencing (CCS, shown here) or continuous long
reads (CLR) mode. In the CCS mode, a template undergoes multiple passes to produce error-
prone subreads in each pass followed by generation of accurate consensus reads. (Adapted by
permission from Springer Nature Customer Service Centre GmbH: Springer Nature, Nature
Biotechnology, Accurate circular consensus long-read sequencing improves variant detection and
assembly of a human genome, Aaron M. Wenger et al., Copyright 2019.)
FIGURE 4.5
Nanopore sequencing. Illustrated here is sequencing with a MinION flow cell which contains
512 channels with 4 nanopores in each channel. The electrically insulating membrane that carries
the nanopores is supported by an array of microscaffolds which is underlain by a sensor chip.
There are electrodes on the sensor chip that correspond to the individual channels, and electrical
signals from the electrodes are recorded by the application-specific integrated circuit or ASIC.
The recorded signals are then analyzed to make basecalls. (Adapted by permission from Springer
Nature Customer Service Centre GmbH: Springer Nature, Nature Biotechnology, Nanopore
sequencing technology, bioinformatics and applications, Yunhao Wang et al., Copyright 2021.)
4.2.3.2 Implementation
ONT currently offers three main devices at different data throughput
levels: MinION, GridION, and PromethION. MinION, at the lower end, is a
USB drive-sized device that holds one flow cell. The GridION, a level up, can
hold up to five flow cells. The flow cells used in MinION and GridION are of the
same type, which contains 2,048 nanopores in 512 channels. The PromethION at
the high end has the capacity to run up to 48 flow cells simultaneously. The
flow cell used in PromethION has more capacity with 12,000 nanopores in
3,000 channels. For both the MinION/GridION and PromethION flow cells,
in each channel only one pore can perform sequencing at a time. Besides
these flow cells, for commonly conducted, smaller-scale sequencing, ONT
also offers a flow cell dongle called Flongle, which provides an adapter for
MinION/GridION to allow use of smaller and lower-cost flow cells. The
Flongle adapter has 126 channels allowing simultaneous sequencing from
126 nanopores.
ONT provides two modes of sequencing, with one for generation of long
reads (currently defined as below 100 kb) and the other for ultra-long reads
(≥100 kb). The length of input DNA determines which mode to use. The
sample library prep process for long-read sequencing involves fragmenta-
tion and/or size selection (optional), end repair, A-tailing, and ligation of
sequencing adapters. Ultra-long sequencing library prep requires extraction
of ultra-high-molecular-weight DNA. While the steps involved in ultra-
long sequencing library prep may continue to evolve, the current procedure
includes a transposition step that cleaves the template and attaches tags to the
cleaved ends simultaneously, a subsequent step to add sequencing adapters
to the tagged ends, and lastly an overnight elution of the DNA library prior
to loading into a flow cell.
the newest nanopore (R10.4) available at the time of writing, the raw read
error rate is around 1%, corresponding to 99% (Q20) accuracy. Homopolymer error is the
most common error type in ONT sequencing. In terms of data output, the
MinION/GridION can typically produce 10–20 Gb data (30 Gb maximum
at the current moving speed of 250 bases/second) from each flow cell. The
PromethION has a throughput of 50–100 Gb (170 Gb maximum) per flow
cell. With a top loading capacity of 48 flow cells, the data output from the
PromethION can exceed that of PacBio Sequel II and Illumina NovaSeq 6000.
The cost of sequencing on the MinION/GridION platforms is US$45–90 per
Gb, and US$13–40 per Gb on the PromethION. Again this calculation is based
on the list price per flow cell at the time of writing divided by the typical data
output on each platform.
4.2.4.2 Implementation
The library construction process in this technology is similar to other NGS
technologies, involving ligation of platform-specific primers to DNA shotgun
fragments. The library fragments are then clonally amplified by emulsion
PCR onto the surface of 3-micron diameter beads. The microbeads coated
with the amplified sequence templates are then deposited into an Ion chip.
Each Ion chip has a liquid flow chamber that allows influx and efflux of native
nucleotides (introduced one at a time), along with DNA polymerase and
buffer that are needed in the sequencing-by-synthesis process. To measure the
possible pH change associated with each nucleotide introduction, millions of
pH microsensors are manufactured on the chip bottom using standard
processes from the semiconductor industry.
employs five chip types: Ion 510, 520, 530, 540, and 550, with the 550 chip
producing the most data (20–25 Gb) and being the most cost effective and
the 510 having the least amount of data (0.3–1 Gb) and being the least cost
effective. The Ion 318 Dx chip used on the PGM Dx system, similar to the 510
chip in throughput, generates 600 Mb to 1 Gb of data and is suitable for running
molecular diagnostic tests that do not require a lot of data.
Sequencing Target (DNA or RNA) → Fragmentation* → Size Selection* → End Repair/A-Tailing → Adapter Ligation → Library Enrichment* → Sequencing → Data Analysis
FIGURE 4.6
The general workflow of an NGS experiment. For library construction, only core steps shared
by the different sequencing platforms are shown. The steps marked with asterisks are not used
in some library construction protocols. Adapters ligated to sequencing targets are specific to
each platform. There are other library construction strategies or procedures, such as use of non-
ligation or target sequence capture, that are not shown here.
5’-dT overhangs and thereby avoid self-ligation of DNA fragments or
adapters. This AT overhang-based adapter ligation process, however, tends to
be biased against DNA fragments that start with a T [8]. The sequencing of
large RNA species, such as mRNAs or long non-coding RNAs, is also affected
by this bias, as cDNA molecules reverse transcribed from these species are
also subjected to the same adapter ligation process. Small RNA sequencing is
not affected by this bias, as the ligation of adapters in small RNA sequencing
library preparation is carried out prior to the reverse transcription step. The
small RNA adapter ligation step, however, introduces a different type of bias,
which affects some small RNAs in a sequence-specific manner. This sequence
specificity stems from small RNA secondary and tertiary structure, which is also
affected by temperature, the concentration of cations, and destabilizing organic
agents (such as DMSO) in the ligation reaction mixture. The efficiency of
small RNA adapter ligation is influenced by this secondary and tertiary
structure [9].
PCR biases. After adapter ligation, the DNA library is usually enriched by
PCR for sequencing on most of the current NGS platforms. PCR, based on the
use of DNA polymerases, is known to be biased against DNA fragments that
are extremely GC- or AT-rich [10]. This can lead to variation in the coverage
of different genomic regions and under-representation of those regions that
are GC- or AT-rich. While optimization of PCR conditions can ameliorate
this bias to some degree, especially for high-GC regions, it can only
be eliminated via adoption of a PCR-free workflow. To achieve this, Illumina
provides PCR-free options. For single-molecule sequencing carried out on
the PacBio SMRT and ONT platforms, PCR amplification is typically not
required unless the input amount of starting DNA/RNA is low.
The sequencing signal processing and basecalling steps may also intro-
duce bias. For example, on the Illumina platform, basecalling results can
be affected by color crosstalk, spatial crosstalk, as well as phasing and pre-
phasing [11, 12]. On MiSeq, for instance, four images are generated from
four detection channels after each cycle, which need to be overlaid to extract
signal intensities for basecalling. This procedure is complicated by three
factors: 1) signals from the four channels are not totally independent, as there
is crosstalk between A and C, and between G and T channels, due to the over-
lapping in the emission spectra of their fluorescent labels; 2) neighboring
clusters may partially overlap leading to spatial crosstalk between adja-
cent clusters; and 3) signals from a particular cycle are also dependent on
signals from the cycles before and after, due to phasing and pre-phasing.
While Illumina’s proprietary software is efficient at dealing with these
factors for basecalling, there are other commercial and open-source tools
that employ different algorithms for these tasks and generate varying results
[13]. The algorithms these methods use (including Illumina’s)
make different assumptions about signal distribution, which may not strictly
reflect the collected data, and therefore introduce method-specific bias into
basecalling.
4.5.6 Metagenomics
To study a community of microorganisms like the microbiome in the gut or
those in a bucket of seawater, where extremely large but unknown numbers
of species are present, metagenomics takes a brute-force approach that involves the study of all
genomes contained in such a community. In recent years the field
of metagenomics has been greatly fueled by the development of NGS tech-
nologies. By quickly sequencing everything in a metagenome, researchers can
get a comprehensive profile of the makeup and functional state of a micro-
bial community. Compared to NGS data generated from a single genome,
metagenomic data are much more complex. Chapter 15 focuses on
metagenomics NGS data analysis.
References
1. Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown
CG, Hall KP, Evers DJ, Barnes CL, Bignell HR et al. Accurate whole human
genome sequencing using reversible terminator chemistry. Nature 2008,
456(7218):53–59.
2. Picard toolkit (https://broadinstitute.github.io/picard/)
3. Eid J, Fehr A, Gray J, Luong K, Lyle J, Otto G, Peluso P, Rank D, Baybayan P,
Bettman B et al. Real-time DNA sequencing from single polymerase molecules.
Science 2009, 323(5910):133–138.
4. Wenger AM, Peluso P, Rowell WJ, Chang PC, Hall RJ, Concepcion GT, Ebler J,
Fungtammasan A, Kolesnikov A, Olson ND et al. Accurate circular consensus
long-read sequencing improves variant detection and assembly of a human
genome. Nat Biotechnol 2019, 37(10):1155–1162.
5. Jain M, Fiddes IT, Miga KH, Olsen HE, Paten B, Akeson M. Improved data
analysis for the MinION nanopore sequencer. Nat Methods 2015, 12(4):351–356.
6. Rang FJ, Kloosterman WP, de Ridder J. From squiggle to basepair: computa-
tional approaches for improving nanopore sequencing read accuracy. Genome
Biol 2018, 19(1):90.
7. Poptsova MS, Il’icheva IA, Nechipurenko DY, Panchenko LA, Khodikov MV,
Oparina NY, Polozov RV, Nechipurenko YD, Grokhovsky SL. Non-random
DNA fragmentation in next-generation sequencing. Sci Rep 2014, 4:4532.
8. Seguin-Orlando A, Schubert M, Clary J, Stagegaard J, Alberdi MT, Prado JL,
Prieto A, Willerslev E, Orlando L. Ligation bias in Illumina next-generation
DNA libraries: implications for sequencing ancient genomes. PLoS One 2013,
8(10):e78575.
In general, NGS data analysis is divided into three stages. In the primary
analysis stage, bases are called based on deconvolution of the optical or
physicochemical signals generated in the sequencing process. Regardless of
sequencing platforms or applications, the basecall results are usually stored
in the standard FASTQ format. Each FASTQ file contains a massive number of
reads, i.e., sequence readouts of DNA fragments sampled from a sequencing
library. In the secondary analysis stage, reads in the FASTQ files are quality
checked, preprocessed, and then mapped to a reference genome. The data
quality check or control (QC) step involves examining a number of sequence
read quality metrics. Based on the QC results, the sequencing files are
preprocessed in order to filter out low-quality reads, trim off portions of reads
that have low-quality basecalls, and remove adapter sequences or other arti-
ficial sequences (such as PCR primers) if they exist. Subsequent mapping (or
aligning) of the preprocessed reads to a reference genome aims to determine
where in the genome the reads come from, the critical information required
for most tertiary analysis (except de novo genome assembly). The stage of
tertiary analysis is highly application-specific and detailed in the chapters of
Part III. This chapter focuses on steps in the primary and secondary stages,
especially on reads QC, preprocessing, and mapping, which are common and
shared among most applications (Figure 5.1).
FIGURE 5.1
General overview of NGS data analysis. The steps in the dashed box are common steps conducted
in primary and secondary analysis.
especially important for the long-read sequencing platforms, with the ONT
platform serving as a good example with active basecalling algorithm devel-
opment. Multiple machine learning-based basecallers have been developed
by ONT, including Albacore, Scappie, Flappie, and Bonito, besides Guppy. At
the time of writing, Guppy offers faster speed than the others while maintaining
relatively high accuracy, whereas Bonito, the latest iteration of basecallers
from ONT, uses another deep learning approach, convolutional
neural networks (CNNs), to achieve even better accuracy than Guppy but at a slower
speed. Other open-source basecalling algorithms developed by the community
include DeepNano [3], Nanocall [4], Chiron [5], and Causalcall [6]. As a
result of these algorithmic development efforts, basecalling accuracy has
increased substantially.
Most end users do not usually intervene in the basecalling process but
rather focus on analysis of the basecalling results. Regardless of the sequen-
cing platform, basecalling results are usually reported in the universally
accepted FASTQ format. In file size, a typical compressed FASTQ file is usu-
ally in the multi-GB range and may contain millions to billions of reads. In a
nutshell, the FASTQ format is a text-based format, containing the sequence of
each read along with the confidence score of each base. Figure 5.2 shows an
example of one such read sequence reported in the FASTQ format.
The confidence (or quality) score, as a measure of the probability of making
an erroneous basecall, is an essential component of the FASTQ format. The
@HISEQ:131:C5NWFACXX:1:1101:3848:2428 1:N:0:CGAGGCTGCTCTCTAT
CTTTTATCAGACATATTTCTTAGGTTTGAGGGGGAATGCTGGAGATTGTAATGGGTATGGAGACATATCATATAAGTAATGCTAGG
GTGAGTGGTAGGAAG
+
BB7FFFFB<F<FBFBBFBFBFFFIFFFFIIIFF<FBFFFBFIFFBFFFIFFFBFB07<BFFF7BBFFFBFFFFFF<BFBFBBBBBB
B'77B<770<BBBBB
FIGURE 5.2
The FASTQ sequence read report format. Shown here is one read generated from an NGS
experiment. A FASTQ file usually contains millions to billions of such reads, with each
containing several lines as shown here. Line 1, starting with the symbol ‘@,’ contains sequence
ID and descriptor. Line 2 is the read sequence. Line 3 starts with the ‘+’ symbol, which
may optionally be followed by the sequence ID and description. Line 4 lists confidence (or quality) scores
for each corresponding base in the read sequence (Line 2). For Illumina–generated FASTQ files,
the sequence ID in Line 1 in this example basically identifies where the sequence was generated.
This information includes the equipment (“HISEQ” in the above example), sequence run ID
(“131”), flow cell ID (“C5NWFACXX”), flow cell lane (“1”), tile number within the lane (“1101”),
x/y-coordinates of the sequence cluster within the tile (“3848” and “2428,” respectively). The
ensuing descriptor contains information about the read number (“1” is for single read here; for
paired-end read it can be 1 or 2), whether the read is filtered (“N” here means it is not filtered),
control number (“0”), and index (or sample barcode) sequence (“CGAGGCTGCTCTCTAT”).
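As a concrete illustration of the four-line record structure shown in Figure 5.2, a minimal Python reader for (optionally gzip-compressed) FASTQ files might look like the following sketch; the file name used in the usage example is hypothetical.

```python
import gzip

def read_fastq(path):
    """Yield (read_id, sequence, quality_string) tuples from a FASTQ file.
    Assumes the simple four-line-per-record layout shown in Figure 5.2."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip()
            seq = handle.readline().rstrip()
            plus = handle.readline()          # '+' separator line (ignored)
            quals = handle.readline().rstrip()
            if not quals:                     # end of file
                break
            yield header[1:], seq, quals      # drop the leading '@'

for read_id, seq, quals in read_fastq("sample_R1.fastq.gz"):
    print(read_id, len(seq))
    break
```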
NGS basecall quality score (Q-score) is similar to the Phred score used in
Sanger sequencing and is calculated as Q = −10 × log10(PErr),
where PErr is the probability of making a basecall error. Based on this equation,
a 1% chance of incorrectly calling a base is equivalent to a Q-score of 20, and
Q30 means a 1/1000 chance of making a wrong call. Usually for a basecall
to be reliable, it has to have a Q-score of at least 20. High-quality calls have
Q-scores above 30, usually up to 40. For better visualization of Q-scores
associated with their corresponding basecalls, they are usually encoded with
ASCII characters. While there have been different encoding scheme versions
(e.g., Illumina 1.0, 1.3, and 1.5), currently the NGS field has mostly settled on
the use of the same encoding scheme used by Sanger sequencing (Figure 5.3).
In the FASTQ example shown in Figure 5.2, the first base, C, has an encoded
Q-score of B, i.e., 33.
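The relationship between error probability, Q-score, and its ASCII encoding (Figure 5.3) can be captured in a few lines of Python; this is a sketch of the Phred+33 convention described here, not any vendor's implementation.

```python
import math

def q_from_error(p_err):
    """Phred-style quality score from the probability of a basecall error."""
    return -10 * math.log10(p_err)

def encode_q(q):
    """Encode a Q-score as a single ASCII character (Phred+33 scheme)."""
    return chr(int(round(q)) + 33)

def decode_q(char):
    """Recover the Q-score from its ASCII representation."""
    return ord(char) - 33

print(q_from_error(0.01))   # 20.0 -> 1% error chance
print(q_from_error(0.001))  # 30.0 -> 1/1000 error chance
print(encode_q(33))         # 'B', as for the first base in Figure 5.2
print(decode_q("B"))        # 33
```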
To estimate PErr, a control lane or spike control is usually used in Illumina
sequencing to generate a basecall score calibration table for lookup.
A precomputed calibration table can also be used in the absence of a control
lane and spike control. Because each platform calibrates its Q-scores differently,
if data from different platforms are to be compared with each other or analyzed in an integrated
fashion, their Q-scores need to be recalibrated. To carry out the recalibration,
a subset of reads that map to regions of the reference genome containing no
known SNPs is used, and any mismatch between these reads and the reference is then
attributed to basecall error for the purpose of recalibration.
; < = > ? @ A B C D E F G H I J
26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41
FIGURE 5.3
Encoding of basecall quality scores with ASCII characters. ASCII stands for American Standard
Code for Information Interchange, and an ASCII code is the numerical representation of a
character in computers (e.g., the ASCII code of the letter ‘B’ is 66). In this encoding scheme,
the ASCII character codes equal to Q-scores plus 33. Current major NGS platforms, including
Illumina (after ver. 1.8), use this encoding scheme for Q-score representation.
filtering, per-read quality pruning, etc. More recent NGS QC tools, such as
seqQscorer [10], apply machine learning approaches in an attempt to achieve
better understanding of quality issues and automated quality control. Tools
such as FQC Dashboard [11] and MultiQC [12] serve as aggregators of QC
results from other tools (e.g., FastQC) and present them in a single report.
After data QC, to perform stand-alone preprocessing tasks such as adapter
trimming and read filtering, tools such as cutadapt [13] and Trimmomatic
[14] are often used.
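To give a flavor of how these tools are typically invoked, the sketch below wraps FastQC, fastp, and cutadapt calls in Python's subprocess module; the file names and the adapter sequence passed to cutadapt are placeholders that would need to be adjusted for a real project.

```python
import subprocess

r1, r2 = "sample_R1.fastq.gz", "sample_R2.fastq.gz"   # hypothetical inputs

# Generate per-file QC reports with FastQC.
subprocess.run(["fastqc", r1, r2, "--outdir", "qc_reports"], check=True)

# All-in-one QC plus preprocessing (adapter trimming, quality filtering) with fastp.
subprocess.run([
    "fastp", "-i", r1, "-I", r2,
    "-o", "trimmed_R1.fastq.gz", "-O", "trimmed_R2.fastq.gz",
    "--html", "fastp_report.html", "--json", "fastp_report.json",
], check=True)

# Stand-alone adapter trimming with cutadapt (placeholder adapter sequence).
subprocess.run([
    "cutadapt", "-a", "AGATCGGAAGAGC",
    "-o", "cutadapt_R1.fastq.gz", r1,
], check=True)
```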
Some of the QC tools mentioned above, including FastQC and fastp, can be
used for both short and long reads. There are also tools such as NanoQC (part
of NanoPack) [15] specifically designed for long-read QC. Besides NanoQC,
NanoPack has a set of utilities for trimming, filtering, summarization, visual-
ization, etc. PycoQC [16] is another tool that provides interactive QC metrics
for ONT data. For PacBio long reads, SequelTools [17] provides QC metrics, as
well as other functions such as read filtering, summarization, and visualization.
5.3 Read Mapping
After the data is cleaned up, the next step is to map, or align, the reads
to a reference genome if it is available, or conduct de novo assembly. As
shown in Figure 5.1, most NGS applications require read mapping to a
reference genome prior to conducting further analysis. The purpose of this
mapping process is to locate origins of the reads in the genome. Compared
to searching for the location(s) of a single or a small number of sequences
in a genome by tools such as BLAST, simultaneous mapping of millions
of NGS reads, sometimes very short, to a genome is not trivial. A further
challenge comes from the fact that any particular genome from which NGS
reads are derived deviates from the reference genome at many sites because
of polymorphism and mutation. As a result, any algorithm built for this task
needs to accommodate such sequence deviations. To further complicate the
situation, sequencing errors are often indistinguishable from true sequence
deviations.
FIGURE 5.4
Major steps in mapping NGS reads. In the first stage, the reference genome is indexed. This is
achieved through extracting seed sequences from the reference genome (a) and subsequently
the seed sequences are indexed using suffix tree or hash table (b). In the second stage, seed
sequences are extracted from reads (c), which are then used to search the indexed reference
genome for possible matching locations (d). In the example shown, each seed extracted from
read 1 is searched to locate their potential locations in the indexed genome. Based on their
adjacency some of the locations are excluded (red X) as such locations are unlikely to span the
read. In the last stage, the adjacent seeds are chained and the gap sequences between the seeds
are inspected for mismatches (red X), based on which pre-alignment filters determine whether to
accept the alignment between the read and the genomic region (e). Finally, the alignment
is subject to verification to generate the alignment result, including sequence differences and their
locations. (From Alser, M., Rotman, J., Deshpande, D. et al. Technology dictates algorithms: recent
developments in read alignment. Genome Biol 22, 249 (2021). https://doi.org/10.1186/s13
059-021-02443-7. Used under the terms of the Creative Commons Attribution 4.0 International
License, http://creativecommons.org/licenses/by/4.0/, © 2021 Alser et al.)
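The seed-and-extend idea in Figure 5.4 can be boiled down to a toy Python sketch: index fixed-length seeds from the reference in a hash table, look up seeds from each read, and verify the implied placement by counting mismatches. Real aligners add many refinements (gapped alignment, minimizers, quality-aware scoring), so this is only a conceptual illustration with made-up sequences.

```python
from collections import defaultdict

K = 4  # seed length (tiny here; real aligners use much longer seeds)

def index_reference(ref):
    """Hash-table index: seed sequence -> list of positions in the reference."""
    index = defaultdict(list)
    for i in range(len(ref) - K + 1):
        index[ref[i:i + K]].append(i)
    return index

def map_read(read, ref, index, max_mismatches=2):
    """Return (position, mismatches) of the best candidate placement."""
    best = None
    for offset in range(0, len(read) - K + 1, K):           # extract seeds
        for pos in index.get(read[offset:offset + K], []):  # look up each seed
            start = pos - offset                             # implied read start
            if start < 0 or start + len(read) > len(ref):
                continue
            mism = sum(a != b for a, b in zip(read, ref[start:start + len(read)]))
            if mism <= max_mismatches and (best is None or mism < best[1]):
                best = (start, mism)
    return best

ref = "ACGTACGTTAGCCGATTACGGAT"
idx = index_reference(ref)
print(map_read("TAGCCGATTACG", ref, idx))  # (8, 0): exact match at position 8
```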
FIGURE 5.6
How Burrows–Wheeler transform (BWT) works for aligning NGS reads. Panel (a) shows the
BWT procedure for a short example sequence ‘acaacg.’ Panel (b) shows how to use BWT to
identify the locations of read sequences prefixed by ‘aac.’ (Adapted by permission from Springer
Nature Customer Service Centre GmbH: Springer Nature, Genome Biology, Ultrafast and
memory-efficient alignment of short DNA sequences to the human genome, B Langmead, C
Trapnell, M Pop, and SL Salzberg, copyright 2009.)
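Constructing the Burrows–Wheeler transform itself can be sketched with the textbook rotation-and-sort approach, shown below for the example string ‘acaacg’ from Figure 5.6; practical indexers build the BWT from a suffix array and pair it with the FM index rather than enumerating all rotations.

```python
def bwt(text, terminator="$"):
    """Burrows-Wheeler transform via sorting all rotations (illustrative only;
    practical tools derive it from a suffix array instead)."""
    text += terminator
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)

print(bwt("acaacg"))  # 'gc$aaac': last column of the sorted rotation matrix
```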
TABLE 5.1
Commonly Used Alignment Methods
Name Description Reference
Minimap2 A general-purpose alignment tool for both long and short reads. [18]
Uses hash table for reference genome indexing. Achieves fast
speed through use of minimizers. Splice-aware and can be used
for long RNA-seq reads
BWA-MEM2 The often-used algorithm in the BWA package designed for [48]
short reads. Employs suffix array lookup of seed sequences and
Smith–Waterman-based extended alignment
Bowtie2 A short-read aligner based on the use of BWT and the FM index [27]
for reference genome indexing, and Smith–Waterman or
Needleman–Wunsch for local or global alignment
SOAP2 Uses BWT compression to index the reference genome to [28]
achieve high speed for short-read alignment
Stampy A short-read aligner that has high sensitivity in mapping [36]
reads that contain variation(s) or diverge from the reference
sequence. Uses fast hashing to build reference index and a
statistical model for alignment
NGMLR A mapper designed for PacBio and ONT long reads. Splits long [42]
reads to shorter anchor sequences for lookup using hashing
and then deploys Smith–Waterman for final alignment
GraphMap A long-read aligner that uses spaced seeds for hashing-based [39]
index construction and lookup, and then performs graph-based
mapping and progressive refinement to achieve alignment of
long but error-prone reads
LAST Implements the standard seed-and-extend approach but with [41]
the use of adaptive seeds instead of fixed-length seeds
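To show how two of the aligners in Table 5.1 are typically run, the sketch below wraps their command lines in Python's subprocess module for consistency with the other sketches in this chapter; the reference, index, and read file names are placeholders.

```python
import subprocess

# Short paired-end reads with Bowtie2: build an FM index, then align.
subprocess.run(["bowtie2-build", "reference.fa", "ref_index"], check=True)
subprocess.run([
    "bowtie2", "-x", "ref_index",
    "-1", "trimmed_R1.fastq.gz", "-2", "trimmed_R2.fastq.gz",
    "-S", "short_reads.sam",
], check=True)

# ONT long reads with minimap2: the map-ont preset, SAM output via -a.
with open("long_reads.sam", "w") as sam_out:
    subprocess.run(
        ["minimap2", "-ax", "map-ont", "reference.fa", "ont_reads.fastq.gz"],
        stdout=sam_out, check=True)
```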
TABLE 5.2
Mandatory Fields in the SAM/BAM Alignment Section
Col Field Type Description
1 QNAME String Query (read) name
2 FLAG Integer Bitwise flag recording mapping status
3 RNAME String Reference sequence name
4 POS Integer 1-based leftmost mapping position
5 MAPQ Integer Mapping quality
6 CIGAR String CIGAR string describing the alignment
7 RNEXT String Reference name of the mate/next read
8 PNEXT Integer Position of the mate/next read
9 TLEN Integer Observed template length
10 SEQ String Segment (read) sequence
11 QUAL String ASCII-encoded base quality (Phred+33)
A
Coor 12345678901234 5678901234567890123456789012345
Ref TACGATCGAAGGTA**ATGACATGCTGGCATGACCGATACCGCGACA
+r001/1 CGAAGGTACTATGA*ATG
+r002 cggAAGGTA*TATGA
+r003 TGACAT..............TACCG
-r001/2 ACCGCGACA
B
@HD VN:1.6 SO:coordinate
@SQ SN:ref LN:45
r001 99 ref 7 30 8M2I4M1D3M = 37 39 CGAAGGTACTATGAATG *
r002 0 ref 9 30 3S6M1P1I4M * 0 0 CGGAAGGTATATGA *
r003 0 ref 16 30 6M14N5M * 0 0 TGACATTACCG *
r001 147 ref 37 30 9M = 7 -39 ACCGCGACA * NM:i:1
FIGURE 5.7
The SAM/BAM format for storing NGS reads alignment results. The alignment shown in panel
(a) is captured by the SAM format shown in panel (b). In panel (a), the reference sequence is
shown on the top with the corresponding coordinates. Among the sequences derived from it,
r001/1 and r001/2 are paired reads. The lowercase bases in r002 do not match the reference
and as a result are clipped in the alignment process. The read r003 represents a spliced alignment.
In panel (b), the SAM format contains 11 mandatory fields, which are explained in more detail in
Table 5.2.
reference sequence length, respectively. For the alignment section, while most
of the fields listed in Table 5.2 are self-explanatory, some fields may not be so
clear at first glance. The FLAG field uses a simple decimal number to track
the status of the bitwise flags used in the mapping process, such as whether
the template has multiple segments in sequencing (like the paired read r001 in the example) or whether the SEQ
is reverse complemented. To check the status and meaning of these flags,
the decimal number needs to be converted to its binary counterpart. For the
POS field, SAM uses a 1-based coordinate system, that is, the first base of the
reference sequence is counted as 1 (instead of 0). The MAPQ is the mapping
quality score, which is calculated similarly to the Q-score introduced earlier
(MAPQ =−10 × log10(PMapErr)). The CIGAR (or Concise Idiosyncratic Gapped
Alignment Report) field describes in detail how the SEQ maps to the reference
sequence, with the marking of additional bases in the SEQ that are not
present in the reference, or missing reference bases in the SEQ. In the example
above, the CIGAR field for r001/1 shows a value of “8M2I4M1D3M,” which
means the first eight bases matching the reference, the next two bases being
insertions, the next four matching the reference, the next one being a deletion,
and finally the last three again being matches. For more details (such as
those on the different FLAG values) and the full specification of the SAM/BAM
format, refer to the documentation from the SAM/BAM Format Specification
Working Group. It should also be noted that BAM may be used to store
unaligned raw reads as an “off-label” use of the format.
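A small Python sketch can make the FLAG and CIGAR conventions more concrete by decoding the FLAG value used in the example (99) and splitting the CIGAR string “8M2I4M1D3M” of read r001/1; the bit definitions follow the SAM specification, while the helper functions themselves are purely illustrative.

```python
import re

# Bit definitions from the SAM specification.
FLAG_BITS = {
    0x1: "paired", 0x2: "proper pair", 0x4: "unmapped", 0x8: "mate unmapped",
    0x10: "reverse strand", 0x20: "mate reverse strand",
    0x40: "first in pair", 0x80: "second in pair",
    0x100: "secondary", 0x200: "QC fail", 0x400: "duplicate", 0x800: "supplementary",
}

def decode_flag(flag):
    """List the SAM flag bits that are set in a decimal FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) tuples."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

print(decode_flag(99))            # ['paired', 'proper pair', 'mate reverse strand', 'first in pair']
print(parse_cigar("8M2I4M1D3M"))  # [(8, 'M'), (2, 'I'), (4, 'M'), (1, 'D'), (3, 'M')]
```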
FIGURE 5.8
Detection of duplicate reads after the mapping process. Depth of coverage of the reference
genomic region is shown on the top. Mapped reads, along with a set of duplicate reads that map
to the same area, are shown underneath. The green and red colors denote the two DNA strands.
(Generated with CLC Genomics Workbench and used with permission from CLC Bio.)
Another common post-mapping operation is the detection of duplicate reads after the mapping step (Figure 5.8). As technical duplicates
caused by PCR over-amplification and true biological duplicates are indistin-
guishable, researchers should exert caution when making decisions on
whether to remove duplicate reads from further analysis. While removing
duplicate reads can lead to increased performance in subsequent analysis in
many cases (such as variant discovery), in circumstances that involve less
complex or mostly enriched sequencing targets, including those from an
extremely small genome, or those used in RNA-seq or ChIP-seq, removing
them can lead to loss of true biological information.
Furthermore, a variety of other operations can also be performed on
SAM/BAM files. These operations are usually provided by SAMtools and Picard,
two widely used packages for manipulating SAM/BAM files, and include, among
others, sorting, indexing, merging, and conversion between the SAM and BAM formats.
FIGURE 5.9
The pileup file format as generated from SAMtools. A pileup file shows how sequenced bases in
mapped reads align with the reference sequence at each genomic coordinate. The columns are
(from left to right): chromosome (or reference name), genomic coordinate (1-based), reference
base, total number of reads mapped to the base position, read bases, and their call qualities. In
the read bases column, a dot signifies a match to the reference base on the forward strand, a comma
a match on the reverse (complementary) strand, and letters (A, C, G, T) denote mismatches. Additionally, the ‘$’ symbol marks the end of a read, while
‘^’ marks the start of a read, with the character after the ‘^’ representing mapping quality.
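Parsing such a pileup line in Python is straightforward and helps make the column layout concrete; this sketch counts only simple matches and mismatches, ignores indel records for brevity, and uses a made-up example line.

```python
def summarize_pileup_line(line):
    """Summarize one pileup line: matches ('.' and ',') vs. mismatches.
    Read-start ('^' plus a mapping-quality character) and read-end ('$')
    markers are stripped; indel records are ignored for simplicity."""
    chrom, pos, ref_base, depth, bases, quals = line.rstrip("\n").split("\t")[:6]
    cleaned, i = [], 0
    while i < len(bases):
        if bases[i] == "^":      # start-of-read marker + mapping quality char
            i += 2
        elif bases[i] == "$":    # end-of-read marker
            i += 1
        else:
            cleaned.append(bases[i])
            i += 1
    matches = sum(c in ".," for c in cleaned)
    mismatches = sum(c.upper() in "ACGT" for c in cleaned)
    return chrom, int(pos), ref_base, int(depth), matches, mismatches

example = "chr1\t10521\tA\t6\t..,.Tt\tFFB<F7"   # hypothetical pileup line
print(summarize_pileup_line(example))           # ('chr1', 10521, 'A', 6, 4, 2)
```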
SAMtools and Picard are very versatile in handling and analyzing SAM/
BAM files. In fact, the steps mentioned earlier, that is, generation of alignment
summary statistics and removal of multireads and duplicate reads, can be
directly conducted with these tools. For example, both SAMtools and Picard
have utilities to detect and remove duplicate reads, called markdup and
MarkDuplicates, respectively. These utilities mark reads that map to
the same starting genomic locations as duplicates.
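As an example, the SAMtools route to duplicate marking typically chains several subcommands, since markdup relies on mate information added by fixmate; the sketch below shows one such chain with placeholder BAM file names, and Picard's MarkDuplicates could be substituted for the marking step.

```python
import subprocess

def mark_duplicates_with_samtools(input_bam):
    """Name-sort, add mate tags, position-sort, then mark duplicates and index."""
    cmds = [
        ["samtools", "sort", "-n", "-o", "name_sorted.bam", input_bam],
        ["samtools", "fixmate", "-m", "name_sorted.bam", "fixmate.bam"],
        ["samtools", "sort", "-o", "position_sorted.bam", "fixmate.bam"],
        ["samtools", "markdup", "position_sorted.bam", "marked.bam"],
        ["samtools", "index", "marked.bam"],
    ]
    for cmd in cmds:
        subprocess.run(cmd, check=True)

mark_duplicates_with_samtools("mapped_reads.bam")  # hypothetical input BAM
```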
Lastly, in terms of examining mapping results, nothing can replace direct
visualization of the mapped reads in the context of the reference genome.
While a text-based alignment viewer, such as that provided by SAMtools,
offers a simple way to examine a small genomic region, direct graphical
visualization of mapping results by overlaying mapped read sequences
against the reference genome provides a more intuitive way of examining
the data and looking for patterns. This visualization process serves mul-
tiple purposes, including additional data QC, experimental procedure valid-
ation, and mapping pattern recognition. Commonly used visualization tools
include Integrative Genomics Viewer (IGV) [51], Artemis [52], SeqMonk
[53], JBrowse [54], and Tablet [55]. The UCSC and Ensembl genome browsers
also provide visualization options by adding customized BAM tracks.
Post-mapping data QC tools such as Qualimap 2 [56] also provide visual
summaries on key metrics including overall coverage across the reference
genome.
5.4 Tertiary Analysis
After the sequence read mapping step, subsequent analyses vary greatly
with application. For example, the workflow for RNA-seq data analysis is
different from that for mutation and variant discovery. Therefore, it is not
possible to provide a “typical” workflow for all NGS data analyses in this
chapter beyond the common steps of data QC, preprocessing, and read
mapping. Chapters in Part III provide details on application-specific tertiary
analytic steps and commonly used tools.
References
1. Cacho A, Smirnova E, Huzurbazar S, Cui X. A comparison of base- calling
algorithms for Illumina sequencing technology. Brief Bioinform 2016, 17(5):
786–795.
2. Wick RR, Judd LM, Holt KE. Performance of neural network basecalling tools
for Oxford Nanopore sequencing. Genome Biol 2019, 20(1):129.
3. Boza V, Brejova B, Vinar T. DeepNano: deep recurrent neural networks for
base calling in MinION nanopore reads. PLoS One 2017, 12(6):e0178751.
4. David M, Dursi LJ, Yao D, Boutros PC, Simpson JT. Nanocall: an open
source basecaller for Oxford Nanopore sequencing data. Bioinformatics 2017,
33(1):49–55.
5. Teng H, Cao MD, Hall MB, Duarte T, Wang S, Coin LJM. Chiron: translating
nanopore raw signal directly into nucleotide sequence using deep learning.
GigaScience 2018, 7(5):giy037.
6. Zeng J, Cai H, Peng H, Wang H, Zhang Y, Akutsu T. Causalcall: nanopore
basecalling using a temporal convolutional network. Front Genet 2019, 10:1332.
7. FastQC: A Quality Control Tool for High Throughput Sequence Data [Online]
(www.bioinformatics.babraham.ac.uk/projects/fastqc/)
8. Patel RK, Jain M. NGS QC Toolkit: a toolkit for quality control of next gener-
ation sequencing data. PLoS One 2012, 7(2):e30619.
9. Chen S, Zhou Y, Chen Y, Gu J. fastp: an ultra-fast all-in-one FASTQ pre-
processor. Bioinformatics 2018, 34(17):i884–i890.
10. Albrecht S, Sprang M, Andrade- Navarro MA, Fontaine JF. seqQscorer:
automated quality control of next-generation sequencing data using machine
learning. Genome Biol 2021, 22(1):75.
11. Brown J, Pirrung M, McCue LA. FQC Dashboard: integrates FastQC results
into a web-based, interactive, and extensible FASTQ quality control tool.
Bioinformatics 2017, 33(19):3137–3139.
12. Ewels P, Magnusson M, Lundin S, Kaller M. MultiQC: summarize analysis
results for multiple tools and samples in a single report. Bioinformatics 2016,
32(19):3047–3048.
33. Smith AD, Xuan Z, Zhang MQ. Using quality scores and longer reads improves
accuracy of Solexa read mapping. BMC Bioinformatics 2008, 9:128.
34. Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient
alignment of short DNA sequences to the human genome. Genome Biol 2009,
10(3):R25.
35. Hach F, Hormozdiari F, Alkan C, Hormozdiari F, Birol I, Eichler EE, Sahinalp
SC. mrsFAST: a cache- oblivious algorithm for short- read mapping. Nat
Methods 2010, 7(8):576–577.
36. Lunter G, Goodson M. Stampy: a statistical algorithm for sensitive and fast
mapping of Illumina sequence reads. Genome Res 2011, 21(6):936–939.
37. David M, Dzamba M, Lister D, Ilie L, Brudno M. SHRiMP2: sensitive yet prac-
tical SHort Read Mapping. Bioinformatics 2011, 27(7):1011–1012.
38. Li H. Aligning sequence reads, clone sequences and assembly contigs with
BWA-MEM. arXiv:13033997, 2013.
39. Sovic I, Sikic M, Wilm A, Fenlon SN, Chen S, Nagarajan N. Fast and sensitive
mapping of nanopore sequencing reads with GraphMap. Nat Commun 2016,
7:11307.
40. Chaisson MJ, Tesler G. Mapping single molecule sequencing reads using basic
local alignment with successive refinement (BLASR): application and theory.
BMC Bioinformatics 2012, 13:238.
41. Kielbasa SM, Wan R, Sato K, Horton P, Frith MC. Adaptive seeds tame gen-
omic sequence comparison. Genome Res 2011, 21(3):487–493.
42. Sedlazeck FJ, Rescheneder P, Smolka M, Fang H, Nattestad M, von Haeseler A,
Schatz MC. Accurate detection of complex structural variations using single-
molecule sequencing. Nat Methods 2018, 15(6):461–468.
43. Jain C, Rhie A, Zhang H, Chu C, Walenz BP, Koren S, Phillippy AM. Weighted
minimizer sampling improves long read mapping. Bioinformatics 2020,
36(Suppl_1):i111–i118.
44. Jain C, Rhie A, Hansen NF, Koren S, Phillippy AM. Long-read mapping to
repetitive reference sequences using Winnowmap2. Nat Methods 2022.
45. Zheng H, Kingsford C, Marcais G. Improved design and analysis of practical
minimizers. Bioinformatics 2020, 36(Suppl_1):i119–i127.
46. Shukla HG, Bawa PS, Srinivasan S. hg19KIndel: ethnicity normalized human
reference genome. BMC Genomics 2019, 20(1):459.
47. Chen NC, Solomon B, Mun T, Iyer S, Langmead B. Reference flow: reducing
reference bias using multiple population genomes. Genome Biol 2021, 22(1):8.
48. Vasimuddin M, Misra S, Li H, Aluru S. Efficient architecture-aware acceler-
ation of BWA-MEM for multicore systems. In: 2019 IEEE International Parallel
and Distributed Processing Symposium (IPDPS): 2019: IEEE; 2019: 314–324.
49. Hsi-Yang Fritz M, Leinonen R, Cochrane G, Birney E. Efficient storage of
high throughput DNA sequencing data using reference-based compression.
Genome Res 2011, 21(5):734–740.
50. Yuan Y, Norris C, Xu Y, Tsui KW, Ji Y, Liang H. BM-Map: an efficient software
package for accurately allocating multireads of RNA-sequencing data. BMC
Genomics 2012, 13 Suppl 8:S9.
51. Thorvaldsdottir H, Robinson JT, Mesirov JP. Integrative Genomics Viewer
(IGV): high-performance genomics data visualization and exploration. Brief
Bioinform 2013, 14(2):178–192.
The gap between our ability to pump out NGS data and our capability to
extract knowledge from these data is widening. To manage and process
the tsunami of NGS data for a deep understanding of biological systems,
significant investment in computational infrastructure and analytical power
is needed. How to gauge computing needs and build a system to meet those
needs, however, poses serious challenges to small research groups and even
large research organizations. To meet this unprecedented challenge, the NGS
field can borrow solutions from other “big data” fields such as high-energy
particle physics, climatology, and social media. For biologists without much
training in bioinformatics, while getting expert help is often necessary, having a good
understanding of the various aspects of NGS data management and analysis
is beneficial for years to come.
signal intensity files in formats such as scanned images or movies are on the
scale of TB from a single run (this amount is not counted in the data volume
mentioned above). As these raw signal files accumulate, they can easily
overwhelm most data storage systems. While these raw image files could be
retained long term, newer sequencing systems process them on the fly and
delete them by default once they have been analyzed, to alleviate the burden
of storing them. Oftentimes it is easier and more economical to rerun the
samples in case of data loss than to archive these huge raw signal files.
Due to the huge size of most NGS files, transferring them from one place
to another is non-trivial. For a small-sized project to transfer sequencing files
from a production server to a local storage space, download via FTP or HTTP
might be adequate if a fast network connection is available. As for network
speed, a 1 Gbps network is essential, while a 10 or 100 Gbps network offers
improved performance under high-traffic conditions. When the network speed
is slow or the amount of data to be transferred is too large, the use of external
hard drives might be the only option. When the data reach the lab, for fast
local file reading, writing, and processing, they need to be stored in a hard
drive array inside a dedicated workstation or server.
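The practical impact of network speed on moving NGS files can be estimated with simple arithmetic, as in the sketch below; the data size and the assumed fraction of nominal bandwidth actually achieved are illustrative, not measured values.

```python
def transfer_hours(data_size_gb, link_speed_gbps, efficiency=0.7):
    """Rough transfer time in hours for a given data size and link speed.
    'efficiency' is an assumed fraction of the nominal bandwidth achieved."""
    seconds = (data_size_gb * 8) / (link_speed_gbps * efficiency)
    return seconds / 3600

# Moving 2 TB (2,000 GB) of FASTQ/BAM files:
print(round(transfer_hours(2000, 1), 1))    # ~6.3 hours on a 1 Gbps link
print(round(transfer_hours(2000, 10), 1))   # ~0.6 hours on a 10 Gbps link
```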
For a production environment, such as an NGS core facility or a large genome
center, which generates NGS data for a large number of projects, enterprise-
level data storage system, such as DAS (Directly Attached Storage), SAN
(Storage Area Network), or NAS (Network Attached Storage), is required to
provide centralized data repositories with high reliability, access speed, and
security. To avoid accidental data loss, these data storage systems are usu-
ally backed up, mirrored, or synced to data servers distributed at separate
locations. For large-scale collaborative projects that involve multiple sites
and petabytes to exabytes of data, the processes of data transfer and sharing
pose more challenges, which prompts the development of high-capacity and
high-performance platforms such as Globus.
Data sharing among collaborating groups creates additional technical
issues beyond those dealt with by individual labs. A centralized data reposi-
tory might be preferred over simple data replication at multiple sites to foster
effective collaboration and timely discussion. Along with data sharing come
the issues of data access control and privacy for data generated from
patient-oriented studies. In a broader sense, sharing NGS data with the
entire life science community also increases the value of a research project.
For this reason, many journals enforce a data sharing policy that requires
deposition, before publication, of sequence read data and processed data into
a publicly accessible database (such as the NCBI’s Sequence Read Archive
[SRA] or the European Nucleotide Archive [ENA]). To facilitate data inter-
pretation and potential meta-analysis, relevant information about such an
experiment must also be deposited with the data. Some organizations, such
as the Functional Genomics Data Society, have developed guidelines on what
information should be deposited with the data. For example, the MINSEQE
Besides the number of CPU cores, the amount of memory a system has also
heavily affects its performance. Again memory needs depend on the number
and complexity of jobs to be processed, for example, read mapping to a small
genome may need only a few GB of memory while de novo assembly of a large
genome may require hundreds of GB or even TB-level memory. The current
estimation is that for each CPU core the amount of memory needed should
not be less than 3 GB. In an earlier implementation of de novo assembly of the
human genome using the SOAPdenovo pipeline (to be detailed in Chapter 12),
a standard supercomputer with 32 cores (eight AMD quad-core 2.3 GHz CPUs)
and 512 GB memory was used [1]. As a more recent example of the computing
power needed for de novo genome assembly, a server with 64 cores (eight
Intel Xeon X6550 8-core 2.00 GHz CPUs) and 2 TB RAM was used by a Swedish
team [2]. For de novo assembly of small genomes such as those of microbes, a
machine that contains at least 8 CPU cores, 256 GB of RAM, and a fast data
storage system can get a job completed in a reasonable time frame. By current
estimation, an 8-core workstation with 32 GB RAM and 10 TB of storage can suffice
for many projects that do not involve de novo genome assembly.
The amount of time needed to complete a job varies greatly with the com-
plexity of the job and accessible computing power. As a more concrete example,
running the deep learning-based WGS variant calling tool DeepVariant (see
Chapter 10 for details) needs 24–48 hours when using the minimum setting
of an 8-core computer with 16 GB RAM, but the processing time is reduced
by more than half when using a graphics processing unit (GPU) with 4 GB
dedicated video RAM and CUDA support for parallel computing [3]. To map
an RNA-seq dataset of 80 million 75-bp reads to the human genome using
Bowtie on a computer equipped with 32 cores and 128 GB RAM, it took less than
2 hours, with even less time needed for subsequent steps including normalization
and differential expression statistical tests [4]. In a small RNA-seq study, with a
32-core and 132 GB memory workstation, processing 20 multiplex barcoded
samples with a total of 160 million reads took a little over 2 hours for sample
de-multiplexing, and about the same amount of time for read mapping to the
host genome and small RNA annotation databases [5].
6.3 Cloud Computing
As clearly demonstrated above, NGS data storage, transfer, and sharing are
no trivial tasks. Another limitation of a locally built computing system is its
limited scalability. Because NGS technologies advance, and sequencing costs drop,
faster than computer hardware develops, the gap between NGS data generation
and our ability to handle and analyze the data will only widen. To narrow this
gap and speed up NGS data processing, the NGS community has embraced a trend
from local computing toward cloud computing.
TABLE 6.1
Current Providers of Cloud Computing That Can Be Used for NGS Data Analysis
(listing each provider and its URL)

While cloud computing offers convenient access to data and computing resources over the
Internet, the convenience also means the possibility of data security being
breached or compromised. Some heavy users may find cloud computing
not as cost effective as running a local server. While more and more tools
are becoming available in the cloud, users still need to use due diligence to
make sure that the tools they need are available. For users at locations that suffer
frequent network outages, cloud computing can be problematic, as all cloud-
based operations depend on Internet connectivity.
Despite the potential downsides, cloud computing has been proven to be
a viable approach for NGS data analysis. Table 6.1 is a list of some of the
current cloud computing providers that can be used for NGS applications.
To illustrate how cloud computing can be deployed for analyzing NGS data,
below is an example of conducting read alignment using the Amazon
Elastic Compute Cloud (EC2). As the first step, input data files
(FASTQ files and a reference genome file) are uploaded from a local com-
puter to a “bucket” in the Amazon Simple Storage Service (S3). This cloud
storage bucket, which is also used to hold program scripts and output files,
can be created with the Amazon Web Services (AWS) Management Console, a
unified interface to access all Amazon cloud resources. To initiate alignment,
a workflow must be defined first using the Console’s “create workflow”
function. To define the workflow, the input sequence read files, the aligner
script, and the saving location for alignment output files are specified. In the
meantime, the number of Amazon EC2 instances required for the job, which
determines memory and processor allocation, is also configured. After the
configuration the job is submitted through the Management Console. When
the instances are finished, alignment output files are deposited into the pre-
specified file location in the S3 cloud storage.
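As a minimal sketch of the first step above, the Python snippet below uses the boto3 AWS SDK to create an S3 bucket and upload input files; the bucket and file names are hypothetical placeholders, and a real workflow would also configure EC2 instances (or a managed workflow service) through the same SDK or the Management Console.

    import boto3

    # Connect to S3 using credentials already configured in the local AWS environment.
    s3 = boto3.client("s3")

    # Hypothetical bucket name; bucket creation outside us-east-1 also requires a
    # CreateBucketConfiguration argument specifying the region.
    bucket = "my-ngs-project-bucket"
    s3.create_bucket(Bucket=bucket)

    # Upload the sequence reads and reference genome into an "input/" prefix.
    for local_file in ["sample_R1.fastq.gz", "sample_R2.fastq.gz", "reference.fa"]:
        s3.upload_file(local_file, bucket, f"input/{local_file}")

    # After the cloud job finishes, alignment output can be retrieved, for example:
    # s3.download_file(bucket, "output/aligned.bam", "aligned.bam")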
6.4.1 Parallel Computing
Parallelization, a computation term that describes splitting of a task into a
number of independent subtasks, can significantly increase the processing
speed of highly parallelizable tasks, which include many NGS data analysis
steps. For example, although millions of reads are generated from a sequen-
cing run, mapping of these reads to a reference genome is a process that is
“embarrassingly parallel,” as each read is mapped independently to the ref-
erence. Because GPUs are built for highly parallel workloads (rendering each
pixel on a computer screen is itself a highly parallel process), the integration
of GPUs with CPUs in heterogeneous computing systems can increase throughput
10- to 100-fold and turn individual computers into
mini-supercomputers. While these systems can be applied to various aspects
of NGS data analysis, many NGS analytical tools have yet to take full advan-
tage of the power of parallel computing in such systems.
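To make the idea of an "embarrassingly parallel" step concrete, the Python sketch below distributes chunks of reads across worker processes with the standard multiprocessing module; map_chunk is a hypothetical stand-in for running an actual aligner on one chunk.

    from multiprocessing import Pool

    def map_chunk(chunk_path):
        # Hypothetical placeholder: in practice this would invoke an aligner
        # (e.g., via subprocess) on one chunk of reads and return the output path.
        return chunk_path.replace(".fastq", ".bam")

    if __name__ == "__main__":
        # Each chunk can be mapped independently of the others, so the chunks
        # are simply distributed across eight worker processes.
        chunks = [f"reads_chunk_{i}.fastq" for i in range(16)]
        with Pool(processes=8) as pool:
            bam_files = pool.map(map_chunk, chunks)
        print(bam_files)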
Parallelization is also an important factor in determining how an increase in
the number of CPU (or GPU) cores might affect actual NGS data processing
performance. If a step is highly parallelizable, and the algorithm designed for
it is implemented to exploit multiple cores, processing speed generally scales
with the number of cores available.
For bioinformaticians who deal with NGS data, on the other hand, the
following is a list of skills that are needed:
References
1. Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K
et al. De novo assembly of human genomes with massively parallel short read
sequencing. Genome Res 2010, 20(2):265–272.
2. Lampa S, Dahlo M, Olason PI, Hagberg J, Spjuth O. Lessons learned from
implementing a national infrastructure in Sweden for storage and analysis of
next-generation sequencing data. GigaScience 2013, 2(1):9.
Part III
Application-Specific
NGS Data Analysis
7
Transcriptomics by Bulk RNA-Seq
7.1 Principle of RNA-Seq
Transcriptomic analysis deals with the questions of which parts of the genome
are transcribed, and how actively they are transcribed. In the past, these questions
were mostly answered with microarrays, which are based on hybridization of RNA
samples to DNA probes that are specific to individual gene-coding regions.
With this hybridization-based approach, the repertoire of hybridization probes,
which are designed based on the current annotation of the genome, determines
what genes in the genome or which parts of the genome are analyzed, and
genomic regions that have no probe coverage are invisible. An NGS-based
approach, on the other hand, does not depend on the current annotation of the
genome. Because it relies on sequencing of the entire RNA population, hence
the term RNA-seq, this approach makes no assumption as to which parts of the
genome are transcribed. After sequencing, the generated reads are mapped to
the reference genome in order to search for their origin in the genome. The total
number of reads mapped to a particular genomic region represents the level of
transcriptional activity at the region. The more transcriptionally active a gen-
omic region is, the more copies of RNA transcripts it produces, and the more
reads it will generate. RNA-seq data analysis is essentially based on counting of
reads generated from different regions of the genome.
Because it counts reads from transcripts and is therefore digital
in nature, RNA-seq does not suffer from the signal saturation
observed with microarrays at very high expression levels. RNA-seq also offers a
native capability to differentiate alternative splicing variants, which is basic-
ally achieved by detecting reads that fall on different splice junctions. While
some specially designed microarrays, like the Affymetrix Exon Arrays, can
be used to analyze alternative splicing events, standard microarrays cannot
usually make distinctions between different splicing isoforms. Also different
from microarray signals, which are continuous, raw RNA-seq signals (i.e.,
read counts) are discrete. Because of this difference, the distribution models and
differential expression analysis methods designed for microarray data
cannot be directly applied to RNA-seq data without modification.
7.2 Experimental Design
7.2.1 Factorial Design
Before carrying out an RNA-seq experiment, the biological question to be
answered must be clear and well defined. This will guide experimental
design and subsequent experimental workflow from sample preparation
to data analysis. For experimental design, factorial design is usually used.
Many experiments compare transcriptomic profiles between two conditions,
e.g., cancer vs. normal cells. This is a straightforward design, involving
only one biological factor (i.e., cell type). Experiments involving a single
factor may also have more than two conditions, e.g., comparison of samples
collected from multiple tissues in the body in order to detect tissue-specific
gene expression.
If a second biological factor (e.g., treatment of a drug) is added to the
example of cancer vs. normal cell comparison, the experiment will have a
total of four (2×2) groups of samples (Table 7.1). In this two-factor design,
besides detecting the effects of each individual factor, i.e., cell type and
drug treatment, respectively, the interacting effects between the two
factors are also detected, e.g., drug treatment may have a larger effect on
cancer cells than normal cells. If the factors contain more conditions, there
will be a total of m×n groups of samples, with m and n representing the
total number of conditions for each factor. Experiments involving more
than two factors, such as adding a time factor to the above example to
detect time-dependent drug effects on the two cell types, are inherently
more complex and therefore more challenging to interpret, because in this
circumstance it is not easy to attribute a particular gene expression change
to a certain factor, or especially, to the interaction of these factors due to
the existence of multiple interactions (three factors involve four different
types of interactions).
TABLE 7.1
Experimental Design Involving Two Biological Factors

                          Cancer Cells                Normal Cells
With drug treatment       Drug-treated cancer cells   Drug-treated normal cells
Without drug treatment    Untreated cancer cells      Untreated normal cells

It should also be noted that many RNA extraction protocols do not retain small RNA
species including miRNAs. If these species are also of interest (more on small
RNA sequencing in Chapter 9), alternative protocols (such as the TRIzol
method) need to be used.
Besides quality, the quantity of RNA sample available also determines the
RNA-seq library prep strategy. With enzyme engineering and the resulting
improvement in library construction chemistry, preparing sequencing
libraries from increasingly small quantities of RNA is no longer a barrier.
This usually involves amplification in order to produce enough library
molecules for sequencing. While the needed amplification can introduce bias
to the process, its impact on the detection of differential gene expression has
been found to be limited [2]. The greatly increased sensitivity in RNA-seq
library making has also made sequencing of transcripts from a single cell a
reality (see next chapter for single-cell RNA-seq).
There are two general approaches to constructing RNA-seq sequencing
libraries. One approach is based on direct enrichment of mRNA molecules,
the major detection targets for the majority of RNA-seq work. Because
most eukaryotic mRNAs have a poly-A tail (Chapter 3, Section 3.3.4), this
approach is carried out through the use of poly-T capture probes to enrich for
mRNA molecules carrying such a tail. The other approach is based on deple-
tion of ribosomal RNAs (rRNAs), since rRNAs are usually the predominant
but uninformative component in total RNA extractions. Depletion of rRNAs
is typically based on hybridization using rRNA-specific probes, followed
by their capture and subsequent removal. Other rRNA depletion strat-
egies include degradation by duplex-specific nuclease (DSN), which relies
on denaturation-reassociation kinetics to remove extremely abundant RNA
species including rRNAs [3], and RNase H selective depletion, on the basis of
binding rRNAs with rRNA-specific DNA probes and then using RNase H to
digest bound rRNAs. Library prep based on the rRNA depletion approach is
more tolerant of RNA degradation issues.
After mRNA enrichment or rRNA depletion, subsequent RNA sequencing
library preparatory process typically involves reverse transcription to cDNA
using random primers, followed by fragmentation (for short-read platforms)
and attachment of sequencing adapters. This sequencing library construction
process may also introduce bias to the subsequent sequencing and data gen-
eration. For example, the use of poly-T-based mRNA enrichment introduces
3’ end bias, and this procedure also precludes analysis of those mRNAs and other
non-coding RNAs that do not have the poly-A tail structure [4]. If these RNA
species are of interest, a library prep process based on the use of rRNA deple-
tion can be employed.
Compared to short-read sequencing, long-read sequencing offers new
RNA-seq capabilities and options that are impossible or difficult to per-
form using short-read sequencing. For example, both ONT and PacBio
platforms provide full-length transcript sequencing, and thereby enable
direct characterization of transcript isoforms without assembly.
7.2.4 Sequencing Strategy
Sequencing depth and read length are two major factors to consider when
sequencing bulk RNA-seq libraries especially with short-read sequencing.
The factor of sequencing depth, that is, how many reads to obtain, is based
on a number of factors, mainly the size of the organism’s genome, the pur-
pose of the study, and ultimately statistical rigor (effect size and statistical
power). Small genomes, such as those of bacteria, require fewer reads to ana-
lyze than large genomes, such as those of mammalian species. If the pur-
pose of a study is to identify differentially expressed genes among those
expressed at intermediate to high abundance levels, it requires fewer reads
than studies that aim to encompass low-abundance genes, study alterna-
tive splicing events, or discover new transcripts. As a general guideline,
for a gene expression profiling experiment targeting intermediate- to high-
abundance transcripts, a sequencing depth of 5–25 million reads, depending
on the size of the genome, is suggested. To cover transcripts of lower abundance
or common alternative splicing variants, 20–50 million reads are suggested.
For more thorough coverage of the transcriptome and/or discovery of new
transcripts, 100–300 million reads are often needed. As for sequencing read
length, for gene expression profiling, single-end 50–75 bp reads are typically
long enough to map to their originating genes in the genome. For assembly of
new transcripts and/or identification of alternative splicing isoforms, longer
and often paired-end sequencing reads, such as paired-end 150 bp reads, are
often acquired.
The number of sample replicates also largely affects the detection power of
an RNA-seq study, as sample replication provides an estimate of gene expres-
sion dispersion across biological subjects within a group. While a minimum
of three biological replicates is commonly used, specially designed RNA-seq
power analysis tools can be used to calculate the sample size needed to achieve a
desired detection power. These tools, including Scotty [5], ssizeRNA [6], PROPER [7],
and RnaSeqSampleSize [8], are designed based on different statistical models.
For example, while PROPER and RnaSeqSampleSize are based on the negative
binomial model, Scotty and ssizeRNA use Poisson-lognormal and linear
models, respectively. As covered later in this chapter (see the section
“Identification of Differentially Expressed Genes”), these models pro-
vide different approximations to RNA-seq data distribution. Sample size
calculations using these tools require as input a number of parameters,
including the total number of genes expressed, the percentage of genes
expected to be differentially expressed, the minimal fold change needed to
call differential expression, false discovery rate, average read count (related
to sequencing depth), and the desired statistical power. Because most of the
parameters are not known a priori, recently published data collected from
similar conditions may be used to provide some guidance. To start on a
species or cell type that has not yet been studied, it might be useful to try out
a small number of samples first to get a general idea on the composition of the
target transcriptome and the variability between biological replicates. Besides
detection power, experimental and sequencing costs are also key factors in
deciding sample size and sequencing depth. For projects on a budget, it has
been reported that increasing the number of biological replicates is more
effective in boosting detection power than increasing sequencing depth [9].
Besides sequencing depth and read length, other considerations when
planning for sequencing samples include how to arrange samples on a
sequencer in terms of flow cell or lane assignment. Here a balanced block
design [10] should be used to minimize technical variation due to flow
cell-to-flow cell or lane-to-lane differences. In such a design, samples from
different conditions are multiplexed on the same flow cell(s) or lanes,
instead of running different samples or conditions on separate flow cell(s)
or lanes.
To address this challenge, two approaches have been developed. One is to use the
current gene exonic annotation in the reference genome to build a database of
reference transcript sequences that join currently annotated exons. RNA-seq
reads are then searched against this reference transcript database using
ungapped read aligners. Examples of annotation-guided mappers are RUM
[12] and SpliceSeq [13]. These mappers may produce better outcomes when
high accuracy and reliability are emphasized. This approach, however, does
not provide the capability to discover novel transcripts. In addition, it leads
to a high rate of multi-mapping, as a read that maps to common exon(s) shared
by multiple splicing isoforms of a gene is counted multiple times.
The other approach conducts ab initio splice junction detection, and there-
fore does not depend on genome annotation. Depending on their method-
ology, ab initio spliced mappers can be classified into two categories: methods
using “exon-first” and those using “seed-and-extend.” The exon-first methods
include TopHat/TopHat2 [14, 15], MapSplice [16], SpliceMap [17], and GEM
[18]. They first align reads to a reference genome to identify unspliced con-
tinuous reads (i.e., exonic reads first), and then predict splice junctions
out of the initially unmapped reads based on the initial mapping results.
Taking TopHat/TopHat2 as an example, they first use Bowtie/Bowtie2 to
align reads to the reference genome. Reads that map to the reference con-
tinuously without interruption are then clustered based on their mapping
position. The clusters, supposedly representing exonic regions, are used to
search for splicing junctions from the remaining reads. The seed-and-extend
methods, on the other hand, use part of reads as substrings (or k-mers) to
initiate the mapping process, followed by extension of candidate hits to
locate splicing sites. Examples of methods in this category include STAR [19],
HISAT/HISAT2 [20, 21], and GMAP/GSNAP [22]. Among these methods,
STAR employs a two-step process for gapped alignment. The first is a seed
searching step, aiming to sequentially locate substrings of maximum length
from a read that each matches exactly to one or more substrings in the refer-
ence genome. If this step does not reach the end of the read due to the presence
of mismatches, it will use the located seed region(s) as anchors to extend
the alignment. In the second step, alignment of the entire read sequence is
built by joining all the seed regions located in the first step. HISAT applies an
algorithm called hierarchical indexing to achieve splice-aware alignment. It
starts with a global search using FM indexing of the whole reference genome
to identify the genomic location(s) of a read using part of its sequence as
seed. Such a location is then used as an anchor to extend the alignment. Once
the alignment cannot be extended further, e.g., reaching a splicing junction,
a local search is then performed using FM indexing of the local region to
map the rest of the read (Figure 7.1). A hybrid strategy combining the two
is also used sometimes, with the exon-first approach employed for mapping
unspliced reads and the seed-and-extend approach for spliced reads. As they
do not rely on current genomic annotations, these ab initio methods are suit-
able to identify new splicing events and variants.
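As a hedged illustration of running one of the spliced aligners named above, the following Python snippet invokes HISAT2 through subprocess; the index prefix, read files, and thread count are hypothetical placeholders, and STAR or another mapper could be called in the same way with its own options.

    import subprocess

    # Hypothetical file names; replace with the actual index prefix and read files.
    index_prefix = "grch38_index"
    r1, r2 = "sample_R1.fastq.gz", "sample_R2.fastq.gz"

    # Splice-aware alignment of paired-end reads with HISAT2, writing SAM output.
    subprocess.run(
        ["hisat2",
         "-p", "8",              # number of alignment threads
         "-x", index_prefix,     # genome index built with hisat2-build
         "-1", r1, "-2", r2,     # paired-end read files
         "-S", "sample.sam"],    # SAM output file
        check=True,
    )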
FIGURE 7.1
Mapping of RNA-seq reads with HISAT. Three representative reads are shown on the top, i.e.,
one exonic read (1), one read spanning a splice junction with a short anchor in one exon (2), and
another junction-spanning read with long anchor in each exon (3). Panel (a) shows alignment
of Read 1 with a global FM genome index search using partial read sequence, followed by an
extension step to align the rest of the read. Panel (b) shows that when the global search and
extension are halted at the junction, a local search using the region’s FM index is performed to
align the remaining short sequence. Panel (c) shows that to align Read 3, a second extension
step is conducted after the local FM index search. The shown exemplary reads are error-free
and 100 bases in length. (Adapted by permission from Springer Nature Customer Service
Centre GmbH: Springer Nature, Nature Methods, HISAT: a fast spliced aligner with low memory
requirements, Kim, D., Langmead, B. & Salzberg, S., Copyright 2015.)
To map long RNA-seq reads generated on the PacBio and ONT platforms,
some of the short-read mappers introduced above can still be used, such
as GMAP [23]. More commonly, however, this task is performed using tools
specially designed for long reads. For example, minimap2 (introduced in
Chapter 5) has a splice-aware option for mapping long RNA sequencing
reads. Other currently available tools include deSALT [24], GraphMap2 [25],
and uLTRA [26].
The percentage of reads that are mapped to the genome is an important
QC parameter. While it is variable depending on a number of factors such
as aligning method and species, this number usually falls within the range
of 70–90%. The percentage of reads that map to rRNA regions reflects
the efficiency of the rRNA depletion step. Due to tech-
nical and biological reasons, it is usually impossible to remove all rRNA
molecules. The percentage of rRNA reads can vary greatly, from 1–2% to
35% or more. For downstream analysis rRNA reads are filtered out so they
do not usually affect subsequent normalization. Duplicate reads, a common
occurrence in an RNA-seq experiment, can be caused by biological factors,
such as over-representation of a small number of highly expressed genes, and/
or technical reasons, such as PCR over-amplification. It is possible to have
a high percentage of duplicate reads, e.g., 40–60%, in a run. While it is still
debatable as to how to treat duplicate reads, because of the biological factors
involved in their formation they should not be simply removed. Some experi-
mental approaches, such as removing some of the highly expressed genes
prior to library construction, or using paired-end reads, can help reduce the
number of duplicate reads. With regard to genomic coverage, RNA-seq QC
tools, including RNA-SeQC 2 [27], RSeQC [28], and QoRTs [29], report on
the percentage of reads that are intragenic, that is, those that map within
genes (including exons or introns), or intergenic, for those that map to gen-
omic space between genes. These tools also report other data quality metrics,
including percentage of total aligned reads, percentage of rRNA reads, as
well as rate of duplicate reads.
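To show how some of these metrics can be computed directly from an alignment file, the short Python sketch below uses the pysam library to tally mapped and duplicate reads in a sorted, indexed BAM file; the file name is a hypothetical placeholder, and the dedicated QC tools named above report a much richer set of metrics.

    import pysam

    # Hypothetical input; any coordinate-sorted and indexed BAM file will work.
    bam = pysam.AlignmentFile("sample.sorted.bam", "rb")

    total = mapped = duplicates = 0
    for read in bam.fetch(until_eof=True):
        # Skip secondary and supplementary records so each read is counted once.
        if read.is_secondary or read.is_supplementary:
            continue
        total += 1
        if not read.is_unmapped:
            mapped += 1
        if read.is_duplicate:      # reported only if duplicates were marked upstream
            duplicates += 1

    print(f"Mapping rate: {mapped / total:.1%}")
    print(f"Duplicate rate: {duplicates / total:.1%}")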
If the species under study does not have a sequenced reference genome
against which to map RNA-seq reads, two approaches exist. One is to map the
reads to a related species that has a reference genome, while the alternative is
to assemble the target transcriptome de novo. The de novo assembly approach
is more computationally intensive, but it does not rely on reference genomic
sequence. Currently available de novo transcriptome assemblers include
rnaSPAdes [30], Trinity [31], Bridger [32], Trans-ABySS [33], SOAPdenovo-
Trans [34], Oases [35], and StringTie/StringTie2 [36, 37]. Among these
assemblers, rnaSPAdes and StringTie2 can be used with long RNA reads,
or a hybrid of long and short reads [38, 39]. These de novo assemblers are
best suited when no related species, or only a very distantly related species, with
a reference genome exists, or when the target genome, despite having an available
reference sequence, is heavily fragmented or altered (such as in tumor cells).
It should also be noted that if a related reference genome exists with 85%
or higher sequence similarity with the species under study, mapping to the
related genome may work equally well, or even better, compared to the
de novo assembly approach. This is especially true when studying alterna-
tive splicing variants. These de novo approaches are also applicable to cases
where aberrant transcripts are expected, or novel transcripts are the detec-
tion targets. For these de novo assembly approaches, paired-end sequencing
or long-read sequencing is more advantageous than single-end
short reads.
7.3.2 Quantification of Reads
After mapping of reads, the number of reads mapped to each gene/tran-
script needs to be counted to generate a table with rows representing genes/
transcripts and columns representing samples. Such an expression matrix is the
basis for subsequent differential expression determination. This read quan-
tification process can be performed using tools such as featureCounts as part
of Subread [40], htseq-count as distributed with the HTSeq Python frame-
work [41], RSEM [42], eXpress [43], or Cufflinks [44]. Among these tools,
featureCounts and htseq-count are read count-based, requiring as input
SAM/BAM alignment files and a genomic feature annotation GFF/GTF file.
They generally discard reads that map to multiple regions in the genome
or overlap multiple genomic features. RSEM, eXpress, and Cufflinks, on the
other hand, are model-based, requiring as input SAM/BAM alignments as
well as a transcriptome reference file containing transcript sequences. They
assign multi-mapped or ambiguous reads to transcripts based on prob-
ability from the use of the expectation-maximization algorithm. Because
of the differences in how they quantify genes/transcripts, the selection
of counting methods has been shown to have an effect on quantification
results [45].
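As one hedged example of count-based quantification, the Python sketch below runs featureCounts (from the Subread package) on a set of BAM files and loads the resulting table with pandas; the file names are hypothetical, exact options vary between Subread versions, and htseq-count could be driven in a similar way.

    import subprocess
    import pandas as pd

    bams = ["ctrl_1.bam", "ctrl_2.bam", "treat_1.bam", "treat_2.bam"]  # hypothetical

    # Count reads per gene using a GTF annotation; -p indicates paired-end data.
    subprocess.run(
        ["featureCounts", "-p", "-T", "8",
         "-a", "annotation.gtf",
         "-o", "gene_counts.txt"] + bams,
        check=True,
    )

    # featureCounts writes a commented header line, then a table whose final
    # columns hold the per-sample counts.
    table = pd.read_csv("gene_counts.txt", sep="\t", comment="#", index_col=0)
    count_matrix = table[bams]
    print(count_matrix.head())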
Gene/ transcript expression can also be quantified without mapping
reads to a reference genome or transcriptome. Examples of such mapping-
independent algorithms include kallisto [46], Salmon [47], and Sailfish [48].
These methods rely on pseudo- or quasi-alignment of k-mers extracted
from reads, instead of the entire reads, to achieve transcript quantifica-
tion at vastly faster speed. For example, kallisto is a pseudoaligner that
aligns k-mers in reads to a hash table built from k-mers that represent
different transcripts in the reference transcriptome. Although not relying
on mapping of entire reads, this method enables rapid determination of the
compatibility of reads with target transcripts, as it preserves the key infor-
mation needed for transcript quantification. Since only k-mers need to be
aligned to the hash table, the speed is greatly increased with similar quan-
tification performance to the mapping-based methods above. They perform
well on highly expressed protein-coding genes but less so on rare or short
transcripts [49].
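As a brief sketch of the alignment-free route, the commands below (wrapped in Python subprocess calls) build a kallisto index from a reference transcriptome and quantify one paired-end sample; the file names and bootstrap count are hypothetical choices, and Salmon offers a comparable interface.

    import subprocess

    # Build the k-mer index from reference transcript sequences (hypothetical FASTA).
    subprocess.run(
        ["kallisto", "index", "-i", "transcripts.idx", "transcripts.fa"],
        check=True,
    )

    # Quantify transcript abundances for a paired-end sample, with bootstraps
    # to estimate quantification uncertainty.
    subprocess.run(
        ["kallisto", "quant",
         "-i", "transcripts.idx",
         "-o", "sample_quant",
         "-b", "100",
         "sample_R1.fastq.gz", "sample_R2.fastq.gz"],
        check=True,
    )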
7.3.3 Normalization
As mentioned previously, the basic principle of determining gene expres-
sion levels through RNA-seq is that the more actively a gene is transcribed,
the more reads we should be able to observe from it. To apply this basic prin-
ciple to gene expression quantification and cross-condition comparison, at
least two factors must be taken into consideration. The first is sequencing
depth. If a sample is split into two halves, and one half is sequenced to a
depth that is twice that of the other, for the same gene the former will
generate twice as many reads as the latter although both are from the same
sample. The other factor is the length of gene transcript. If one gene tran-
script is twice the length of another gene transcript, the longer transcript
will also produce twice as many reads as the shorter one. Because of these
confounding factors, prior to comparing abundance of reads from different
genes across samples in different conditions, the number of reads for each
gene needs to be normalized against both factors using the following formula
to ensure different samples and genes can be directly compared,
e_{i,j} = (g_{i,j} × SF) / (a_i × l_j)
where e_{i,j} is the normalized expression level of gene j in sample i, g_{i,j} is the
number of reads mapped to the gene in the same sample, a_i is the total
number of mapped reads (depth) in sample i, and l_j is the length of gene j in bases. SF
is a scaling factor that equals 10^9 when e_{i,j} is presented as RPKM or FPKM
(Reads, or Fragments [for paired-end reads], per Kilobase of transcript per
Million mapped reads).
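The formula above translates directly into a few lines of code. The Python sketch below computes RPKM/FPKM values from a raw count matrix and a vector of gene lengths with pandas; the small input objects are hypothetical and would normally come from the read counting step described earlier.

    import pandas as pd

    # Hypothetical inputs: raw counts (genes x samples) and gene lengths in bases.
    counts = pd.DataFrame({"sample1": [500, 10, 2000], "sample2": [450, 25, 1800]},
                          index=["geneA", "geneB", "geneC"])
    gene_length = pd.Series([2000, 800, 5000], index=counts.index)

    SF = 1e9  # scaling factor for RPKM/FPKM

    # e_ij = (g_ij x SF) / (a_i x l_j): scale, divide by each sample's total
    # mapped reads, then divide by gene length.
    rpkm = counts.mul(SF).div(counts.sum(axis=0), axis=1).div(gene_length, axis=0)
    print(rpkm.round(1))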
The calculation of RPKM or FPKM is the simplest form of RNA-seq data
normalization. In a nutshell, normalization deals with unintended factors
and/or technical bias, such as those that lead to unwanted variation in total
read counts in different samples. By correcting for the unwanted effects of
these factors or bias, the normalization process puts the focus on the biological
differences of interest, and makes samples comparable. Since the introduction
of RPKM or FPKM as an early normalization approach for RNA-seq data,
other methods of normalization have also been developed. Some of these
methods employ a similar strategy to adjust for sequencing depth. This group
of methods normalizes RNA-seq data by dividing gene read counts by
one of the following: (1) the total number of mapped reads (the total count approach);
(2) the total read count in the upper quartile (the upper quartile approach)
[50]; or (3) the median read count (the median approach). Another method
called quantile normalization sorts gene read count levels and adjusts quan-
tile means to be equal across all samples, so that all samples have the
same empirical distribution [51]. These sequencing depth-based methods do
not normalize against gene length, as it is not needed if the goal is to detect
relative expression changes of same genes between groups, rather than com-
pare relative abundance levels of different genes in the same samples.
Subsequently, more sophisticated normalization approaches have been developed
based on the assumption that the majority of genes are not differentially
expressed, and that for those that show differential expression, the proportions of
up- and down-regulation are about equal. These approaches include those that
are employed by two commonly used RNA-seq analysis tools, DESeq2 [52]
and edgeR [53]. DESeq2 employs a method called relative log expression, or
RLE, which is essentially carried out through dividing the read count of each
gene in each sample by a scaling factor. To compute the scaling factor for each
sample, the ratio of each gene’s read count in a sample over its geometric
mean across all samples is first calculated. After calculating this ratio for all
genes in the sample, the median of these ratios is used as the scaling factor. The
edgeR package employs an approach called Trimmed Mean of M-values, or
TMM [54]. In this approach, one sample is used as the reference and the others
as test samples. The TMM is computed as the weighted mean of gene count log
ratios between a test sample and the reference, after excluding the genes with the
most extreme expression levels and log ratios. Based on the
assumption of no differential expression in the majority of genes, the TMM-derived
scaling factors should be 1 (or very close to 1). If not, a scaling factor is applied
to each sample to bring it to the target value of 1. Multiplying
the scaling factor with the total number of mapped reads generates effective
library size. The normalization is then carried out by dividing the raw
read count by the effective library size, i.e., normalized read count = raw
read count / (scaling factor × total number of mapped reads).
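To make the scaling-factor idea concrete, the Python sketch below computes median-of-ratios size factors in the spirit of the RLE method used by DESeq2 (the actual implementation is in R and includes further refinements); the small count matrix is hypothetical.

    import numpy as np
    import pandas as pd

    # Hypothetical raw count matrix (genes x samples).
    counts = pd.DataFrame({"s1": [100, 50, 300, 0], "s2": [200, 110, 620, 0],
                           "s3": [90, 45, 310, 0]},
                          index=["geneA", "geneB", "geneC", "geneD"])

    # Use only genes with nonzero counts in every sample for the reference.
    usable = (counts > 0).all(axis=1)
    log_counts = np.log(counts.loc[usable])

    # Per-gene geometric mean across samples (on the log scale, the row mean).
    log_geomean = log_counts.mean(axis=1)

    # Size factor per sample: median ratio of its counts to the geometric means.
    log_ratios = log_counts.sub(log_geomean, axis=0)
    size_factors = np.exp(log_ratios.median(axis=0))

    # Normalized counts: divide each sample's counts by its size factor.
    normalized = counts.div(size_factors, axis=1)
    print(size_factors.round(3))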
Among other approaches are those that use iterative processes to achieve
normalization, as exemplified by TbT [55], DEGES [56], and PoissonSeq [57].
Based on the same assumption that there is no overall differential expres-
sion, these methods use a multi-iteration process. For example, DEGES,
or the Differentially Expressed Genes Elimination Strategy, repeatedly
removes potentially differentially expressed genes
prior to calculating the final normalization factor. It starts with using any
of the normalization methods introduced above, e.g., TMM, followed by a
test for differential expression using a differential detection method (to be
introduced next). After removal of the DE genes, the same process is repeated
until convergence.
There are also normalization methods that use a list of housekeeping genes
or spike-in controls as normalization standard. The use of housekeeping
genes or spike-in controls is for conditions in which the assumption that
the majority of genes are not differentially expressed might be violated. In
this approach, a set of constitutively expressed housekeeping genes that are
known to stay unchanged in expression under the study conditions, or a
panel of artificial spike-in controls that mimic natural mRNA and are added
to biological samples at known concentrations, is used as the basis against
which other genes are normalized.
FIGURE 7.2
Removal of batch effects using ComBat-seq. PCA plots are shown before (A) and after (B) batch
effects correction. In this example, three batches of breast cancer tissue samples that overexpress
three genes (HER2, EGFR, and KRAS) separately with their controls expressing GFP are shown.
The correction effectively removes batch effects on the control samples from the three batches,
while maintaining the effects of the transgenes. (Adapted from Y Zhang, G Parmigiani, ComBat-
seq: batch effect adjustment for RNA-seq count data, NAR Genomics and Bioinformatics 2020,
2(3):lqaa078.)
edgeR [53]. While DEGseq has been developed based on the Poisson distri-
bution, baySeq, Cuffdiff 2, DESeq2, EBSeq, and edgeR have been designed based
on the negative binomial distribution. To detect DE genes, these packages
use different approaches. For example, baySeq and EBSeq employ empirical
Bayes approaches.
FIGURE 7.3
The overdispersion problem in RNA-seq data. Poisson distribution is often used to model
RNA-seq data, but instead of the variance/dispersion being approximately equal to the mean
as assumed by the distribution, the variance in RNA-seq data is often larger than the mean
and increases with it.
The purple line represents the relationship between variance and mean based on the Poisson
distribution, while the solid and dashed orange lines represent local regressions used by DESeq
and edgeR, respectively, based on negative binomial distribution. (Modified from Anders
S. and Huber W. (2010) Differential expression analysis for sequence count data. Genome
Biology, 11, R106. Used under the terms of the Creative Commons Attribution License [http://
creativecommons.org/licenses/by/2.0] © 2010 Anders et al.)
TABLE 7.2
Tools for Detection of DE Genes

- DESeq2 [52]: Employs negative binomial generalized linear modeling and the Wald test to detect DE genes. Uses empirical Bayes estimation to shrink dispersions and fold changes to increase detection stability.
- edgeR [53]: Detects DE genes based on the negative binomial distribution, using techniques including the exact test, generalized linear modeling, the quasi-likelihood F-test, and empirical Bayes.
- limma [65]: Fits a gene-based linear model for DE analysis. Originally developed for microarray gene expression data, it uses the voom function for RNA-seq data to apply empirical Bayes estimation of gene-wise variability.
- NOISeq [72]: A nonparametric and data-adaptive method that detects DE genes based on simultaneous comparison of fold change and absolute expression difference.
(continued)
In general, the methods introduced above generate similar results for well-powered
studies, and those based on the non-parametric approach are often equally effective [73, 74].
Most of the currently available methods are designed to handle samples
with biological replicates. Under non-ideal circumstances when RNA-seq is
performed without replication, it becomes impossible to estimate biological
variability for a satisfactory statistical analysis. For such cases, the only indi-
cator of differential expression is fold change. Some of the above tools can still
handle such data through alternative means of estimating variability: edgeR
offers an option to manually input a dispersion value estimated from similar
studies, and NOISeq generates technical replicates through simulation of the
data assuming a multinomial distribution.
There are also tools especially designed for RNA-seq
experiments without replication, including GFOLD [75] and ACDtool [76].
ACDtool is an implementation of the method originally proposed by Audic
and Claverie developed on the basis of Poisson distribution [77]. Although
originally designed for analyzing relatively small data sets (<10 K reads),
the A-C statistic, as implemented in ACDtool, is equally sensi-
tive and applicable to the much larger NGS data sets that contain millions of
reads without replicates.
7.3.7 Gene Clustering
Genes showing differential expression patterns between conditions can be
grouped into different clusters based on their overall expression pattern.
This unsupervised process helps uncover different patterns of overall gene
expression changes, and thereby serves as an important exploratory step to
find key target genes for further investigation and hypothesis generation.
Among the most widely used clustering algorithms are hierarchical and k-
means clustering. Hierarchical clustering aims to build a dendrogram based
on similarity of expression between genes. This clustering method can take
either a “bottom-up,” also called agglomerative, approach, in which each gene
starts in its own cluster and clusters are then recursively merged until only one
cluster remains, or a “top-down” (divisive) approach that works in the reverse
direction. For RNA-seq data, the agglomerative approach is used more com-
monly. Besides clustering genes, samples are often clustered at the same time
to uncover relationships among them (Figure 7.4). With k-means clustering,
the number of clusters, i.e., the k value, needs to be defined a priori. Performing
hierarchical clustering first can help assess what k value to use. The objective
of k-means clustering is to assign each gene to the nearest cluster mean. For
both hierarchical and k-means clustering, an often-used similarity measure is
the Pearson correlation coefficient. Besides these two commonly used clustering
algorithms, other clustering methods include Self-Organizing Map (SOM)
and Partitioning Around Medoids (PAM). These clustering processes can be
performed using functions in R such as “hclust” for hierarchical clustering
and “kmeans” for k-means clustering.
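The text above points to the R functions “hclust” and “kmeans”; an equivalent analysis can be sketched in Python, as below, using a correlation-based distance for hierarchical clustering and running k-means on the same (hypothetical) expression matrix.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform
    from sklearn.cluster import KMeans

    # Hypothetical normalized expression matrix: rows are DE genes, columns samples.
    rng = np.random.default_rng(0)
    expr = rng.normal(size=(100, 6))

    # Hierarchical (agglomerative) clustering with a Pearson-correlation distance.
    corr_dist = 1 - np.corrcoef(expr)                 # gene-by-gene distance matrix
    tree = linkage(squareform(corr_dist, checks=False), method="average")
    hier_labels = fcluster(tree, t=4, criterion="maxclust")

    # k-means clustering, with k chosen, for example, after inspecting the dendrogram.
    km_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(expr)

    print(np.bincount(hier_labels), np.bincount(km_labels))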
FIGURE 7.4
Hierarchical clustering of DE genes as well as experimental samples. RNA-seq data shown here
is collected from cultured fibroblasts that were subjected to two treatment conditions (irradiation
[IR] and TGF-β1 [TGF-β], in comparison with control [CTR]). (From Mellone M, Hanley CJ,
Thirdborough S, Mellows T, Garcia E, Woo J, Tod J, Frampton S, Jenei V, Moutasim KA, Kabir TD,
Brennan PA, Venturi G, et al. (2016) Induction of fibroblast senescence generates a non-fibrogenic
myofibroblast phenotype that differentially impacts on cancer prognosis. Aging (Albany NY).
159:114–132. Used under the terms of the Creative Commons Attribution License (CC BY 3.0)
[https://creativecommons.org/licenses/by/3.0/] © 2016 Mellone et al.)
With statistical and machine learning techniques, RNA-seq profiles have been used
to stratify breast cancer into different subtypes [104]. As another example, RNA sequencing
has significantly improved diagnostic outcome for hereditary cancer by
resolving uncertainties from DNA genetic sequencing alone [105].
Besides interrogating currently catalogued genes, RNA-seq, being an
unbiased approach, is also a powerful technology for discovering novel
transcripts, splicing events, and other transcription-related phenomena.
RNA-seq studies of the transcriptional landscape of the genome have found
that besides protein-coding regions, the majority of the genome produces
RNA transcripts. The finding that 75% of the human genome is transcribed
(see Chapter 3), made with extensive use of RNA-seq, shows the power of
this technology in discovering currently unknown transcripts. RNA-seq has
also been used to discover novel alternative splicing isoforms. For example,
the discovery of circular RNAs (also see Chapter 3), which are formed as a
result of non-canonical RNA splicing, is also due to the application of RNA-
seq [106]. RNA-seq has also been applied to uncover other transcription-
related phenomena, such as gene fusion. Gene fusion is caused by genomic
rearrangement and is a common occurrence under certain conditions such as
cancer. Because RNA-seq has the capability to locate transcripts generated
from a fused gene, detection of gene fusion events has been greatly
facilitated by this powerful technology [107, 108]. With rapid technological
developments continuing to advance RNA sequencing as an even more powerful dis-
covery tool, RNA-seq has now entered a new era: single-cell RNA-seq,
which is covered next.
References
1. Li J, Fu C, Speed TP, Wang W, Symmans WF. Accurate RNA sequencing from
formalin-fixed cancer tissue to represent high-quality transcriptome from
frozen tissue. JCO Precis Oncol 2018, 2:PO.17.00091.
2. Parekh S, Ziegenhain C, Vieth B, Enard W, Hellmann I. The impact of amplifi-
cation on differential expression analyses by RNA-seq. Sci Rep 2016, 6:25533.
3. Zhulidov PA, Bogdanova EA, Shcheglov AS, Vagner LL, Khaspekov GL,
Kozhemyako VB, Matz MV, Meleshkevitch E, Moroz LL, Lukyanov SA et al.
Simple cDNA normalization using kamchatka crab duplex-specific nuclease.
Nucleic Acids Res 2004, 32(3):e37.
4. Yang L, Duff MO, Graveley BR, Carmichael GG, Chen LL. Genomewide char-
acterization of non-polyadenylated RNAs. Genome Biol 2011, 12(2):R16.
5. Busby MA, Stewart C, Miller CA, Grzeda KR, Marth GT. Scotty: a web tool
for designing RNA-Seq experiments to measure differential gene expression.
Bioinformatics 2013, 29(5):656–657.
6. Bi R, Liu P. Sample size calculation while controlling false discovery rate for
differential expression analysis with RNA- sequencing experiments. BMC
Bioinformatics 2016, 17:146.
62. Zhang Y, Parmigiani G, Johnson WE. ComBat-seq: batch effect adjustment for
RNA-seq count data. NAR Genom Bioinform 2020, 2(3):lqaa078.
63. Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y. RNA- seq: an
assessment of technical reproducibility and comparison with gene expression
arrays. Genome Res 2008, 18(9):1509–1517.
64. Anders S, Huber W. Differential expression analysis for sequence count data.
Genome Biol 2010, 11(10):R106.
65. Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK. limma powers
differential expression analyses for RNA-sequencing and microarray studies.
Nucleic Acids Res 2015, 43(7):e47.
66. Frazee AC, Pertea G, Jaffe AE, Langmead B, Salzberg SL, Leek JT. Ballgown
bridges the gap between transcriptome assembly and expression analysis. Nat
Biotechnol 2015, 33(3):243–246.
67. Hardcastle TJ, Kelly KA. baySeq: empirical Bayesian methods for identi-
fying differential expression in sequence count data. BMC Bioinformatics
2010, 11:422.
68. Trapnell C, Hendrickson DG, Sauvageau M, Goff L, Rinn JL, Pachter L.
Differential analysis of gene regulation at transcript resolution with RNA-seq.
Nat Biotechnol 2013, 31(1):46–53.
69. Wang L, Feng Z, Wang X, Wang X, Zhang X. DEGseq: an R package for iden-
tifying differentially expressed genes from RNA-seq data. Bioinformatics 2010,
26(1):136–138.
70. Leng N, Dawson JA, Thomson JA, Ruotti V, Rissman AI, Smits BM, Haag
JD, Gould MN, Stewart RM, Kendziorski C. EBSeq: an empirical Bayes hier-
archical model for inference in RNA-seq experiments. Bioinformatics 2013,
29(8):1035–1043.
71. Li J, Tibshirani R. Finding consistent patterns: a nonparametric approach for
identifying differential expression in RNA-Seq data. Stat Methods Med Res
2013, 22(5):519–536.
72. Tarazona S, Furio-Tari P, Turra D, Pietro AD, Nueda MJ, Ferrer A, Conesa
A. Data quality aware analysis of differential expression in RNA-seq with
NOISeq R/Bioc package. Nucleic Acids Res 2015, 43(21):e140.
73. Corchete LA, Rojas EA, Alonso- Lopez D, De Las Rivas J, Gutierrez NC,
Burguillo FJ. Systematic comparison and assessment of RNA-seq procedures
for gene expression quantitative analysis. Sci Rep 2020, 10(1):19737.
74. Stupnikov A, McInerney CE, Savage KI, McIntosh SA, Emmert- Streib F,
Kennedy R, Salto-Tellez M, Prise KM, McArt DG. Robustness of differen-
tial gene expression analysis of RNA-seq. Comput Struct Biotechnol J 2021,
19:3470–3481.
75. Feng J, Meyer CA, Wang Q, Liu JS, Shirley Liu X, Zhang Y. GFOLD: a
generalized fold change for ranking differentially expressed genes from RNA-
seq data. Bioinformatics 2012, 28(21):2782–2788.
76. Claverie JM, Ta TN. ACDtool: a web-server for the generic analysis of large
data sets of counts. Bioinformatics 2019, 35(1):170–171.
77. Audic S, Claverie JM. The significance of digital gene expression profiles.
Genome Res 1997, 7(10):986–995.
78. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical
and powerful approach to multiple testing. J R Stat Soc Ser B Methodol 1995,
57(1):289–300.
79. Gene Ontology Consortium. The Gene Ontology resource: enriching a GOld mine.
Nucleic Acids Res 2021, 49(D1):D325–D334.
80. Kanehisa M, Furumichi M, Tanabe M, Sato Y, Morishima K. KEGG: new
perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res
2017, 45(D1):D353–D361.
81. Rodchenkov I, Babur O, Luna A, Aksoy BA, Wong JV, Fong D, Franz M,
Siper MC, Cheung M, Wrana M et al. Pathway Commons 2019 update: inte-
gration, analysis and exploration of pathway data. Nucleic Acids Res 2020,
48(D1):D489–D497.
82. Martens M, Ammar A, Riutta A, Waagmeester A, Slenter DN, Hanspers K,
Miller RA, Digles D, Lopes EN, Ehrhart F et al. WikiPathways: connecting commu-
nities. Nucleic Acids Res 2021, 49(D1):D613–D621.
83. Gillespie M, Jassal B, Stephan R, Milacic M, Rothfels K, Senff-Ribeiro A, Griss
J, Sevilla C, Matthews L, Gong C et al. The reactome pathway knowledgebase
2022. Nucleic Acids Res 2022, 50(D1):D687–D692.
84. Xie Z, Bailey A, Kuleshov MV, Clarke DJB, Evangelista JE, Jenkins SL,
Lachmann A, Wojciechowicz ML, Kropiwnicki E, Jagodnik KM et al. Gene set
knowledge discovery with Enrichr. Curr Protoc 2021, 1(3):e90.
85. Young MD, Wakefield MJ, Smyth GK, Oshlack A. Gene ontology analysis for
RNA-seq: accounting for selection bias. Genome Biol 2010, 11(2):R14.
86. Eden E, Navon R, Steinfeld I, Lipson D, Yakhini Z. GOrilla: a tool for discovery
and visualization of enriched GO terms in ranked gene lists. BMC bioinfor-
matics 2009, 10:48.
87. Raudvere U, Kolberg L, Kuzmin I, Arak T, Adler P, Peterson H, Vilo J.
g:Profiler: a web server for functional enrichment analysis and conversions of
gene lists (2019 update). Nucleic Acids Res 2019, 47(W1):W191–W198.
88. Huang da W, Sherman BT, Lempicki RA. Systematic and integrative analysis
of large gene lists using DAVID bioinformatics resources. Nat Protoc 2009,
4(1):44–57.
89. Chen J, Bardes EE, Aronow BJ, Jegga AG. ToppGene Suite for gene list enrich-
ment analysis and candidate gene prioritization. Nucleic Acids Res 2009,
37(Web Server issue):W305–311.
90. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA,
Paulovich A, Pomeroy SL, Golub TR, Lander ES et al. Gene set enrichment
analysis: a knowledge-based approach for interpreting genome-wide expres-
sion profiles. Proc Natl Acad Sci U S A 2005, 102(43):15545–15550.
91. Doncheva NT, Morris JH, Gorodkin J, Jensen LJ. Cytoscape stringapp:
network analysis and visualization of proteomics data. J Proteome Res 2019,
18(2):623–632.
92. Montojo J, Zuberi K, Rodriguez H, Kazi F, Wright G, Donaldson SL, Morris Q,
Bader GD. GeneMANIA Cytoscape plugin: fast gene function predictions on
the desktop. Bioinformatics 2010, 26(22):2927–2928.
93. Bindea G, Mlecnik B, Hackl H, Charoentong P, Tosolini M, Kirilovsky A,
Fridman WH, Pages F, Trajanoski Z, Galon J. ClueGO: a Cytoscape plug-in
to decipher functionally grouped gene ontology and pathway annotation
networks. Bioinformatics 2009, 25(8):1091–1093.
94. Merico D, Isserlin R, Stueker O, Emili A, Bader GD. Enrichment map: a
network-based method for gene-set enrichment visualization and interpret-
ation. PLoS One 2010, 5(11):e13984.
high measurement noise in the scRNA-seq data. To effectively extract the rich
information embedded in the scRNA-seq data, these characteristics demand
development and deployment of algorithms and tools that are different from
those designed for bulk RNA-seq.
Furthermore, the questions that can be answered from scRNA-seq data go
beyond those answerable from bulk RNA-seq. Detection of cell-to-cell variation, identifi-
cation and visualization of different cell types/identities in a population, and
inference of cellular developmental trajectories are all new realms that have
emerged from the development of scRNA-seq. Some other data analytical
topics such as data preprocessing, normalization, batch effect correction, and
clustering are also significantly different from those covered in Chapter 7,
and are presented in detail in this chapter. On topics that are similar to and/
or have significant overlap with those covered in Chapter 7, such as identifi-
cation and functional analysis of differentially expressed genes, only aspects
that are specific to or need adjustment for scRNA-seq data are presented to
avoid redundancy.
8.1 Experimental Design
8.1.1 Single-Cell RNA-Seq General Approaches
The landscape of single-cell sequencing, including RNA-seq, has been rap-
idly evolving. A multitude of strategies and platforms exist
today, with new ones continuously being developed and
existing ones improved upon. In general, existing scRNA-seq strategies
can be divided into two broad approaches, i.e., low- and high-throughput,
based on the number of cells that can be analyzed at a time.
Low-throughput methods can process up to a few hundred cells at a time,
while those of the high-throughput approach allow simultaneous analysis
of thousands to tens of thousands of cells, or even more with the technolo-
gies continuing to evolve. A good example of the low-throughput approach
is Smart-seq3 [6]. The low-throughput approach suits situations where the
focus is on a small number of cells that need detailed molecular character-
ization. This approach has higher sensitivity, which leads to the identifica-
tion of more transcripts and genes from each cell. It also provides full-length
coverage of the transcriptome, thus enabling recognition of different splicing
isoforms and allelic gene expression.
There are currently a multitude of platforms for the high-throughput
approach. These platforms include Drop-seq [7], inDrop [8], sci-RNA-seq for
single-cell combinatorial indexing RNA sequencing [9], and 10× Chromium
[10]. Among these various platforms, the commercially available 10×
Chromium platform has been more widely adopted so far in the biomed-
ical research community. Benchmark studies have shown that 10× generally
has more consistent performance, better sensitivity and precision, and lower
technical noise [11, 12]. In general, the high-throughput approach is more
suitable to study cellular heterogeneity in a large population of cells. The
transcript detection sensitivity is typically lower with this approach; as a
result, the data is sparser. The detection and counting of transcripts are usu-
ally based on sequencing of either the 3’ or 5’ end and not full length. Because
of such differences between the two approaches, it is advisable to evaluate
the particular needs of a project in order to decide which approach is more
appropriate.
Despite the differences, the two general approaches share basically the
same workflow in wet lab operational procedures. As detailed in Section
8.2, for both approaches single cells need to be prepared from dissociation of
input materials, such as an organ or a tissue biopsy, followed by partitioning
of the cells. The mechanism of cell partitioning varies with the approach
and particular platform chosen, e.g., cells can be partitioned into 96- or 384-
well plates (such as for the low-throughput approach), microfluidic droplets
(used by many of the high-throughput platforms), etc. On the 10× Chromium
platform, single cells are partitioned into nanoliter-scale GEMs (or Gel beads-
in-EMulsion), with each GEM carrying a unique barcode. After partitioning,
cells are lysed to release cellular RNA, which is then reverse transcribed into
cDNA. This is followed by cDNA amplification and subsequently library
construction. Because of the shared commonalities in lab operation as well as
data analysis, the two general approaches are not covered separately in the
following sections unless otherwise noted.
The maximum allowable cell number varies with platform; for example, on
the current configuration of the 10× Chromium, 10,000 cells is the upper limit. For
any droplet-based platform, including the Chromium, the limit on maximum
cell number is governed by Poisson statistics, as the loading of
single cells into droplets (or GEMs for Chromium) is a Poisson process. This
process leads to the formation of doublets (or multiplets), i.e., two (or more)
cells being loaded into the same droplet/GEM and treated as one cell, and
the rate of their formation follows Poisson statistics. To ensure scRNA-seq
data quality, the rate of doublets/multiplets needs to be controlled at a man-
ageable level (usually <5%).
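Because droplet loading follows a Poisson process, the expected multiplet rate can be estimated from the average number of cells loaded per droplet. The short Python sketch below computes this estimate for a few illustrative loading rates; the lambda values are hypothetical.

    import math

    def multiplet_rate(lam):
        """Fraction of cell-containing droplets holding two or more cells,
        assuming the number of cells per droplet is Poisson(lam)."""
        p0 = math.exp(-lam)           # probability that a droplet is empty
        p1 = lam * math.exp(-lam)     # probability of exactly one cell
        return (1 - p0 - p1) / (1 - p0)

    for lam in (0.01, 0.05, 0.1, 0.2):
        print(f"lambda = {lam}: multiplet rate = {multiplet_rate(lam):.1%}")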
The appropriate sequencing depth depends on how transcription-
ally active the cells in the sample are, and on the diversity of their transcriptomes.
Without this knowledge a priori, the generally suggested depth for 3’ (or
5’) end gene expression profiling on the 10× Chromium platform is a min-
imum of 20,000 reads per cell using the current v3 chemistry. The number of
detected genes at this depth varies with cell type, e.g., 1,000–2,000 for periph-
eral blood mononuclear cells and 2,000–3,000 for neurons. Sequencing depth
may be fine-tuned based on specific project needs, and increasing sequen-
cing depth from the suggested minimum depth generally leads to identifica-
tion of more genes. But beyond a certain point (which varies with cell type), further
increases in sequencing depth lead to diminishing returns.
On the question of how to balance cell number and sequencing depth
to maximize the amount of information obtained, the options are either
sequencing fewer cells at greater depth, or more cells with fewer reads per
cell. The former option allows identification of more transcripts and genes,
and as a result generates a more accurate picture of each cell’s transcription
status. However, the fewer cells used may not offer a sufficient representa-
tion of the cellular population under study. The latter option, on the other
hand, allows analysis of more cells to increase cell representation, but at the
expense of identifying fewer transcripts and genes per cell. To quantify the
tradeoff between sequencing depth and cell number, computational simula-
tion using a multivariate generative model showed that increasing sequen-
cing depth is better than increasing cell number before reaching the depth of
15,000 reads per cell, beyond which point there are diminishing returns [14].
Another modeling study demonstrated that under a fixed budget the optimal strategy is
to sequence as many cells as possible at the depth of one read per gene per
cell [15]. Commonly used scRNA-seq pipeline tools like the open-source 10×
Cell Ranger [10] and zUMIs [16] have a downsampling function to help deter-
mine whether the library was sequenced to saturation, or whether additional
sequencing would increase the number of detected genes cost effectively.
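The idea behind such a saturation check can be sketched in a few lines of Python: thin the UMI counts of a (hypothetical) gene-cell matrix to a series of fractions and track how many genes remain detected; real pipeline tools perform this downsampling at the read level with more care.

    import numpy as np

    rng = np.random.default_rng(1)
    # Hypothetical sparse UMI count matrix: 2,000 genes x 500 cells.
    counts = rng.poisson(lam=0.005, size=(2000, 500))

    def genes_detected(mat, fraction):
        # Binomially thin each count to simulate sequencing only a fraction of reads.
        sub = rng.binomial(mat, fraction)
        return int((sub.sum(axis=1) > 0).sum())

    for frac in (0.1, 0.25, 0.5, 0.75, 1.0):
        print(f"fraction {frac}: {genes_detected(counts, frac)} genes detected")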
To further compare the two options, another factor to consider is that
many genes are often co-regulated in a cell, forming functional modules. The
modularity of the cellular gene transcriptional system is manifested in the
extensive gene-gene covariance embedded in the sequencing data, which
FIGURE 8.1
Single-cell RNA-seq general lab process. Single cells are first partitioned into individual droplets, wells, tubes, etc. Once partitioned, cells are lysed to
release RNA for reverse transcription into cDNA. During cDNA synthesis and subsequent sequencing library preparation, cell-specific barcodes are
incorporated to track transcripts from each cell. After sequencing of scRNA-seq libraries, reads are demultiplexed as part of preprocessing to reveal
transcripts from each cell using cell barcodes. Counting of transcripts from each cell generates a gene-cell matrix for further analyses.
FIGURE 8.2
Structure of a 10× scRNA-seq 3’ library. To sequence such a library using Illumina sequencers,
Read 1 covers 10× barcodes (16 bp) that track individual cells and UMI (12 bp) for removal of
PCR duplicates, while Read 2 is used to sequence actual cDNA fragments for gene detection.
The dual sample indices (i5 and i7, 10 bp each) are used to separate reads into different samples,
each of which may contain thousands of cells. (This illustration is based on 10× scRNA-seq v3.1
chemistry. Image provided by 10× Genomics.)
FIGURE 8.3
Basic scRNA-seq data analysis workflow: read alignment & gene counting, data preprocessing,
normalization, signal imputation, feature selection, dimension reduction & visualization, cell
clustering, cell identity annotation, and downstream analyses.
FIGURE 8.4
Gene-cell count matrix.
After collapsing the UMIs, the number of UMIs is counted for each gene and each cell barcode,
thereby generating a gene-cell matrix (Figure 8.4). This matrix becomes the
basis of nearly all downstream analyses.
The above processes of genome alignment, read classification, barcode
assignment, UMI collapsing, and transcript counting can be accomplished
together in a single workflow using various pipeline tools. For instance, for
10× data the “cellranger count” pipeline is typically used. However, it requires
significant computational resources and is not very fast. To increase pro-
cessing efficiency and speed up these preprocessing steps, newer tools such as
STARsolo [29], kallisto/bustools [30], and Alevin [31] have been developed.
Besides 10× data, these newer tools can also be used to process scRNA-seq
data generated from other platforms. As suggested by its name, STARsolo is
built around the high-performing STAR aligner and performs read mapping
as well as cell barcode assignment, UMI collapsing, and gene-cell matrix
creation. Kallisto,
which aims to achieve a balance between computing efficiency and accuracy,
can generate pseudoaligned scRNA-seq data in a new format called BUS,
which provides a binary representation of the data in the form of barcode,
UMI, and sets of equivalence classes. Once generated, BUS files can be
manipulated with bustools [30] to produce a data matrix consisting of gene
count and barcode. To improve the accuracy of gene abundance estimates,
Alevin includes gene-ambiguous reads in its quantification, i.e., those that
multimap between genes and are usually discarded by other tools.
cells. Detection and removal of reads associated with such unwanted GEMs
are based on three indicators: the number of uniquely mapped genes per
GEM barcode, the total count of UMIs per GEM barcode, and the fraction of
reads per barcode that map to the mitochondrial genome. If a barcode has
unusually high gene and UMI counts, it is possibly associated with a doublet
(or multiplet). Conversely, if a barcode is associated with few mapped genes,
a low UMI count, and/or a high fraction of mitochondrial genes, it is an
indication of ambient RNA or a dead cell in the original sample. A dead cell
may have most of its cytoplasmic mRNA leaked out due to a compromised
cell membrane, with only mitochondrial RNA preserved because of the
organelle’s double membrane system (Chapter 1, Section 1.4.9). To detect
such unwanted reads, these three quality indicators should be used in com-
bination rather than alone. It should also be noted that under some cellular
conditions the assumptions behind these commonly used indicators may not
hold. For example, cells in a high metabolic state may have an unusually high
mitochondrial RNA fraction, or very large cells may appear to be doublets.
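As a sketch of how these three indicators are typically combined in practice, the following uses the Python toolkit Scanpy (introduced later in this chapter) on a hypothetical 10× output directory; the cutoff values are illustrative only and should be chosen from the distributions observed in the dataset at hand.

    import scanpy as sc

    adata = sc.read_10x_mtx("filtered_feature_bc_matrix/")  # hypothetical path

    # Flag mitochondrial genes (human "MT-" prefix; "mt-" for mouse) and compute
    # per-barcode QC metrics: detected genes, total counts, percent mitochondrial
    adata.var["mt"] = adata.var_names.str.startswith("MT-")
    sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)

    # Combine the three indicators; these example thresholds are not recommendations
    keep = (
        (adata.obs["n_genes_by_counts"] > 200)
        & (adata.obs["n_genes_by_counts"] < 6000)
        & (adata.obs["pct_counts_mt"] < 15)
    )
    adata = adata[keep].copy()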
Doublets/multiplets represent hybrid- or super-transcriptomes and con-
found downstream data analyses if not removed. Besides the basic detec-
tion method using the total count of genes and UMIs as mentioned above, a
number of specially developed doublet/multiplet detection tools are avail-
able, including DoubletFinder [32], Scrublet [33], DoubletDetection [34],
cxds and/or bcds [35], and solo [36]. In general, these methods work by first
building artificial doublets by combining randomly selected droplets
(or GEMs), then generating a “doublet score” for each droplet based on its
similarity to the artificial doublets, and finally calling doublets if the score
surpasses a threshold. The major difference between these methods lies in
how the doublet score is generated. For example, DoubletFinder, one of the
top performing tools based on a benchmark study [37], uses the k-nearest
neighbors (kNN) method to calculate the proportion of nearest artificial
doublet neighbors as the score for each droplet. It should be noted that none
of these tools works well for every case. In addition, to avoid removing large-
sized cells that merely appear to be doublets, the rate of doublets/multiplets
identified by these tools for removal should not exceed that expected from
Poisson statistics for the experimental condition.
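A minimal sketch of score-based doublet detection with Scrublet, one of the tools listed above, is shown below; the AnnData object (adata) is assumed to carry the raw counts from the QC sketch earlier, and the expected doublet rate is an assumption that should be set from the actual loading conditions.

    import scrublet as scr

    # adata.X: cells x genes matrix of raw UMI counts from the QC step (assumed)
    scrub = scr.Scrublet(adata.X, expected_doublet_rate=0.05)

    # Simulate artificial doublets, score each barcode by its similarity to them
    # in a reduced space, and call doublets above an automatically chosen threshold
    doublet_scores, predicted_doublets = scrub.scrub_doublets()

    # Keep only barcodes not flagged as doublets
    adata = adata[~predicted_doublets].copy()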
Besides using software tools alone, experimental techniques can also be
used to improve doublet detection, such as cell hashing [38], or mixing of cells
from different species (e.g., human and mouse cells). With cell hashing,
antibodies against ubiquitous cell surface proteins are tagged with oligo-
nucleotide barcodes to distinguish cells from different samples for robust
identification of cross-sample doublets/multiplets. Along the same line,
genotypic differences between cells, such as those collected from unrelated
individuals, can also be used for detection of doublets/multiplets. For
example, tools like demuxlet [39], scSplit [40], souporcell [41], and Vireo [42]
can separate mixed cells into individual samples based on each sample’s
(Figure: plots of total count versus barcode rank, shown separately for large cells and small cells.)
8.3.4 Normalization
While the use of UMIs removes the effects of PCR duplicates, other factors
during single-cell sample processing can still introduce undesirable variations
among cells or samples. Such factors include RNA capture rate, reverse tran-
scription efficiency, random sampling of molecules during sequencing, and
sequencing depth. If uncorrected, such variations may lead to inaccurate
results in downstream analytic steps. The goal of normalization is to correct
for such variations in order to make cells and samples directly comparable.
Normalization approaches developed for bulk RNA-seq (Chapter 7, Section
7.3.3) may be used for scRNA-seq data, especially those generated on low-
throughput platforms with full-length transcript coverage. In general, how-
ever, normalization of scRNA-seq data faces unique challenges mostly due to
the issue of signal sparsity. To address the challenges, a number of methods
have been developed specifically for normalizing scRNA-seq data. Examples
of these methods are SCnorm [49], Linnorm [50], BASiCS [51], Census [52],
ZINB-WaVE [53], and sctransform [54]. Besides these dedicated normal-
ization methods, commonly used scRNA-seq pipeline toolkits such as Cell
Ranger, Seurat [55], Scanpy [56], scran [57], and scVI [58] also contain their
own normalization methods.
These normalization methods can be classified into two general
categories: global scaling-based and modeling-based. The former approach,
represented by BASiCS and those employed by Cell Ranger, Seurat, Scanpy,
and scran, is based on the use of a global “size factor.” As an example, Cell
Ranger performs normalization in its aggr pipeline, through subsampling
of libraries of higher sequencing depth until all libraries have on average
the same number of confidently mapped reads per cell. Seurat, as another
example, includes a similarly simple normalization process performed as
follows: first divide the UMI count of each gene by the total number of UMIs
in each cell, then multiply the resultant ratios with a scaling factor (typic-
ally in the range of 10⁴ to 10⁶), and lastly perform a log transformation of the
scaled values (actually log(x+1) to accommodate zero-count genes) to gen-
erate normalized data. This global scaling approach assumes that RNA con-
tent is constant across all cells, and that one size factor fits all genes/cells. This
assumption may not always hold, especially with highly heterogeneous
populations containing cells of different sizes and RNA content. To address this
concern, methods in the other category, as represented by SCnorm, Linnorm,
Census, ZINB-WaVE, sctransform, and scVI, are based on modeling of cel-
lular molecule counts using probabilistic approaches. SCnorm, for example,
employs quantile regression modeling, which is used to group genes based
on the relationship between their UMI count and sequencing depth. Genes
in different groups are normalized using group-specific size factors. With
sctransform, which has been included in Seurat since version 3 as a normalization option,
a regularized negative binomial (NB) regression model is used. This method
first constructs a generalized linear model (GLM) for each gene to estimate
the relationship between its UMI count and sequencing depth. The estimated
parameters are then regularized based on gene expression level. To generate
normalized values for each gene, the regularized parameters are applied to
an NB regression model. Overall, these methods are based on models that
make different assumptions about the sparsity and underlying distribution
of gene expression values in cells. Comparative studies on these methods
have reported that the performances of different methods vary from dataset
to dataset, and therefore it is advisable to use more than one method on the
dataset at hand and then select the one that has the best performance [59, 60].
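As a sketch, the simple global-scaling scheme described above (per-cell total-count scaling followed by log(x+1)) can be expressed in a few lines; Scanpy's normalize_total and log1p functions, shown here on the adata object from the earlier sketches, implement essentially this approach.

    import scanpy as sc

    # Scale each cell's counts so that all cells have the same total (here 10^4),
    # which corresponds to dividing by per-cell totals and multiplying by a
    # global scaling factor
    sc.pp.normalize_total(adata, target_sum=1e4)

    # log(x + 1) transformation to compress the dynamic range and accommodate zeros
    sc.pp.log1p(adata)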
Variance stabilization is often an inherent goal of normalization. The log
transformation used at the end of many methods achieves this goal
while also making the normalized gene expression approximate a normal
distribution to facilitate downstream analyses. Without this stabilization,
the magnitude of a gene’s average expression correlates with the mag-
nitude of its variance, i.e., the so-called mean-variance relationship. Because
of this relationship, highly expressed genes also tend to have high levels
of variance, even if they do not contribute to cellular heterogeneity, such as
housekeeping genes. Conversely, genes that are expressed at low levels have
relatively low variance, even if they are biologically significant, such as
those coding for transcription factors.
remove the unwanted effects of this relationship, so that genes with greater-
than-expected variance between cells, regardless of the magnitude of their
expression, can be revealed.
FIGURE 8.6
Correction of batch effects. Without batch correction (left) batch effects are evident with
cells being colored by their original batches. With batch correction (right) batch effects are
removed. (Adapted from: MD Luecken, FJ Theis, Current best practices in single-cell RNA-
seq analysis: a tutorial, Molecular Systems Biology 2019, 15(6):e8746. Used under the terms of
the Creative Commons Attribution 4.0 License, https://creativecommons.org/licenses/by/4.0,
©2019 Luecken et al.)
8.3.6 Signal Imputation
As mentioned at the beginning of this chapter, technical factors such as low
mRNA molecular capture rate and non-exhaustive sequencing cause signal
dropout, i.e., the inability to detect a transcript that is present in a cell. This
leads to high signal stochasticity, signal sparsity, and zero inflation, all of
which can affect downstream data analyses. Signal imputation, also called
denoising or expression recovery, aims at inferring missing transcript values
to help alleviate this problem. Some of the commonly used scRNA-seq signal
imputation methods include MAGIC [72], kNN-smoothing [73], SAVER and
SAVER-X [74, 75], ALRA [76], CIDR [77], DCA [78], scImpute [79], mcImpute
[80], and DrImpute [81]. Some of the pipeline tools introduced earlier have
a built-in imputation function (such as scVI), or use wrappers to run exter-
nally developed imputation tools (such as Seurat running imputation using
ALRA). These methods can be separated into three groups based on the
general approach they employ [82]: (1) modeling-based: SAVER, CIDR, and
between different cell groups or identities, and therefore most likely con-
tribute to cell-to-cell variation. This relies on the premise that genes showing
high variability across cells are a result of biological effects, not experimental
noise. Methods for selecting highly variable genes (HVGs) include the
squared coefficient of variation method proposed by Brennecke et al. (2013)
[86], the FindVariableGenes method used by Seurat, and those incorporated in
other pipeline tools such as scran and scLVM [87]. Because of the heteroscedasticity
of cellular gene expression, i.e., the aforementioned mean-variance relation-
ship, the selection of HVGs should not be based on variance alone. Instead,
these methods fit the relationship between variance and the mean into their
respective models, based on which HVGs are selected using different stat-
istical tests. For example, scran performs a LOESS fit on the variance-mean
relationship, and then uses the fit as the model to infer biological variation
across cells for HVG selection. Evaluation of commonly used HVG selection
methods has reported large differences among the methods, and found that
different tools perform optimally for different datasets [88]. Besides using the vari-
ability of a gene’s expression across cells, alternative feature selection strat-
egies include using average gene expression level to select genes with the
highest average expression [89], deviance to identify genes that deviate from
the null model of constant expression [90], or dropout rate to retain genes
with a higher number of dropouts than expected [91].
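A sketch of mean-variance-aware HVG selection using Scanpy's Seurat-flavored implementation is shown below; retaining the top 2,000 genes is a common but arbitrary choice, and the adata object is assumed to hold the log-normalized data from the previous sketches.

    import scanpy as sc

    # Fit the mean-variance (mean-dispersion) relationship on log-normalized data
    # and rank genes by variability in excess of the fitted trend, rather than by
    # raw variance
    sc.pp.highly_variable_genes(adata, flavor="seurat", n_top_genes=2000)

    # Restrict the matrix to the selected HVGs for downstream steps
    adata = adata[:, adata.var["highly_variable"]].copy()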
8.4.2 Dimension Reduction
Collected from a large number of cells (e.g., tens of thousands) for a large
number of genes (potentially all genes in the genome), scRNA-seq data
has high dimensionality. This, in data science terms, causes the curse of
dimensionality, i.e., the amount of data needed for accurate generalization
grows exponentially with the increase in dimensions. Computationally,
high dimensionality leads to mathematical intractability for many modeling
and statistical calculations. Feature selection is one preliminary step toward
dimensionality reduction. However, even after this step the dimensionality
of the data is still very high. For example, there are still 2,000 gene dimensions
if the top 2,000 HVGs are retained. For many downstream analytical steps,
such as visualization and clustering to be detailed next, the number of
dimensions must first be significantly reduced to only a very small number
of dimensions (e.g., 10). To achieve this, specialized dimensionality reduc-
tion methods are required. The goal of these methods is not to select a small
number of original features, but to transform the data to create new features
so that the information contained in the original data can be preserved in the
low-dimensional space.
Among the most commonly used dimensionality reduction methods are
those based on linear transformation. Principal components analysis (PCA),
independent component analysis (ICA) [92], non-negative matrix factorization
(NMF) [93], and factor analysis [94] are all examples that have been applied
to scRNA-seq data. Among these methods, PCA is perhaps the best known.
It projects the cell-gene count matrix onto a subspace that is defined by a few
principal components, which are linear combinations of the original genes.
The first principal component, or PC1, is the axis in the new subspace along
which the maximal amount of variation in the original data is captured.
The second axis, corresponding to PC2, is orthogonal to the first and captures
the second most variation in the data. Although straightforward, PCA does
not take signal dropout into consideration. ZIFA (zero-inflated factor ana-
lysis), often considered to be a variation of PCA, was developed to address this
issue [94]. Although effective, such PCA-based approaches have one down-
side, i.e., the components that capture the majority of variance in the
original data are sometimes difficult to interpret biologically. Other methods,
such as f-scLVM (or factorial single-cell latent variable model) and NMF,
address this difficulty through generation of reduced dimensions that are
more biologically relevant. For example, reduced dimensions from f-scLVM
are based on explicit modeling of bio-pathway annotations of gene sets [95].
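A minimal PCA sketch on the HVG-restricted, normalized matrix, again using Scanpy; the choice of 50 components is illustrative.

    import scanpy as sc

    # Standardize genes so that highly expressed genes do not dominate the components
    sc.pp.scale(adata, max_value=10)

    # Project cells onto the top principal components
    sc.tl.pca(adata, n_comps=50, svd_solver="arpack")

    # Variance explained per component, useful for choosing how many PCs to keep
    print(adata.uns["pca"]["variance_ratio"][:10])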
The linear dimensionality reduction methods introduced above are based
on the assumption that the underlying data structure is linear in nature.
Methods that are based on non-linear transformation do not make this
assumption; instead, these methods operate under the premise that in a high-
dimensional space most relevant information concentrates in a small number
of low-dimensional manifolds. Currently available non-linear methods
include t-SNE (t-distributed Stochastic Neighbor Embedding) [96], MDS
(Multi-Dimensional Scaling) [97], Isomap (Isometric Feature Mapping) [98],
LLE (Locally Linear Embedding) [99], diffusion maps [100], spectral embed-
ding [101], and UMAP (Uniform Manifold Approximation and Projection)
[102, 103]. The t-SNE algorithm, for example, works by modeling transcrip-
tionally similar cells around each cell based on probability distribution,
with Gaussian and t-distributions being used in the original and dimension-
reduced space, respectively. The process first computes cell-cell similarity
in the original space using a Gaussian kernel and then maps the cells to a
dimension-reduced space that best preserves that similarity. The strength of
t-SNE is to reveal local data structure, but this is often achieved at the expense
of global data structure. An important parameter in the t-SNE algorithm is
perplexity, which effectively sets the number of transcriptionally similar
neighbors considered for each cell. Proper adjustment of this parameter can help regulate
the balance between local and global data structure [104]. UMAP is a more
recent algorithm that is designed to provide better preservation of global
data structure without losing performance on local data structure, better
scalability, and faster computation speed. While it has a similar procedure to
t-SNE in that it first constructs a high-dimensional representation of the ori-
ginal dataset and then projects it to a low-dimensional space, UMAP differs
from t-SNE in how cell-cell similarity and the topology of cellular relations in
8.4.3 Visualization
One goal of dimensionality reduction is to enable graphical depiction of the
underlying data so that the researcher can visualize the major cell types (or
conditions) in the dataset, and thereby intuitively understand the inherent het-
erogeneity of the represented cellular population. In such a visualization (Figure 8.7),
scatter plots are often used, in which each point represents a cell projected
into a two- or three-dimensional space, with each dimension corresponding
FIGURE 8.7
Visualization of single-cell RNA-seq data in 2-D space. Different visualization methods are
applied to different datasets (including Shekhar 2016, n = 6,174; Zeisel 2015, n = 3,005; an
embryoid body dataset, n = 16,825; and Zunder 2016, n = 220,450) to show how these methods
differ from each other in generating visualizations for datasets of different characteristics.
(Adapted by permission from Springer Nature Customer Service Centre GmbH: Springer
Nature, Nature Biotechnology, Visualizing structure and transitions in high-dimensional
biological data, Kevin R. Moon et al., Copyright 2019.)
The K-means clustering algorithm first requires the user to specify k, the
number of clusters. The process starts with all cells randomly assigned to
one of k clusters. Then the centroid of each cluster is determined, and each
cell is re-assigned based on its distance to each of the k centroids. This pro-
cess is repeated until each cell’s cluster assignment no longer changes. The
third commonly used approach, graph-based clustering, is based on graph
construction that connects cells to their nearest neighbors. In a kNN graph
(different from K-means), for example, two nodes (cells) are connected by
an edge if the distance from cell A to cell B is among the k smallest distances
from A to all other cells. The edge may have a weight assigned based on the
similarity between the cells. In a Shared Nearest Neighbor (SNN) graph,
an edge between two cells is weighted based on their similarity in
terms of the number of mutual neighbors the two cells share. After such a
graph is constructed, dense regions that contain a large number of highly
connected nodes can be detected as the so-called communities, representing
distinct cell identities. Within each community (or cluster), cells are more
highly connected with each other, indicative of high similarity, than with those in
other communities. To partition cells into distinct communities, community
detection techniques, such as the Louvain and Leiden methods [116, 117], can
be used.
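A sketch of graph-based clustering as described above, using Scanpy's neighbor-graph construction and Leiden community detection on the PCA result from the earlier sketch; the number of neighbors and the resolution are typical but arbitrary settings.

    import scanpy as sc

    # Build a kNN graph of cells in PCA space
    sc.pp.neighbors(adata, n_neighbors=15, n_pcs=30)

    # Partition the graph into communities (clusters) with the Leiden algorithm;
    # a higher resolution yields more, smaller clusters
    sc.tl.leiden(adata, resolution=1.0, key_added="leiden")

    # Visualize the clusters on a UMAP embedding computed from the same graph
    sc.tl.umap(adata)
    sc.pl.umap(adata, color="leiden")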
To carry out these clustering approaches, either general-purpose clustering
methods or tools specifically designed for scRNA-seq data can be used. As
examples of general-purpose methods, the hclust() and kmeans() functions
in R can be directly used on PCA dimension reduced data to perform hier-
archical and K-means clustering, respectively. Examples of tools specially
developed for clustering single cells include SC3 [118], pcaReduce [119],
CIDR, RaceID2 [120], SIMLR [108], and SNN-Cliq [121] (Table 8.1). Many
pipeline tools, such as Seurat, Cell Ranger, Pagoda2 [122], Scanpy, scran,
ascend [97], and SINCERA [123], also provide built-in clustering functional-
ities. Seurat, for example, provides graph-based clustering. This process first
constructs a kNN graph using Euclidean distance in the PCA space, with the
edges weighted by the Jaccard similarity, which measures the number of
neighbors the connected cells share. To find clusters in the graph, the Louvain
community detection method is then applied. Seurat clustering uses a user-
adjustable parameter called resolution to control the number of clusters
generated, with a higher
resolution (e.g., >1.0) leading to a larger number of clusters. Cell Ranger
provides a similar nearest neighbor graph-based clustering approach, with an
additional step to merge clusters that show no differential gene expression. In
addition, it also offers K-means clustering as another option. Benchmarking
studies on most of these clustering methods show that there is wide variation
in actual performance and poor concordance among them [89, 124]. Some
methods, such as Seurat, SC3, and Cell Ranger, showed better overall per-
formance than others. In terms of running times, Seurat also showed consist-
ently faster speed than most other methods.
TABLE 8.1
Single-Cell Clustering Methods
Name Description Reference
Cell Ranger finds differential genes for each cluster using edgeR (covered in
Chapter 7), or sSeq, which is a modified NB exact test [125]. From the iden-
tified differentially expressed genes, which serve as candidate marker genes,
manual annotation can be used to identify the cell type or state of each cluster,
based on previously
characterized, canonical cell identity specific marker genes. For example,
GFAP (glial fibrillary acidic protein) gene expression is a marker of astroglial
cells in the brain, and CD79a and CD79b are markers for B cells. The identifi-
cation of cell type or state based on such classic gene markers is an extension
of the traditional method often used in the lab for cell identity recognition.
This process, however, is typically labor-intensive and time-consuming,
as it often needs extensive review of currently available literature, or deep
domain knowledge about the cell system under study, which requires close
interactions between bench scientists and informaticians. In addition, some
cell types may not have well characterized gene markers, or the expression of
marker genes in host cells may be undetectable due to signal dropout.
To address some of the issues and speed up the cell identity recognition
process, a relatively easy and semi-automatic approach is through Gene
Ontology (GO) and/or bio-pathway enrichment analysis of the candidate
marker genes (detailed in Chapter 7, Section 7.3.8), since the GO terms or
pathways significantly enriched in the genes can produce insights into cell
identity. For example, if “hepatocyte homeostasis” is identified as a signifi-
cantly enriched GO term, it is indicative that at least some of the cells in the
cluster are hepatocytes. Further, the cell identity detection process on the
basis of canonical marker genes can be automated, with the use of specialized
tools such as Garnett [126], Digital Cell Sorter [127], SCINA [128], CellAssign
[129], and scANVI [130]. The list of marker genes required by these automated
tools to classify different cell identities can be provided by databases such as
PanglaoDB [131], CellMarker [132], the Mouse Brain Atlas [133], the BRAIN
Initiative Cell Census Network (or BICCN) [134], and DropViz.org [135].
Garnett, as an example, first uses as input a list of gene markers to train a
regression-based classifier, which is then applied to classify cells in a new
dataset. As another example, scANVI, a semi-supervised variant of scVI, also
classifies cells based on their expression of canonical marker genes. This tool
goes even further to use these cells as “seeds” to classify other cells in the
same dataset with unobserved expression of the marker genes, on the basis
of how close these other cells are to the “seeds.”
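As a simple illustration of marker-based annotation (not the procedure of any particular tool above), each cluster can be scored against a user-supplied marker list; the marker sets below are small, hypothetical examples built around the canonical markers mentioned earlier, and the "leiden" cluster labels are assumed to come from the clustering sketch.

    import scanpy as sc

    # Hypothetical canonical marker sets for two cell identities
    markers = {
        "B cells": ["CD79A", "CD79B", "MS4A1"],
        "Astrocytes": ["GFAP", "AQP4", "SLC1A3"],
    }

    # Score every cell for each marker set (average expression of the set,
    # corrected against a randomly sampled background gene set)
    for cell_type, genes in markers.items():
        sc.tl.score_genes(adata, gene_list=genes, score_name=f"score_{cell_type}")

    # Average the scores per cluster to suggest a label for each cluster
    score_cols = [f"score_{ct}" for ct in markers]
    cluster_scores = adata.obs.groupby("leiden")[score_cols].mean()
    print(cluster_scores.idxmax(axis=1))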
Instead of expression of marker genes, overall gene expression pattern may
also be used for automated inference of cellular identities. This approach uses
information embedded in the overall gene expression of annotated cells in a
reference dataset, to predict cell identities in a query dataset. Without reliance
on prior knowledge in the form of a pre-defined list of marker genes, this
approach directly harnesses the power of rapidly accumulating single-cell
data that can be used as reference, such as those from the Human Cell Atlas
[136] and the Tabula Muris Atlas for mouse [137]. Table 8.2 lists some of the
TABLE 8.2
Cell Identity Annotation Tools
Name Description Reference
methods that use this strategy. Among them, scmap [138] projects cells in
a query dataset onto a reference dataset (or combined references), to iden-
tify matching individual cells (the scmap-cell mode) or specific cell types
(the scmap-cluster mode). In this method, the similarity between individual
cells in the query sample and cells or cell types in the reference is measured
using three distance metrics (Pearson, Spearman, and cosine). Some of the
other methods are machine learning based, i.e., they first construct classi-
fier models from a reference dataset and then use the classifiers to annotate
individual cells or clusters from a query dataset. SingleCellNet (SCN), for
example, uses a random forest (RF) classifier trained on annotated reference
scRNA-seq data to classify query cells [139]. ACTINN (or
Automated Cell Type Identification using Neural Networks) [140] is based
on neural networks for such transfer learning. Another generalizable method,
scPred uses support vector machines (SVMs), combined with singular value
decomposition for unbiased feature selection, to perform probability-based
cell type prediction [141]. Besides these machine learning-based tools spe-
cially designed for single-cell data, general-purpose classifiers including
SVM and RF may also be used directly [142]. Among other cell classification
methods that use overall gene expression pattern instead of marker genes
are those developed for integrated analysis of multiple scRNA-seq datasets,
such as Seurat Integration, Scanorama, and Conos as introduced in Section
8.3.5 (therefore not listed on Table 8.2). These methods achieve automated
annotation of cells from a query dataset through detection of equivalent cells
in a reference dataset. As an example, Seurat performs annotation of query
cells by mapping the query dataset onto a reference, which is accomplished
through projecting the query cells onto the reference UMAP structure.
Because the diverse array of methods introduced above may not always
use the same term to label the same cell type, for consistency cell annota-
tion can benefit from the use of standardized cell type terms from the Cell
Ontology (CO). CO is a community effort to organize cell types anatomic-
ally and hierarchically through a structured and controlled vocabulary [143].
To further leverage the inherent hierarchical relationships built into the CO
terms, specialized CO-based cell annotation tools, such as OnClass [144] and
CellO [145], have also been developed. CellO, for example, performs hier-
archical classification to annotate cells based on the graph structure of the CO
system. Compared to methods that do not use CO terms, such tools provide
more consistent and standardized annotation of individual cells or clusters.
To evaluate the performance of these methods, comprehensive bench-
mark studies have been performed [146, 147]. Using 27 scRNA-seq datasets
of varying cell numbers, platforms, species, and cellular heterogeneity, one
of the studies [146] compared 22 automated cell identification methods and
showed that the general-purpose SVM classifier had the best overall per-
formance. Other top performers included SingleCellNet, scmap-cell, scPred,
8.5.3 Compositional Analysis
The composition of different cell identities in a population varies with internal
and external conditions. For example, upon bacterial pathogen infection, in
the small intestinal epithelium there is a change in the proportions of different
cell types as part of antimicrobial response (Figure 8.8) [153]. Compositional
analysis involves examination of the proportions of different cell identities in
different samples. For this analysis, different statistical approaches have been
used to assess the significance associated with changes of cellular composition.
For example, to detect the change of cell composition in the intestinal epithe-
lium upon infection, Haber et al. (2017) applied a Poisson process to model
FIGURE 8.8
Changes in cellular composition in intestinal epithelium caused by pathogen infection. Shown
here are changes in the fraction of three different types of tuft cells (having chemosensory
function in the gut lining) after infection with the parasitic helminth Heligmosomoides polygyrus.
Significant changes in frequency are marked (* FDR < 0.25, ** FDR < 0.05; Wald test). (Adapted
by permission from Springer Nature Customer Service Centre GmbH: Springer Nature, Nature,
A single-cell survey of the small intestinal epithelium, Adam L. Haber et al., Copyright 2017.)
challenge because of the much larger number of single cells involved in the
comparison. The substantial zero inflation, signal overdispersion, transcrip-
tional bursting, and multimodality associated with scRNA-seq data, how-
ever, pose different challenges.
Methods developed for scRNA-seq DE analysis (see Table 8.3) use different
approaches to address the specific challenges posed by single-cell data.
Examples of these methods are SCDE [158], MAST [159], D3E [160], scDD
[161], BPSC [162], NBID [163], DEsingle [164], DECENT [165], and SwarnSeq
[166].
TABLE 8.3
Single-Cell Differential Expression Analysis Tools
Name Description Reference
To deal with zero inflation, SCDE fits a mixture of two error models,
with one using a Poisson distribution to model the signal dropout process,
and the other using the NB distribution to model the signal amplification
process for detection of transcripts in correlation to their abundance in cells
[158]. MAST uses a two-part generalized linear hurdle model, with one part mod-
eling the discrete expression rate of each gene across cells (i.e., how many
cells express the gene) using logistic regression, and the other part modeling the
continuous positive expression level of each gene by a Gaussian distribution
[159]. To fit zero-inflated and overdispersed scRNA-seq data, SwarnSeq uses
the zero-inflated negative binomial (ZINB) model to model the observed
UMI counts of transcripts. In addition, through using a binomial model to
adjust for cellular RNA capture rates, this method allows detection of DE
genes, as well as differential zero-inflated genes, i.e., those that show signifi-
cant difference in the number of cells that have zero expression between
two groups [166]. To address the multimodal nature of scRNA-seq data,
scDD employs Bayesian modeling to identify genes that display differential
distributions across conditions, and the genes are then further classified into
different multimodal expression patterns. SwarnSeq also classifies influential
genes into various gene types based on their differential expression and zero
inflation patterns. To address the issue of transcriptional bursting, D3E has
two modules, with one for DE gene identification, and the other for fitting a
model for transcriptional bursting to help discover the mechanisms under-
lying the observed expression changes.
The same marker gene finding functions contained in most compre-
hensive pipeline toolkits as introduced earlier can also be generalized
for DE analysis. For example, Seurat provides DE analysis from the same
FindMarkers() function through specifying two groups of cells for com-
parison. Currently available differential test methods include Wilcoxon
rank sum test, likelihood- ratio test, Student’s t- test, negative binomial
GLM, Poisson GLM, logistic regression, as well as the aforementioned
MAST and DESeq2. SINCERA, as another example, offers DE analysis using
one-tailed Welch’s t-test if gene expression can be assumed to come from
two independent normal distributions, or one-tailed Wilcoxon rank sum
test in case of small sample sizes. To identify genes that are differentially
expressed along a developmental lineage or trajectory (Trajectory Inference
to be introduced next), methods such as Monocle 3 [103] and tradeSeq
[167] can be used. Monocle 3, for example, employs two approaches: graph
auto-correlation and regression analysis. The former is suitable for identifying
genes that change along a trajectory, or differ between clusters, while the
latter is used to find genes that change expression under different experimental
conditions.
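As a sketch of a basic between-group DE test analogous to the marker-finding functions described above, Scanpy's rank_genes_groups can be run with a Wilcoxon rank sum test; the group labels used here ("treated" versus "control" in adata.obs) are hypothetical.

    import scanpy as sc

    # Compare expression between two cell groups (e.g., two conditions stored in
    # a hypothetical adata.obs["group"] column) with a Wilcoxon rank sum test
    sc.tl.rank_genes_groups(
        adata,
        groupby="group",
        groups=["treated"],
        reference="control",
        method="wilcoxon",
    )

    # Table of top-ranked genes with log fold changes and adjusted p-values
    df = sc.get.rank_genes_groups_df(adata, group="treated")
    print(df.head())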
To evaluate the plethora of methods that are currently available for scRNA-
seq DE analysis, several benchmarking studies were performed [168–171].
Based on these studies, methods that were originally developed for bulk
RNA-seq data have been shown to perform as well as methods developed
specifically for single-cell data, especially after applying some strategies such
as prefiltering to remove lowly expressed genes [169] or weighting to deal
with zero inflation [172]. As these bulk or scRNA-seq tools have different
ways of dealing with signal sparsity, multimodality, and heterogeneity, there
is a general lack of agreement in the DE genes they identify. In addition, these
benchmarking studies also show a general tradeoff between precision and
sensitivity, i.e., methods of high precision tend to have low sensitivity, which leads
to identification of fewer true positive genes but also introduces fewer false
positives.
It should be noted that DE analysis is an integral step of an scRNA-seq
analytical pipeline, and upstream data processing can have an effect on the
overall performance of this step. Of the various upstream steps, normaliza-
tion has been shown to have a significant impact on DE results by a system-
atic evaluative study conducted by Vieth et al. (2019). Based on this study, a
good normalization before DE analysis, such as that provided by scran, can
alleviate the need for complex DE methods [177]. Another note is that, just
like in bulk RNA-seq analysis as detailed in Chapter 7 (Section 7.3.8), the
identified DE genes can be subjected to further functional analysis, such as
gene set enrichment analysis, to reveal what biological processes or pathways
are enriched in them.
8.7 Trajectory Inference
Many biological processes, such as development, immune response, or tumori-
genesis, are underlain by continuous dynamic cell changes across time. The
path of changes that a cell undergoes in such a process is often called a tra-
jectory. While it is not yet possible to monitor the continuous transcriptomic
change of an individual cell over time, a trajectory can be inferred from a popu-
lation of cells that represents a continuum of transitional cellular states, as the
cells undergo changes in an unsynchronized manner. Because trajectory
inference (TI) is based on a snapshot of gene expression of a population
of cells at a certain point in time, it is also called pseudotemporal analysis.
Methodologically, it is built on the premise that cells in the continuum share
many common genes and their gene expression displays gradual change. In
essence, to infer cellular trajectory is to find a path in the cellular gene expres-
sion space that connects cells of various transitional states by maximizing
similarity between neighboring cells. The inferred cellular trajectory can then
be validated with additional experimental evidence.
Trajectory inference is carried out on dimensionality-reduced data, often
after the clustering step. General methods, such as the minimum spanning tree
(MST), which connects all points (clustered cells) in a graph into a tree
that minimizes the total distance between points, can be directly used for TI [178].
Some of the methods specifically developed for TI are in fact based on MST.
For example, the first version of Monocle, a pioneering method for inferring tra-
jectory from single-cell sequencing data, first creates MST on cells projected
in a dimensionality-reduced space, and then places cells along the longest
path through the MST [92]. Slingshot [179], TSCAN [180], and Waterfall [181]
build MST on cell cluster centroids, instead of cells, and then order cells onto
the path through orthogonal projection. Besides these MST-based methods,
some other commonly used methods are based on graph theory. For example,
Diffusion Pseudotime (DPT) builds a weighted kNN graph on cells, and then
orders cells using a random-walk-based distance [182]. Also based on the use
of a weighted kNN graph, PAGA performs graph partitioning and abstraction
using the Louvain method to identify different cellular states or identities,
and uses an extension of DPT for pseudotime calculation [110] (Figure 8.9
shows an example). Monocle 3 is built on PAGA and adds one step further
to construct more fine-grained trajectory through learning a principal graph
from the PAGA graph [103].
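A sketch of graph-based trajectory inference with PAGA and diffusion pseudotime as implemented in Scanpy is shown below; the choice of the root cell, which anchors the start of the trajectory, is an assumption that must come from biological knowledge of the system.

    import scanpy as sc

    # Abstract the kNN graph into a coarse-grained graph of cluster connectivities
    sc.tl.paga(adata, groups="leiden")
    sc.pl.paga(adata)

    # Compute diffusion pseudotime, with a user-chosen root cell marking the
    # start of the trajectory (the index 0 used here is a hypothetical choice)
    adata.uns["iroot"] = 0
    sc.tl.diffmap(adata)
    sc.tl.dpt(adata)
    sc.pl.umap(adata, color=["leiden", "dpt_pseudotime"])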
The methods mentioned above and listed in Table 8.4 are among an increas-
ingly long list of TI methods available. Besides the different approaches these
methods use to infer trajectories, they also differ in what trajectory topology
they can infer, whether they require prior information, how scalable they
are with increasing cell numbers, etc. Cellular trajectory topologies can be
FIGURE 8.9
Cell trajectories inferred by PAGA to reconstruct a developmental lineage tree encompassing
all cell types in the planarian body based on single-cell transcriptomic data. (Adapted from FA
Wolf, FK Hamey, M Plass, J Solana, JS Dahlin, B Göttgens, N Rajewsky et al., PAGA: graph
abstraction reconciles clustering with trajectory inference through a topology preserving map of
single cells, Genome Biology 2019, 20(1):59. With permission.)
TABLE 8.4
Trajectory Inference Methods
Name Description Reference
tested methods, and (3) the user should choose and use a variety of methods
based on the expected trajectory topology [178]. Based on this comparison,
the top performing methods include PAGA, Slingshot, different versions of
Monocle, as well as generic methods such as MST.
After inference of cellular trajectory, the next questions to ask are what
genes are associated with cell lineage development and what key genes
underlie transitions between cellular states. As indicated in the last section,
methods for DE gene analysis for trajectories are still limited. Besides
Monocle 3 and tradeSeq as mentioned in the last section, other available tra-
jectory DE methods include those employed by TSCAN, GPfates [187], and
earlier versions of Monocle. Monocle 1 uses generalized additive models
to test whether genes significantly change their expression as a function of
pseudotime. TSCAN employs a similar approach. Monocle 2 uses a different
approach called BEAM (branch expression analysis modeling) to test whether
gene expression changes are associated with cell lineage branching along a
trajectory. GPfates models gene expression-dependent cell fates as temporal
mixtures of Gaussian processes. Similar to BEAM, it can identify gene expres-
sion changes associated with bifurcation points. Besides these tools that per-
form both trajectory inference and DE analysis, there are also tools that take
as input pseudotemporal ordering of cells inferred by the TI tools detailed
above, to conduct time-course DE analysis. LineagePulse, as one such
example, fits a ZINB noise model to gene expression data collected from
pseudotemporally ordered single cells [188].
Using static snapshots of cellular states, TI does not make predictions
on the speed or direction of cell progression along the trajectory. To make
such predictions, additional information is required. Change in cellular
mRNA abundance inferred from the same static snapshots by a strategy
called RNA velocity analysis [189] provides such information. The RNA
velocity strategy is based on the detection and comparison of unspliced,
immature transcripts that still contain introns, and spliced, mature
transcripts. In principle, this strategy is built on the premise that if the ratio
of unspliced to spliced mRNA molecules from a gene is higher than its
steady-state ratio (called positive velocity), expression of the gene is being
upregulated. Conversely, if the ratio of unspliced to spliced mRNA
abundance is lower than its steady-state ratio (i.e., negative velocity), it is
indicative of downregulation of the gene. Based on aggregation of RNA
velocities inferred across genes, this analysis then makes predictions on
the future state of each cell in terms of the speed and direction of their
movement along the trajectory. Currently available RNA velocity analysis
tools include VeloCyto [189] and scVelo [190]. These RNA velocity tools are
compatible with, and can be deployed alongside, pipeline toolkits such as Seurat and
Scanpy. Because it adds predictive information onto a trajectory about the
direction and speed of cellular movement, RNA velocity analysis is often
carried out in combination with TI.
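A sketch of RNA velocity analysis with scVelo, one of the tools mentioned above, assuming that spliced and unspliced counts have already been quantified into the AnnData object (e.g., by velocyto):

    import scvelo as scv

    # adata must contain "spliced" and "unspliced" count layers (assumed here)
    scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=2000)
    scv.pp.moments(adata, n_pcs=30, n_neighbors=30)

    # Estimate per-gene velocity from the deviation of the unspliced/spliced ratio
    # from its inferred steady state, then build the cell-to-cell velocity graph
    scv.tl.velocity(adata, mode="stochastic")
    scv.tl.velocity_graph(adata)

    # Project velocities as a stream plot onto an existing UMAP embedding
    scv.pl.velocity_embedding_stream(adata, basis="umap")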
8.8 Advanced Analyses
8.8.1 SNV/CNV Detection and Allele-Specific Expression Analysis
Besides transcriptomic profiles, scRNA- seq data also contains genotypic
information specific for each cell, including single nucleotide and structural
variants. Such additional information is especially helpful for studies that
involve genome instability, such as cancer or other diseases related to aging.
The genotypic information embedded in scRNA-seq data can help uncover
functional variants in individual cells, and may also inform their specific
gene expression pattern. Understandably, detection of these variants from
scRNA-seq data is limited to expressed regions that have enough sequen-
cing depth. While some single-cell sequencing platforms such as Smart-seq3
generate reads that cover the full length of transcripts, others such as 10×
Chromium focus on the 3’ or 5’ end of transcripts. RNA editing may also
add another layer of complication by revealing variants that may not be
present at the DNA level, but the occurrence of RNA editing is typically very
rare. To detect SNVs, methods developed for calling variants from bulk
RNA-seq data, such as MuTect2, Strelka2 [194], VarScan2, SAMtools [195],
Pysam [196], FreeBayes, and BamBam, can be used on scRNA-seq data. The
GATK RNA-seq short variant discovery best practices workflow, which
uses HaplotypeCaller for variant calling followed by variant filtering using
RNA-seq-specific settings, is among the most used [197]. Monovar, a method
developed for single-cell DNA sequencing data [198], can also be used for
calling SNVs from scRNA-seq data [199]. There are currently a number of
tools that have been developed for SNV detection from scRNA-seq data,
including SSrGE [200], Trinity CTAT [201], and cellsnp-lite [202]. Among
these methods, cellsnp-lite is a lightweight allelic reads pileup method with
minimum filtering that can be applied to both 10× Chromium and Smart-seq3
data. Because it uses parallel processing, it has improved running speed.
Benchmarking comparisons have shown that the performance of many current
tools depends on sequencing depth, genomic context (such as high GC con-
tent), functional region, variant allele frequency, and platform (10× has more
dropout events) [203]. It has also been shown that the main detection limi-
tation is low sensitivity caused by low capture efficiency, limited sequencing
depth, and signal dropout. Among the best performing tools so far are SAMtools,
FreeBayes, Strelka2, and CTAT.
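As a minimal illustration of allelic read counting from aligned scRNA-seq reads (a greatly simplified version of what pileup-based tools such as cellsnp-lite perform, without their filtering or per-cell-barcode bookkeeping), a pysam-based sketch might look as follows; the BAM path and SNV position are hypothetical.

    import pysam
    from collections import Counter

    bam = pysam.AlignmentFile("possorted_genome_bam.bam", "rb")  # hypothetical path

    # Count the bases observed at a single SNV position (0-based coordinate)
    chrom, pos = "chr1", 1_234_567
    allele_counts = Counter()
    for column in bam.pileup(chrom, pos, pos + 1, truncate=True, min_base_quality=20):
        for read in column.pileups:
            if read.is_del or read.is_refskip:
                continue  # skip deletions and spliced-over positions
            base = read.alignment.query_sequence[read.query_position]
            allele_counts[base] += 1

    # Real single-cell tools would additionally split counts by the cell barcode tag
    print(allele_counts)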
Deletion or duplication of a genomic region may lead to reduced or increased
expression of genes located in the affected region. It is possible, therefore, to
infer CNV information from scRNA-seq data. While it can be challenging
due to uneven coverage of scRNA-seq signal across the genome, inferred
CNVs do provide information on cellular heterogeneity at another dimen-
sion (genome instability), for instance, during cancer development [204].
To meet the challenges of calling CNVs from scRNA-seq data, a relatively
events through traversing the graph. Quantification of such events and iden-
tification of differential splicing between groups of cells are based on the
use of “percent-spliced-in” or Psi (ψ). As another example, DESJ-detection
first constructs a cell-splicing junction count matrix for each gene. Iterative
K-means is then used to cluster cells; after removing clusters with low expres-
sion, a list of solid junctions is generated. The identification of DESJs, or dif-
ferentially expressed splicing junctions, is achieved using limma. To help
visualize differential splicing patterns across cells, Millefy and VALERIE can
be used to uncover cellular heterogeneity and splicing differences between
various cell groups. VALERIE, for example, is an R-based tool for using Psi
values to display alternative splicing events. It can also be used to identify
significant splicing difference between different cell populations, through
performing statistical test, such as Kruskal–Wallis test, on the Psi values
followed by multiple testing correction. Most alternative splicing ana-
lysis tools use reads obtained from full-length scRNA-seq platforms such as
Smart-seq3. Reads derived from the 3’ or 5’ end of transcripts, such as those
generated from the 10× Chromium platform, cover a limited number of spli-
cing junctions at either end of genes. Despite this limitation, some tools such
as SCATS can use 10× scRNA-seq data for alternative splicing analysis.
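The Psi metric itself is straightforward to compute from junction read counts; the sketch below uses a simplified definition for a single exon-skipping event in one cell, with hypothetical counts and no junction-length normalization.

    def percent_spliced_in(inclusion_reads, exclusion_reads):
        """Simplified Psi for an exon: reads supporting inclusion junctions divided
        by all reads supporting either inclusion or exclusion of the exon."""
        total = inclusion_reads + exclusion_reads
        if total == 0:
            return float("nan")  # event not covered in this cell
        return inclusion_reads / total

    # Hypothetical junction counts for one cell
    print(percent_spliced_in(inclusion_reads=18, exclusion_reads=6))  # 0.75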
with transcription factors with the use of GENIE3. To trim the large number
of edges that represent potential gene-gene interactions to a shorter list of
high-confidence edges, SCENIC performs a transcription factor-binding
motif enrichment analysis to identify putative target genes. The output from
SCENIC can then be imported into visualization tools such as SCope for net-
work visualization. Methods such as LEAP and SCIMITAR take into con-
sideration effects of developmental stages on gene networking through
incorporating pseudo-temporal information from trajectory inference and
velocity analysis (Section 8.7). The pseudo-temporal ordering of cells helps
establish directionality between an upstream gene and a downstream effector.
For these methods, gene correlation is first calculated for each time window,
and the multiple correlation matrices are then aggregated into one adjacency
matrix to represent the overall gene-gene interactions. Other methods such as
SCODE, SCOUP, and GRISLI use a similar approach with the application of
pseudo-temporal information, but they use differential equations to estimate
gene correlation and infer gene relationships. For example, SCODE relies on
Monocle to provide pseudotime information, and uses ordinary differential
equations to calculate gene correlation. There are also methods based on other
approaches, such as SCNS [231] and BTR [232], that use Boolean models. With
such models, 0 or 1 represents deactivated or activated gene expression, and
the Boolean operations AND, OR, and NOT are used to capture relationships
between genes. Boolean models provide a simplified representation of the
cell system through converting gene expression data into binary data, but
this also leads to loss of gene-gene interaction information.
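As a generic illustration of the time-window correlation aggregation described above (not the implementation of LEAP or SCIMITAR), the sketch below computes gene-gene correlation matrices within consecutive pseudotime windows and averages them into one adjacency matrix; the data are simulated.

    import numpy as np

    def windowed_coexpression(expr, pseudotime, n_windows=5):
        """Illustrative gene-gene co-expression aggregation along pseudotime.

        expr: cells x genes matrix of normalized expression
        pseudotime: per-cell pseudotime values from trajectory inference
        Returns a genes x genes adjacency matrix averaging the absolute Pearson
        correlations computed within consecutive pseudotime windows.
        """
        order = np.argsort(pseudotime)
        windows = np.array_split(order, n_windows)
        n_genes = expr.shape[1]
        adjacency = np.zeros((n_genes, n_genes))
        for cells in windows:
            corr = np.corrcoef(expr[cells].T)   # gene-gene correlation in this window
            adjacency += np.abs(np.nan_to_num(corr))
        return adjacency / n_windows            # diagonal entries can be ignored

    # Simulated data: 300 cells, 50 genes, random pseudotime
    rng = np.random.default_rng(1)
    expr = rng.gamma(2.0, 1.0, size=(300, 50))
    pt = rng.uniform(0, 1, size=300)
    adj = windowed_coexpression(expr, pt)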
To pick an appropriate GRN inference method, besides having knowledge
of how the methods are designed, it also helps to understand whether prior infor-
mation is required and what the basic characteristics of the cells under
study are. Some of the methods require prior information, in the form of pseudo-
temporal ordering of cells or cell types, which can be revealed with trajec-
tory inference. Such methods are more suitable for cells that are in different
developmental stages. If the objective is to compare cellular composition or
heterogeneity under different conditions (e.g., healthy vs. diseased), methods
that employ static data are more appropriate. To provide systematic guidance
for the selection of appropriate GRN inference methods, results from several
benchmark studies [233–236] of currently available GRN infer-
ence methods are available. Overall, these studies showed underperformance of current
methods and called for development of better-designed tools. For example, cur-
rently inferred networks still show poor agreement with ground truth. In the
meantime, these studies also revealed the challenges of inferring GRN from
scRNA-seq data, which can be technical (due mostly to signal heterogeneity
and sparsity), biological (e.g., complex nature of molecular interactions), or
computational (e.g., complexity of analysis). Validation of an inferred GRN
is also a very challenging task. Most of the currently available methods still
output one gene network for all cells, or for specific cell types, not individual
cells. Some newer methods such as CSN [237] and c-CSN [238] allow building
of cell-specific networks, i.e., one network per cell.
Despite the current challenges, the value of GRN analysis cannot be
overemphasized. In-depth analysis of a GRN enables detection of network
modules and key nodes (hub genes). A module refers to a group of genes
that are highly connected to fulfill a cellular function. The overall topology
of a module, or key nodes linking different modules, might change with
development or cell differentiation, or differ under different conditions.
Differential network analysis may reveal altered gene-gene interactions
between conditions. These network analyses can be carried out using tools
such as WGCNA [239].
References
1. Cui Y, Irudayaraj J. Inside single cells: quantitative analysis with advanced
optics and nanomaterials. Wiley Interdiscip Rev Nanomed Nanobiotechnol 2015,
7(3):387–407.
2. Huang XT, Li X, Qin PZ, Zhu Y, Xu SN, Chen JP. Technical advances in single-
cell RNA sequencing and applications in normal and malignant hematopoi-
esis. Front Oncol 2018, 8:582.
3. Shalek AK, Satija R, Shuga J, Trombetta JJ, Gennert D, Lu D, Chen P, Gertner
RS, Gaublomme JT, Yosef N et al. Single-cell RNA-seq reveals dynamic para-
crine control of cellular variation. Nature 2014, 510(7505):363–369.
4. Marinov GK, Williams BA, McCue K, Schroth GP, Gertz J, Myers RM, Wold BJ.
From single-cell to cell-pool transcriptomes: stochasticity in gene expression
and RNA splicing. Genome Res 2014, 24(3):496–510.
5. Zhang M, Zou Y, Xu X, Zhang X, Gao M, Song J, Huang P, Chen Q, Zhu Z, Lin
W et al. Highly parallel and efficient single cell mRNA sequencing with paired
picoliter chambers. Nat Commun 2020, 11(1):2118.
6. Hagemann-Jensen M, Ziegenhain C, Chen P, Ramskold D, Hendriks GJ,
Larsson AJM, Faridani OR, Sandberg R. Single-cell RNA counting at allele and
isoform resolution using Smart-seq3. Nat Biotechnol 2020, 38(6):708–714.
7. Macosko EZ, Basu A, Satija R, Nemesh J, Shekhar K, Goldman M, Tirosh I,
Bialas AR, Kamitaki N, Martersteck EM et al. Highly Parallel Genome-wide
Expression Profiling of Individual Cells Using Nanoliter Droplets. Cell 2015,
161(5):1202–1214.
8. Klein AM, Mazutis L, Akartuna I, Tallapragada N, Veres A, Li V, Peshkin L,
Weitz DA, Kirschner MW. Droplet barcoding for single-cell transcriptomics
applied to embryonic stem cells. Cell 2015, 161(5):1187–1201.
9. Cao J, Packer JS, Ramani V, Cusanovich DA, Huynh C, Daza R, Qiu X, Lee C,
Furlan SN, Steemers FJ et al. Comprehensive single-cell transcriptional pro-
filing of a multicellular organism. Science 2017, 357(6352):661–667.
10. Zheng GX, Terry JM, Belgrader P, Ryvkin P, Bent ZW, Wilson R, Ziraldo SB,
Wheeler TD, McDermott GP, Zhu J et al. Massively parallel digital transcrip-
tional profiling of single cells. Nat Commun 2017, 8:14049.
11. Zhang X, Li T, Liu F, Chen Y, Yao J, Li Z, Huang Y, Wang J. Comparative ana-
lysis of droplet-based ultra-high-throughput single-cell RNA-seq systems.
Mol Cell 2019, 73(1):130–142 e135.
12. Ding J, Adiconis X, Simmons SK, Kowalczyk MS, Hession CC, Marjanovic
ND, Hughes TK, Wadsworth MH, Burks T, Nguyen LT et al. Systematic com-
parison of single-cell and single-nucleus RNA-sequencing methods. Nat
Biotechnol 2020, 38(6):737–746.
13. How many Cells (https://satijalab.org/howmanycells)
14. Svensson V, da Veiga Beltrame E, Pachter L. Quantifying the tradeoff between
sequencing depth and cell number in single- cell RNA- seq. bioRxiv 2019,
doi: https://doi.org/10.1101/762773
15. Zhang MJ, Ntranos V, Tse D. Determining sequencing depth in a single-cell
RNA-seq experiment. Nat Commun 2020, 11(1):774.
16. Parekh S, Ziegenhain C, Vieth B, Enard W, Hellmann I. zUMIs – A fast and
flexible pipeline to process RNA sequencing data with UMIs. GigaScience 2018,
7(6):giy059.
17. Heimberg G, Bhatnagar R, El-Samad H, Thomson M. Low dimensionality
in gene expression data enables the accurate extraction of transcriptional
programs from shallow sequencing. Cell Syst 2016, 2(4):239–250.
18. Wu AR, Neff NF, Kalisky T, Dalerba P, Treutlein B, Rothenberg ME, Mburu
FM, Mantalas GL, Sim S, Clarke MF et al. Quantitative assessment of single-
cell RNA-sequencing methods. Nat Methods 2014, 11(1):41–46.
19. Svensson V, Natarajan KN, Ly LH, Miragaia RJ, Labalette C, Macaulay IC,
Cvejic A, Teichmann SA. Power analysis of single-cell RNA-sequencing
experiments. Nat Methods 2017, 14(4):381–387.
20. 10× Genomics. Technical Note – Removal of Dead Cells from Single Cell Suspensions Improves Performance for 10× Genomics® Single Cell Applications. 2017.
21. van den Brink SC, Sage F, Vertesy A, Spanjaard B, Peterson-Maduro J, Baron
CS, Robin C, van Oudenaarden A. Single-cell sequencing reveals dissociation-
induced gene expression in tissue subpopulations. Nat Methods 2017,
14(10):935–936.
22. Wohnhaas CT, Leparc GG, Fernandez-Albert F, Kind D, Gantner F, Viollet
C, Hildebrandt T, Baum P. DMSO cryopreservation is the method of choice
to preserve cells for droplet-based single-cell RNA sequencing. Sci Rep 2019,
9(1):10699.
23. Cha J, Lee I. Single-cell network biology for resolving cellular heterogeneity in
human diseases. Exp Mol Med 2020, 52(11):1798–1808.
24. Korrapati S, Taukulis I, Olszewski R, Pyle M, Gu S, Singh R, Griffiths C, Martin
D, Boger E, Morell RJ et al. Single cell and single nucleus RNA-seq reveal cel-
lular heterogeneity and homeostatic regulatory networks in adult mouse stria
vascularis. Front Mol Neurosci 2019, 12:316.
25. Gao R, Kim C, Sei E, Foukakis T, Crosetto N, Chan LK, Srinivasan M, Zhang
H, Meric-Bernstam F, Navin N. Nanogrid single-nucleus RNA sequencing
reveals phenotypic diversity in breast cancer. Nat Commun 2017, 8(1):228.
44. Heiser CN, Wang VM, Chen B, Hughey JJ, Lau KS. Automated quality con-
trol and cell identification of droplet-based single-cell data using dropkick.
Genome Res 2021, 31(10):1742–1752 .
45. Ni Z, Chen S, Brown J, Kendziorski C. CB2 improves power of cell detec-
tion in droplet-based single-cell RNA sequencing data. Genome Biol 2020,
21(1):137.
46. Yang S, Corbett SE, Koga Y, Wang Z, Johnson WE, Yajima M, Campbell JD.
Decontamination of ambient RNA in single-cell RNA-seq with DecontX.
Genome Biol 2020, 21(1):57.
47. Young MD, Behjati S. SoupX removes ambient RNA contamination from
droplet-based single-cell RNA sequencing data. Gigascience 2020, 9(12):giaa151.
48. Osorio D, Cai JJ. Systematic determination of the mitochondrial proportion in
human and mice tissues for single-cell RNA sequencing data quality control.
Bioinformatics 2020, 37(7):963–967.
49. Bacher R, Chu LF, Leng N, Gasch AP, Thomson JA, Stewart RM, Newton M,
Kendziorski C. SCnorm: robust normalization of single-cell RNA-seq data.
Nat Methods 2017, 14(6):584–586.
50. Yip SH, Wang P, Kocher JA, Sham PC, Wang J. Linnorm: improved statis-
tical analysis for single cell RNA-seq expression data. Nucleic Acids Res 2017,
45(22):e179.
51. Vallejos CA, Marioni JC, Richardson S. BASiCS: Bayesian Analysis of Single-
Cell Sequencing Data. PLoS Comput Biol 2015, 11(6):e1004333.
52. Qiu X, Hill A, Packer J, Lin D, Ma YA, Trapnell C. Single-cell mRNA
quantification and differential analysis with Census. Nat Methods 2017,
14(3):309–315.
53. Risso D, Perraudeau F, Gribkova S, Dudoit S, Vert JP. A general and flexible
method for signal extraction from single-cell RNA-seq data. Nat Commun
2018, 9(1):284.
54. Hafemeister C, Satija R. Normalization and variance stabilization of single-
cell RNA-seq data using regularized negative binomial regression. Genome
Biol 2019, 20(1):296.
55. Butler A, Hoffman P, Smibert P, Papalexi E, Satija R. Integrating single-cell
transcriptomic data across different conditions, technologies, and species. Nat
Biotechnol 2018, 36(5):411–420.
56. Wolf FA, Angerer P, Theis FJ. SCANPY: large-scale single-cell gene expression
data analysis. Genome Biol 2018, 19(1):15.
57. Lun AT, Bach K, Marioni JC. Pooling across cells to normalize single-cell RNA
sequencing data with many zero counts. Genome Biol 2016, 17:75.
58. Lopez R, Regier J, Cole MB, Jordan MI, Yosef N. Deep generative modeling for
single-cell transcriptomics. Nat Methods 2018, 15(12):1053–1058.
59. Lytal N, Ran D, An L. Normalization methods on single-cell RNA-seq data: an
empirical survey. Front Genet 2020, 11:41.
60. Cole MB, Risso D, Wagner A, DeTomaso D, Ngai J, Purdom E, Dudoit S, Yosef
N. Performance Assessment and Selection of Normalization Procedures for
Single-Cell RNA-Seq. Cell Syst 2019, 8(4):315–328 e318.
61. Buttner M, Miao Z, Wolf FA, Teichmann SA, Theis FJ. A test metric for assessing
single-cell RNA-seq batch correction. Nat Methods 2019, 16(1):43–49.
62. Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expres-
sion data using empirical Bayes methods. Biostatistics 2007, 8(1):118–127.
63. Haghverdi L, Lun ATL, Morgan MD, Marioni JC. Batch effects in single-cell
RNA-sequencing data are corrected by matching mutual nearest neighbors.
Nat Biotechnol 2018, 36(5):421–427.
64. Stuart T, Satija R. Integrative single-cell analysis. Nat Rev Genet 2019,
20(5):257–272.
65. Welch JD, Kozareva V, Ferreira A, Vanderburg C, Martin C, Macosko EZ.
Single-cell multi-omic integration compares and contrasts features of brain
cell identity. Cell 2019, 177(7):1873–1887 e1817.
66. Hie B, Bryson B, Berger B. Efficient integration of heterogeneous single-cell
transcriptomes using Scanorama. Nat Biotechnol 2019, 37(6):685–691.
67. Polanski K, Young MD, Miao Z, Meyer KB, Teichmann SA, Park JE.
BBKNN: fast batch alignment of single cell transcriptomes. Bioinformatics
2020, 36(3):964–965.
68. Korsunsky I, Millard N, Fan J, Slowikowski K, Zhang F, Wei K, Baglaenko Y,
Brenner M, Loh PR, Raychaudhuri S. Fast, sensitive and accurate integration
of single-cell data with Harmony. Nat Methods 2019, 16(12):1289–1296.
69. Barkas N, Petukhov V, Nikolaeva D, Lozinsky Y, Demharter S, Khodosevich K,
Kharchenko PV. Joint analysis of heterogeneous single-cell RNA-seq dataset
collections. Nat Methods 2019, 16(8):695–698.
70. Tran HTN, Ang KS, Chevrier M, Zhang X, Lee NYS, Goh M, Chen J. A bench-
mark of batch-effect correction methods for single-cell RNA sequencing data.
Genome Biol 2020, 21(1):12.
71. Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation
of cluster analysis. J Comput Appl Math 1987, 20:53–65.
72. van Dijk D, Sharma R, Nainys J, Yim K, Kathail P, Carr AJ, Burdziak C, Moon
KR, Chaffer CL, Pattabiraman D et al. Recovering gene interactions from
single-cell data using data diffusion. Cell 2018, 174(3):716–729 e727.
73. Wagner F, Yan Y, Yanai I. K-nearest neighbor smoothing for high-throughput
single-cell RNA-Seq data. bioRxiv 2018, doi: https://doi.org/10.1101/217737
74. Huang M, Wang J, Torre E, Dueck H, Shaffer S, Bonasio R, Murray JI, Raj A, Li
M, Zhang NR. SAVER: gene expression recovery for single-cell RNA sequen-
cing. Nat Methods 2018, 15(7):539–542.
75. Wang J, Agarwal D, Huang M, Hu G, Zhou Z, Ye C, Zhang NR. Data denoising
with transfer learning in single-cell transcriptomics. Nat Methods 2019,
16(9):875–878.
76. Linderman GC, Zhao J, Roulis M, Bielecki P, Flavell RA, Nadler B, Kluger Y.
Zero-preserving imputation of single-cell RNA-seq data. Nat Commun 2022,
13(1):192.
77. Lin PJ, Troup M, Ho JWK. CIDR: Ultrafast and accurate clustering through
imputation for single-cell RNA-seq data. Genome Biol 2017, 18(1):59.
78. Eraslan G, Simon LM, Mircea M, Mueller NS, Theis FJ. Single-cell RNA-seq
denoising using a deep count autoencoder. Nat Commun 2019, 10(1):390.
79. Li WV, Li JJ. An accurate and robust imputation method scImpute for single-
cell RNA-seq data. Nat Commun 2018, 9(1):997.
80. Mongia A, Sengupta D, Majumdar A. McImpute: matrix completion based
imputation for single cell RNA-seq data. Front Genet 2019, 10:9.
81. Gong W, Kwak IY, Pota P, Koyano-Nakagawa N, Garry DJ. DrImpute:
imputing dropout events in single cell RNA sequencing data. BMC
Bioinformatics 2018, 19(1):220.
82. Lahnemann D, Koster J, Szczurek E, McCarthy DJ, Hicks SC, Robinson MD,
Vallejos CA, Campbell KR, Beerenwinkel N, Mahfouz A et al. Eleven grand
challenges in single-cell data science. Genome Biol 2020, 21(1):31.
83. Andrews TS, Hemberg M. False signals induced by single-cell imputation.
F1000Research 2018, 7:1740.
84. Hou W, Ji Z, Ji H, Hicks SC. A systematic evaluation of single-cell RNA-
sequencing imputation methods. Genome Biol 2020, 21(1):218.
85. Li Y, Willer C, Sanna S, Abecasis G. Genotype imputation. Annu Rev Genomics
Hum Genet 2009, 10:387–406.
86. Brennecke P, Anders S, Kim JK, Kolodziejczyk AA, Zhang X, Proserpio V,
Baying B, Benes V, Teichmann SA, Marioni JC et al. Accounting for technical
noise in single-cell RNA-seq experiments. Nat Methods 2013, 10(11):1093–1095.
87. Buettner F, Natarajan KN, Casale FP, Proserpio V, Scialdone A, Theis FJ,
Teichmann SA, Marioni JC, Stegle O. Computational analysis of cell-to-cell het-
erogeneity in single-cell RNA-sequencing data reveals hidden subpopulations
of cells. Nat Biotechnol 2015, 33(2):155–160.
88. Yip SH, Sham PC, Wang J. Evaluation of tools for highly variable gene dis-
covery from single-cell RNA-seq data. Brief Bioinform 2019, 20(4):1583–1589.
89. Duo A, Robinson MD, Soneson C. A systematic performance evaluation of
clustering methods for single-cell RNA-seq data. F1000Research 2018, 7:1141.
90. Townes FW, Hicks SC, Aryee MJ, Irizarry RA. Feature selection and dimen-
sion reduction for single-cell RNA-Seq based on a multinomial model. Genome
Biol 2019, 20(1):295.
91. Andrews TS, Hemberg M. M3Drop: dropout-based feature selection for
scRNASeq. Bioinformatics 2019, 35(16):2865–2867.
92. Trapnell C, Cacchiarelli D, Grimsby J, Pokharel P, Li S, Morse M, Lennon
NJ, Livak KJ, Mikkelsen TS, Rinn JL. The dynamics and regulators of cell
fate decisions are revealed by pseudotemporal ordering of single cells. Nat
Biotechnol 2014, 32(4):381–386.
93. Shao C, Hofer T. Robust classification of single-cell transcriptome data by non-
negative matrix factorization. Bioinformatics 2017, 33(2):235–242.
94. Pierson E, Yau C. ZIFA: Dimensionality reduction for zero-inflated single-cell
gene expression analysis. Genome Biol 2015, 16:241.
95. Buettner F, Pratanwanich N, McCarthy DJ, Marioni JC, Stegle O. f-scLVM: scal-
able and versatile factor analysis for single-cell RNA-seq. Genome Biol 2017,
18(1):212.
96. Mahfouz A, van de Giessen M, van der Maaten L, Huisman S, Reinders M,
Hawrylycz MJ, Lelieveldt BP. Visualizing the spatial gene expression organ-
ization in the brain through non-linear similarity embeddings. Methods 2015,
73:79–89.
97. Senabouth A, Lukowski SW, Hernandez JA, Andersen SB, Mei X, Nguyen
QH, Powell JE. ascend: R package for analysis of single-cell RNA-seq data.
GigaScience 2019, 8(8):giz087.
98. Sun S, Zhu J, Ma Y, Zhou X. Accuracy, robustness and scalability of
dimensionality reduction methods for single-cell RNA-seq analysis. Genome
Biol 2019, 20(1):269.
99. Welch JD, Hartemink AJ, Prins JF. SLICER: inferring branched, nonlinear cel-
lular trajectories from single cell RNA-seq data. Genome Biol 2016, 17(1):106.
100. Haghverdi L, Buettner F, Theis FJ. Diffusion maps for high-dimensional single-
cell analysis of differentiation data. Bioinformatics 2015, 31(18):2989–2998.
101. Sun X, Liu Y, An L. Ensemble dimensionality reduction and feature gene
extraction for single-cell RNA-seq data. Nat Commun 2020, 11(1):5853.
102. Becht E, McInnes L, Healy J, Dutertre CA, Kwok IWH, Ng LG, Ginhoux F,
Newell EW. Dimensionality reduction for visualizing single-cell data using
UMAP. Nat Biotechnol 2018, 37(1):38–44.
103. Cao J, Spielmann M, Qiu X, Huang X, Ibrahim DM, Hill AJ, Zhang F, Mundlos
S, Christiansen L, Steemers FJ et al. The single-cell transcriptional landscape
of mammalian organogenesis. Nature 2019, 566(7745):496–502.
104. Kobak D, Berens P. The art of using t-SNE for single-cell transcriptomics. Nat
Commun 2019, 10(1):5416.
105. Hu Q, Greene CS. Parameter tuning is a key part of dimensionality reduction
via deep variational autoencoders for single cell RNA transcriptomics. Pac
Symp Biocomput 2019, 24:362–373.
106. Ding J, Condon A, Shah SP. Interpretable dimensionality reduction of single
cell transcriptome data with deep generative models. Nat Commun 2018,
9(1):2002.
107. Deng Y, Bao F, Dai QH, Wu LF, Altschuler SJ. Scalable analysis of cell-type
composition from single-cell transcriptomics using deep recurrent learning.
Nature Methods 2019, 16(4):311.
108. Wang B, Zhu J, Pierson E, Ramazzotti D, Batzoglou S. Visualization and ana-
lysis of single-cell RNA-seq data by kernel-based similarity learning. Nat
Methods 2017, 14(4):414–416.
109. Xiang R, Wang W, Yang L, Wang S, Xu C, Chen X. A comparison for
dimensionality reduction methods of single-cell RNA-seq data. Front Genet
2021, 12:646936.
110. Wolf FA, Hamey FK, Plass M, Solana J, Dahlin JS, Gottgens B, Rajewsky N,
Simon L, Theis FJ. PAGA: graph abstraction reconciles clustering with trajec-
tory inference through a topology preserving map of single cells. Genome Biol
2019, 20(1):59.
111. Moon KR, van Dijk D, Wang Z, Gigante S, Burkhardt DB, Chen WS, Yim K,
Elzen AVD, Hirn MJ, Coifman RR et al. Visualizing structure and transitions
in high-dimensional biological data. Nat Biotechnol 2019, 37(12):1482–1492.
112. Anchang B, Hart TD, Bendall SC, Qiu P, Bjornson Z, Linderman M, Nolan
GP, Plevritis SK. Visualization and cellular hierarchy inference of single-cell
data using SPADE. Nat Protoc 2016, 11(7):1264–1279.
113. Weinreb C, Wolock S, Klein AM. SPRING: a kinetic interface for visual-
izing high dimensional single-cell expression data. Bioinformatics 2018,
34(7):1246–1248.
114. Kim T, Chen IR, Lin Y, Wang AY, Yang JYH, Yang P. Impact of similarity
metrics on single-cell RNA-seq data clustering. Brief Bioinform 2019,
20(6):2316–2326.
115. Moussa M, Mandoiu II. Single cell RNA-seq data clustering using TF-IDF
based methods. BMC Genomics 2018, 19(Suppl 6):569.
116. Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E. Fast unfolding of com-
munities in large networks. J Stat Mech-Theory E 2008, doi:10.1088/1742-
5468/2008/10/P10008
117. Traag VA, Waltman L, van Eck NJ. From Louvain to Leiden: guaranteeing
well-connected communities. Sci Rep 2019, 9(1):5233.
118. Kiselev VY, Kirschner K, Schaub MT, Andrews T, Yiu A, Chandra T, Natarajan
KN, Reik W, Barahona M, Green AR et al. SC3: consensus clustering of single-
cell RNA-seq data. Nat Methods 2017, 14(5):483–486.
119. Zurauskiene J, Yau C. pcaReduce: hierarchical clustering of single cell tran-
scriptional profiles. BMC Bioinformatics 2016, 17:140.
120. Grun D, Lyubimova A, Kester L, Wiebrands K, Basak O, Sasaki N, Clevers
H, van Oudenaarden A. Single-cell messenger RNA sequencing reveals rare
intestinal cell types. Nature 2015, 525(7568):251–255.
121. Xu C, Su Z. Identification of cell types from single-cell transcriptomes using
a novel clustering method. Bioinformatics 2015, 31(12):1974–1980.
122. Lake BB, Chen S, Sos BC, Fan J, Kaeser GE, Yung YC, Duong TE, Gao D,
Chun J, Kharchenko PV et al. Integrative single-cell analysis of transcrip-
tional and epigenetic states in the human adult brain. Nat Biotechnol 2018,
36(1):70–80.
123. Guo M, Wang H, Potter SS, Whitsett JA, Xu Y. SINCERA: a pipeline for single-
cell RNA-seq profiling analysis. PLoS Comput Biol 2015, 11(11):e1004575.
124. Freytag S, Tian L, Lonnstedt I, Ng M, Bahlo M. Comparison of clustering
tools in R for medium-sized 10× Genomics single-cell RNA-sequencing data.
F1000Research 2018, 7:1297.
125. Yu D, Huber W, Vitek O. Shrinkage estimation of dispersion in negative bino-
mial models for RNA-seq experiments with small sample size. Bioinformatics
2013, 29(10):1275–1282.
126. Pliner HA, Shendure J, Trapnell C. Supervised classification enables rapid
annotation of cell atlases. Nat Methods 2019, 16(10):983–986.
127. Domanskyi S, Szedlak A, Hawkins NT, Wang J, Paternostro G, Piermarocchi
C. Polled Digital Cell Sorter (p-DCS): Automatic identification of hema-
tological cell types from single cell RNA-sequencing clusters. BMC
Bioinformatics 2019, 20(1):369.
128. Zhang Z, Luo D, Zhong X, Choi JH, Ma Y, Wang S, Mahrt E, Guo W, Stawiski
EW, Modrusan Z et al. SCINA: A semi-supervised subtyping algorithm of
single cells and bulk samples. Genes 2019, 10(7):531.
129. Zhang AW, O’Flanagan C, Chavez EA, Lim JLP, Ceglia N, McPherson
A, Wiens M, Walters P, Chan T, Hewitson B et al. Probabilistic cell-type
assignment of single-cell RNA-seq for tumor microenvironment profiling.
Nat Methods 2019, 16(10):1007–1015.
130. Xu C, Lopez R, Mehlman E, Regier J, Jordan MI, Yosef N. Probabilistic har-
monization and annotation of single-cell transcriptomics data with deep
generative models. Mol Syst Biol 2021, 17(1):e9620.
131. Franzen O, Gan LM, Bjorkegren JLM. PanglaoDB: a web server for explor-
ation of mouse and human single-cell RNA sequencing data. Database
(Oxford) 2019, 2019.
132. Zhang X, Lan Y, Xu J, Quan F, Zhao E, Deng C, Luo T, Xu L, Liao G, Yan M
et al. CellMarker: a manually curated resource of cell markers in human and
mouse. Nucleic Acids Res 2019, 47(D1):D721–D728.
133. Zeisel A, Hochgerner H, Lonnerberg P, Johnsson A, Memic F, van der Zwan
J, Haring M, Braun E, Borm LE, La Manno G et al. Molecular architecture of
the mouse nervous system. Cell 2018, 174(4):999–1014 e1022.
134. Ecker JR, Geschwind DH, Kriegstein AR, Ngai J, Osten P, Polioudakis D,
Regev A, Sestan N, Wickersham IR, Zeng H. The BRAIN Initiative Cell
Census Consortium: Lessons Learned toward Generating a Comprehensive
Brain Cell Atlas. Neuron 2017, 96(3):542–557.
135. Saunders A, Macosko EZ, Wysoker A, Goldman M, Krienen FM, de
Rivera H, Bien E, Baum M, Bortolin L, Wang S et al. Molecular diversity
and specializations among the cells of the adult mouse brain. Cell 2018,
174(4):1015–1030 e1016.
136. Regev A, Teichmann SA, Lander ES, Amit I, Benoist C, Birney E, Bodenmiller
B, Campbell P, Carninci P, Clatworthy M et al. The Human Cell Atlas. Elife
2017, 6:e27041.
137. The Tabula Muris Consortium. Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris. Nature 2018, 562(7727):367–372.
138. Kiselev VY, Yiu A, Hemberg M. scmap: projection of single-cell RNA-seq
data across data sets. Nat Methods 2018, 15(5):359–362.
139. Tan Y, Cahan P. SingleCellNet: a computational tool to classify single cell
RNA-seq data across platforms and across species. Cell Syst 2019, 9(2):207–
213 e202.
140. Ma F, Pellegrini M. ACTINN: automated identification of cell types in single
cell RNA sequencing. Bioinformatics 2020, 36(2):533–538.
141. Alquicira-Hernandez J, Sathe A, Ji HP, Nguyen Q, Powell JE. scPred: accurate
supervised method for cell-type classification from single-cell RNA-seq data.
Genome Biol 2019, 20(1):264.
142. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel
M, Prettenhofer P, Weiss R, Dubourg V. Scikit-learn: Machine learning in
Python. J Mach Learn Res 2011, 12:2825–2830.
143. Bakken T, Cowell L, Aevermann BD, Novotny M, Hodge R, Miller JA, Lee A,
Chang I, McCorrison J, Pulendran B et al. Cell type discovery and represen-
tation in the era of high-content single cell phenotyping. BMC Bioinformatics
2017, 18(Suppl 17):559.
144. Wang S, Pisco AO, McGeever A, Brbic M, Zitnik M, Darmanis S, Leskovec J,
Karkanias J, Altman RB. Leveraging the cell ontology to classify unseen cell
types. Nat Commun 2021, 12(1):5556.
145. Bernstein MN, Ma Z, Gleicher M, Dewey CN. CellO: comprehensive and
hierarchical cell type classification of human cells with the Cell Ontology.
iScience 2021, 24(1):101913.
146. Abdelaal T, Michielsen L, Cats D, Hoogduin D, Mei H, Reinders MJT,
Mahfouz A. A comparison of automatic cell identification methods for
single-cell RNA sequencing data. Genome Biol 2019, 20(1):194.
147. Huang Q, Liu Y, Du Y, Garmire LX. Evaluation of cell type Annotation R
Packages on Single-cell RNA-seq Data. Genomics Proteomics Bioinformatics
2021, 19(2):267–281.
148. Aran D, Looney AP, Liu L, Wu E, Fong V, Hsu A, Chak S, Naikawadi RP,
Wolters PJ, Abate AR et al. Reference-based analysis of lung single-cell
sequencing reveals a transitional profibrotic macrophage. Nat Immunol
2019, 20(2):163–172.
182. Haghverdi L, Buttner M, Wolf FA, Buettner F, Theis FJ. Diffusion pseudotime
robustly reconstructs lineage branching. Nat Methods 2016, 13(10):845–848.
183. Rashid S, Kotton DN, Bar-Joseph Z. TASIC: determining branching models
from time series single cell data. Bioinformatics 2017, 33(16):2504–2512.
184. Schiebinger G, Shu J, Tabaka M, Cleary B, Subramanian V, Solomon A, Gould
J, Liu S, Lin S, Berube P et al. Optimal-Transport Analysis of Single-Cell Gene
Expression identifies developmental trajectories in reprogramming. Cell
2019, 176(4):928–943 e922.
185. Lin C, Bar-Joseph Z. Continuous-state HMMs for modeling time-series
single-cell RNA-Seq data. Bioinformatics 2019, 35(22):4707–4715.
186. Tran TN, Bader GD. Tempora: cell trajectory inference using time-series
single-cell RNA sequencing data. PLoS Comput Biol 2020, 16(9):e1008205.
187. Lonnberg T, Svensson V, James KR, Fernandez-Ruiz D, Sebina I, Montandon
R, Soon MS, Fogg LG, Nair AS, Liligeto U et al. Single-cell RNA-seq and com-
putational analysis using temporal mixture modelling resolves Th1/Tfh fate
bifurcation in malaria. Sci Immunol 2017, 2(9):eaal2192.
188. LineagePulse (https://github.com/YosefLab/LineagePulse)
189. La Manno G, Soldatov R, Zeisel A, Braun E, Hochgerner H, Petukhov V,
Lidschreiber K, Kastriti ME, Lonnerberg P, Furlan A et al. RNA velocity of
single cells. Nature 2018, 560(7719):494–498.
190. Bergen V, Lange M, Peidli S, Wolf FA, Theis FJ. Generalizing RNA velocity
to transient cell states through dynamical modeling. Nat Biotechnol 2020,
38(12):1408–1414.
191. Herman JS, Sagar, Grun D. FateID infers cell fate bias in multipotent
progenitors from single-cell RNA-seq data. Nat Methods 2018,
15(5):379–386.
192. Bendall SC, Davis KL, Amir el AD, Tadmor MD, Simonds EF, Chen TJ,
Shenfeld DK, Nolan GP, Pe’er D. Single-cell trajectory detection uncovers
progression and regulatory coordination in human B cell development.
Cell 2014, 157(3):714–725.
193. Setty M, Tadmor MD, Reich-Zeliger S, Angel O, Salame TM, Kathail P,
Choi K, Bendall S, Friedman N, Pe’er D. Wishbone identifies bifurcating
developmental trajectories from single- cell data. Nat Biotechnol 2016,
34(6):637–645.
194. Kim S, Scheffler K, Halpern AL, Bekritsky MA, Noh E, Kallberg M, Chen
X, Kim Y, Beyter D, Krusche P et al. Strelka2: fast and accurate calling of
germline and somatic variants. Nat Methods 2018, 15(8):591–594.
195. Rodriguez-Meira A, Buck G, Clark SA, Povinelli BJ, Alcolea V, Louka E,
McGowan S, Hamblin A, Sousos N, Barkas N et al. Unravelling Intratumoral Heterogeneity through High-Sensitivity Single-Cell Mutational Analysis and Parallel RNA Sequencing. Mol Cell 2019, 73(6):1292–1305 e1298.
196. Pysam (https://github.com/pysam-developers/pysam)
197. Fasterius E, Uhlen M, Al-Khalili Szigyarto C. Single-cell RNA-seq variant
analysis for exploration of genetic heterogeneity in cancer. Sci Rep 2019,
9(1):9524.
198. Zafar H, Wang Y, Nakhleh L, Navin N, Chen K. Monovar: single-nucleotide
variant detection in single cells. Nat Methods 2016, 13(6):505–507.
199. Schnepp PM, Chen M, Keller ET, Zhou X. SNV identification from single-cell
RNA sequencing data. Hum Mol Genet 2019, 28(21):3569–3583.
200. Poirion O, Zhu X, Ching T, Garmire LX. Using single nucleotide variations
in single-cell RNA-seq to identify subpopulations and genotype-phenotype
linkage. Nat Commun 2018, 9(1):4892.
201. Fangal VD. CTAT Mutations: A Machine Learning Based RNA-Seq Variant
Calling Pipeline Incorporating Variant Annotation, Prioritization, and
Visualization. 2020.
202. Huang X, Huang Y. Cellsnp-lite: an efficient tool for genotyping single cells.
Bioinformatics 2021, 37(23):4569–4571.
203. Liu F, Zhang Y, Zhang L, Li Z, Fang Q, Gao R, Zhang Z. Systematic compara-
tive analysis of single-nucleotide variant detection methods from single-cell
RNA sequencing data. Genome Biol 2019, 20(1):242.
204. Chung W, Eum HH, Lee HO, Lee KM, Lee HB, Kim KT, Ryu HS, Kim S, Lee
JE, Park YH et al. Single-cell RNA-seq enables comprehensive tumour and
immune cell profiling in primary breast cancer. Nat Commun 2017, 8:15081.
205. Fan J, Lee HO, Lee S, Ryu DE, Lee S, Xue C, Kim SJ, Kim K, Barkas N, Park PJ
et al. Linking transcriptional and genetic tumor heterogeneity through allele
analysis of single-cell RNA-seq data. Genome Res 2018, 28(8):1217–1227.
206. Serin Harmanci A, Harmanci AO, Zhou X. CaSpER identifies and visualizes
CNV events by integrative analysis of single-cell or bulk RNA-sequencing
data. Nat Commun 2020, 11(1):89.
207. inferCNV of the Trinity CTAT Project (https://github.com/broadinstitute/
inferCNV)
208. Muller S, Cho A, Liu SJ, Lim DA, Diaz A. CONICS integrates scRNA-seq with
DNA sequencing to map gene expression to tumor sub-clones. Bioinformatics
2018, 34(18):3217–3219.
209. van de Geijn B, McVicker G, Gilad Y, Pritchard JK. WASP: allele-specific soft-
ware for robust molecular quantitative trait locus discovery. Nat Methods
2015, 12(11):1061–1063.
210. Borel C, Ferreira PG, Santoni F, Delaneau O, Fort A, Popadin KY, Garieri M,
Falconnet E, Ribaux P, Guipponi M et al. Biased allelic expression in human
primary fibroblast single cells. Am J Hum Genet 2015, 96(1):70–80.
211. Song Y, Botvinnik OB, Lovci MT, Kakaradov B, Liu P, Xu JL, Yeo GW. Single-
cell alternative splicing analysis with expedition reveals splicing dynamics
during neuron differentiation. Mol Cell 2017, 67(1):148–161 e145.
212. Huang Y, Sanguinetti G. BRIE: transcriptome-wide splicing quantification in
single cells. Genome Biol 2017, 18(1):123.
213. Huang Y, Sanguinetti G. BRIE2: computational identification of splicing
phenotypes from single-cell transcriptomic experiments. Genome Biol 2021,
22(1):251.
214. Matsumoto H, Hayashi T, Ozaki H, Tsuyuzaki K, Umeda M, Iida T,
Nakamura M, Okano H, Nikaido I. An NMF-based approach to discover
overlooked differentially expressed gene regions from single-cell RNA-seq
data. NAR Genom Bioinform 2019, 2(1):lqz020.
215. Ling JP, Wilks C, Charles R, Leavey PJ, Ghosh D, Jiang L, Santiago CP, Pang
B, Venkataraman A, Clark BS et al. ASCOT identifies key regulators of neur-
onal subtype-specific splicing. Nat Commun 2020, 11(1):137.
216. Ozaki H, Hayashi T, Umeda M, Nikaido I. Millefy: visualizing cell-to-cell
heterogeneity in read coverage of single-cell RNA sequencing datasets. BMC
Genomics 2020, 21(1):177.
217. Wen WX, Mead AJ, Thongjuea S. VALERIE: Visual-based inspection of
alternative splicing events at single-cell resolution. PLoS Comput Biol 2020,
16(9):e1008195.
218. Hu Y, Wang K, Li M. Detecting differential alternative splicing events in
scRNA-seq with or without unique molecular identifiers. PLoS Comput Biol
2020, 16(6):e1007925.
219. Benegas G, Fischer J, Song YS. Robust and annotation-free analysis of alter-
native splicing across diverse cell types in mice. Elife 2022, 11:e73520.
220. Liu S, Zhou B, Wu L, Sun Y, Chen J, Liu S. Single-cell differential splicing
analysis reveals high heterogeneity of liver tumor-infiltrating T cells. Sci Rep
2021, 11(1):5325.
221. Aibar S, Gonzalez-Blas CB, Moerman T, Huynh-Thu VA, Imrichova
H, Hulselmans G, Rambow F, Marine JC, Geurts P, Aerts J et al.
SCENIC: single-cell regulatory network inference and clustering. Nat Methods
2017, 14(11):1083–1086.
222. Matsumoto H, Kiryu H, Furusawa C, Ko MSH, Ko SBH, Gouda N, Hayashi
T, Nikaido I. SCODE: an efficient regulatory network inference algorithm
from single-cell RNA-Seq during differentiation. Bioinformatics 2017,
33(15):2314–2321.
223. Matsumoto H, Kiryu H. SCOUP: a probabilistic model based on the Ornstein-
Uhlenbeck process to analyze single-cell expression data during differenti-
ation. BMC Bioinformatics 2016, 17(1):232.
224. Chan TE, Stumpf MPH, Babtie AC. Gene Regulatory Network Inference
from Single-Cell Data Using Multivariate Information Measures. Cell Syst
2017, 5(3):251–267 e253.
225. Specht AT, Li J. LEAP: constructing gene co-expression networks for single-
cell RNA-sequencing data using pseudotime ordering. Bioinformatics 2017,
33(5):764–766.
226. Liu H, Li P, Zhu M, Wang X, Lu J, Yu T. Nonlinear Network Reconstruction
from Gene Expression Data Using Marginal Dependencies Measured by
DCOL. PLoS One 2016, 11(7):e0158247.
227. Cordero P, Stuart JM. Tracing Co-Regulatory Network Dynamics in Noisy,
Single-Cell Transcriptome Trajectories. Pac Symp Biocomput 2017, 22:576–587.
228. Aubin-Frankowski PC, Vert JP. Gene regulation inference from single-cell
RNA-seq data with linear differential equations and velocity inference.
Bioinformatics 2020, 36(18):4774–4780.
229. Huynh-Thu VA, Irrthum A, Wehenkel L, Geurts P. Inferring regulatory
networks from expression data using tree-based methods. PLoS One 2010,
5(9):e12776.
230. Moerman T, Aibar Santos S, Bravo Gonzalez-Blas C, Simm J, Moreau Y, Aerts
J, Aerts S. GRNBoost2 and Arboreto: efficient and scalable inference of gene
regulatory networks. Bioinformatics 2019, 35(12):2159–2161.
231. Woodhouse S, Piterman N, Wintersteiger CM, Gottgens B, Fisher J. SCNS: a
graphical tool for reconstructing executable regulatory networks from single-
cell genomic data. BMC Syst Biol 2018, 12(1):59.
232. Lim CY, Wang H, Woodhouse S, Piterman N, Wernisch L, Fisher J, Gottgens
B. BTR: training asynchronous Boolean models using single-cell expression
data. BMC Bioinformatics 2016, 17(1):355.
233. Pratapa A, Jalihal AP, Law JN, Bharadwaj A, Murali TM. Benchmarking
algorithms for gene regulatory network inference from single-cell
transcriptomic data. Nat Methods 2020, 17(2):147–154.
234. Chen S, Mar JC. Evaluating methods of inferring gene regulatory networks
highlights their lack of performance for single cell gene expression data.
BMC Bioinformatics 2018, 19(1):232.
235. Nguyen H, Tran D, Tran B, Pehlivan B, Nguyen T. A comprehensive survey
of regulatory network inference methods using single-cell RNA sequencing
data. Brief Bioinform 2020, 22(3):bbaa190.
236. Kang Y, Thieffry D, Cantini L. Evaluating the Reproducibility of Single-Cell
Gene Regulatory Network Inference Algorithms. Front Genet 2021, 12:617282.
237. Dai H, Li L, Zeng T, Chen L. Cell-specific network constructed by single-cell
RNA sequencing data. Nucleic Acids Res 2019, 47(11):e62.
238. Li L, Dai H, Fang Z, Chen L. c-CSN: Single-cell RNA Sequencing Data
Analysis by Conditional Cell-specific Network. Genomics Proteomics
Bioinformatics 2021, 19(2):319–329.
239. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation
network analysis. BMC Bioinformatics 2008, 9:559.
9
Small RNA Sequencing
the entire pool of small RNAs, provides an excellent tool for novel miRNA
discovery and experimental validation of computational predictions.
Furthermore, small RNA sequencing offers an assumption-free, comprehen-
sive analysis of the small RNA transcriptome in biological targets, including
differential expression between conditions. In general, small RNA sequen-
cing data analysis shares much commonality with the analysis of RNA-seq
data (Chapter 7). At the same time, some aspects of small RNA sequencing data analysis are unique, and these are the main focus of this chapter.
FIGURE 9.1
Deep sequencing of mature miRNAs after Dicer and Argonaute processing. Dicer cleaves a short
stem-loop structure out of pre-miRNA to form the miRNA:miRNA* duplex. Upon loading into
RISC, Argonaute unwinds the duplex and uses one strand as a guide for gene silencing while discarding the other strand (the star strand). Although the short stem-loop and star strand sequences are usually degraded, they may still generate sequencing signals, either because of undegraded residues or because they persist to perform other functions (e.g., the star strand is sometimes itself functional).
9.1.2 Preprocessing
After sequencing and demultiplexing, the reads generated from each sample first need to be checked for quality using the QC tools
introduced in Chapter 5 such as FastQC, NGS QC Toolkit, and fastp, or spe-
cifically developed miRNA-seq data QC tools including miRTrace [6] and
mirnaQC [7]. Besides the typical NGS data QC metrics, specific miRNA-seq
QC tools often provide additional features. For example, mirnaQC provides
quality measures on miRNA yield and the fraction of putative degradation
products (e.g., rRNA fragments) in both absolute values and relative ranks
in comparison to a reference collection of 36,000 published datasets. Because
small RNA libraries are usually sequenced with read lengths longer than the actual small RNA inserts, the 3' adapter sequence is often part of the generated sequence reads and therefore should also be trimmed off. The trimming can be carried out with stand-alone tools such as Cutadapt and Trimmomatic, or with utilities in the NGS QC Toolkit or fastp. Adapter trimming can also be conducted as part of mapping, as some mappers provide such an option, or using data preprocessing modules within some small RNA data analysis tools (to be covered next).
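As a minimal illustration of the trimming step (not a replacement for the dedicated tools above), the Python sketch below removes a 3' adapter from a read by exact prefix matching; the adapter sequence shown is only an example, and tools such as Cutadapt additionally tolerate mismatches, ambiguous bases, and quality trimming.

    # Illustrative only: exact-match 3' adapter trimming for a small RNA read.
    ADAPTER = "TGGAATTCTCGGGTGCCAAGG"   # example adapter sequence

    def trim_3prime_adapter(read, adapter=ADAPTER, min_overlap=8):
        """Remove the 3' adapter (and any bases after it) from a read."""
        pos = read.find(adapter)
        if pos != -1:                     # full adapter found inside the read
            return read[:pos]
        # Otherwise look for a partial adapter at the very end of the read.
        for k in range(len(adapter) - 1, min_overlap - 1, -1):
            if read.endswith(adapter[:k]):
                return read[:-k]
        return read                        # no adapter detected

    print(trim_3prime_adapter("ACGTACGTACGTACGTACGTTGGAATTCTCGGGTGCC"))
    # prints ACGTACGTACGTACGTACGT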
9.1.3 Mapping
For mapping small RNA sequencing reads to a reference genome, short read
aligners introduced in Chapter 5, such as Bowtie/Bowtie2, BWA, Novoalign,
or SOAP/SOAP2, or RNA-seq aligners introduced in Chapter 7, such as STAR, can be used. Among these aligners, Novoalign offers the option of stripping off adapter sequences in the mapping command. As for the reference genome, the most recent assembly should always be used. Because of the short target read length, the number of allowed mismatches should be set to 1. To speed
up the mapping process, a multi-threading parameter, which enables the use
of multiple CPU cores, can be used if the aligner supports it. After mapping,
reads that are aligned to unique regions are then searched against small RNA
databases to establish their identities (see next section), while those that
are mapped to a large number (e.g., >5,000) of genomic locations should be
removed from further analysis.
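As a small illustration of this last filtering step, the sketch below uses pysam to drop reads whose aligner-reported hit count exceeds the threshold mentioned above, assuming the aligner wrote the count to the optional NH SAM tag (not all aligners do); the file names are placeholders.

    # Illustrative only: discard reads mapping to an excessive number of locations.
    import pysam

    MAX_LOCATIONS = 5000   # threshold suggested above

    with pysam.AlignmentFile("smallrna.bam", "rb") as bam, \
         pysam.AlignmentFile("smallrna.filtered.bam", "wb", template=bam) as out:
        for read in bam:
            if read.is_unmapped:
                continue
            # Reads without an NH tag are treated as uniquely mapped here.
            hits = read.get_tag("NH") if read.has_tag("NH") else 1
            if hits <= MAX_LOCATIONS:
                out.write(read)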
Besides the aforementioned general tools for small RNA reads
preprocessing and mapping, tools have also been developed specially for
small RNA-seq analysis, such as miRDeep/miRDeep2 [8], sRNAtoolbox [9],
ShortStack [10], sRNAnalyzer [11], miRge [12], and miRMaster [13]. Among
these tools, sRNAtoolbox is a collection of small RNA-seq data tools for dif-
ferential expression analysis and other downstream analyses. Its centerpiece is sRNAbench, which replaces the previously widely used miRanalyzer. It
provides functions such as data preprocessing, genome mapping using
Bowtie, visualization of genome mapped reads, expression profiling, etc.
While the mapping of small RNA reads to a reference genome is similar
to the mapping in RNA-seq as covered in Chapter 7, some characteristics of
small RNAs, mostly their short length and post-transcriptional editing (see
next), present different challenges to the small RNA read mapping process.
Because of their short length, sizeable numbers of small RNA reads are usu-
ally mapped to more than one genomic region. In comparison, this issue is
minimal for RNA-seq data, as longer and sometimes paired-end reads greatly
increase specificity. The easiest way to deal with multi-mapped small RNA
reads is to simply ignore them, but this leads to the loss of a great amount of data. A more commonly used approach is to assign each such read randomly to one of its mapped positions, while an alternative is to report it at all possible positions. More sophisticated algorithms have also been developed.
9.1.5 Normalization
Before identifying differentially expressed small RNAs, read counts for each
small RNA species in the samples need to be normalized. The goal of normal-
ization is to make the samples directly comparable by removing unwanted
sample-specific variations, which are usually due to differences in library size
and therefore sequencing depth. The normalization approaches used in bulk
References
1. Huang V, Qin Y, Wang J, Wang X, Place RF, Lin G, Lue TF, Li LC. RNAa is
conserved in mammalian cells. PLoS One 2010, 5(1):e8848.
2. Androvic P, Benesova S, Rohlova E, Kubista M, Valihrach L. Small RNA-
sequencing for analysis of circulating miRNAs: benchmark study. J Mol Diagn
2022, 24(4):386–394.
3. Baran-Gale J, Kurtz CL, Erdos MR, Sison C, Young A, Fannin EE, Chines PS,
Sethupathy P. Addressing bias in small RNA library preparation for sequen-
cing: a new protocol recovers microRNAs that evade capture by current
methods. Front Genet 2015, 6:352.
4. Benesova S, Kubista M, Valihrach L. Small RNA-sequencing: approaches and
considerations for miRNA analysis. Diagnostics (Basel) 2021, 11(6):964.
5. Metpally RP, Nasser S, Malenica I, Courtright A, Carlson E, Ghaffari L,
Villa S, Tembe W, Van Keuren-Jensen K. Comparison of analysis tools for
miRNA high throughput sequencing using nerve crush as a model. Front
Genet 2013, 4:20.
FIGURE 10.1
General workflow for genotyping and variation discovery from resequencing data: read mapping, local realignment, variant calling, and variant annotation.
FIGURE 10.2
The variant calling process is usually affected by various factors. In this illustration, a number
of reads are aligned against a reference sequence (bottom). At the illustrated site, the reference
sequence has a C while the reads have C and T. Depending on the factors mentioned in the text
and prior information, this site can be called as heterozygous (C/T), or as showing no variation (C/C) if the T's are treated as errors. It could also be called homozygous (T/T) if the C's are regarded as errors.
FIGURE 10.3
Detection of somatic mutations vs. germline variations. In this example, sequence reads from
normal and tumor tissues are aligned to the reference genome (shown at the top in green). The
allelic counts, i.e., the number of matches (aN and aT) and depth of reads (dN and dT), at each base
position are shown. The blue sites show germline positions, while the red shows a position where
a somatic mutation occurred in some tumor cells. Also shown at the bottom are the predicted
genotypes for the normal and tumor tissues. (Modified from Roth A. et al. JointSNVMix: a
probabilistic model for accurate detection of somatic mutations in normal/ tumour paired
next-generation sequencing data. Bioinformatics, 2012, 28 (7): 907–13, by permission of Oxford
University Press.)
variants in the contrasting samples are then compared to each other to locate
somatic mutations in the cancer tissue. In the latter approach, the samples
are directly compared to each other using statistical tests on the basis of joint
probability. NeuSomatic represents the first attempt to use deep learning
(CNN) for somatic variant detection. The newer Mutect3 also uses machine learning in an effort to improve somatic mutation detection accuracy. To help evaluate the performance of these various somatic variant
callers, several benchmarking studies [27–30] are available showing that
Mutect2 and Strelka2 are among the top performers so far.
TABLE 10.1
Mandatory Fields in a VCF File
Col Field Type Description
                        Predicted Positive        Predicted Negative
Actual Positive         True Positive (TP)        False Negative (FN)
Actual Negative         False Positive (FP)       True Negative (TN)
FIGURE 10.5
Contingency table for variant calling.
between major and minor alleles, extreme depth of coverage, or strand bias.
The ratio of transitions to transversions (Ti/Tv) is an additional indicator of variant call specificity and quality. By random chance alone the expected Ti/Tv ratio is 0.5, since each base has twice as many possible transversions as transitions. However, because of the biochemical mechanisms underlying these nucleotide substitution processes, transitions occur more frequently than transversions. Based on existing NGS data from multiple species, the expected Ti/Tv values for whole-genome and exome datasets are usually in the ranges of 2.0–2.1 and 3.0–3.5, respectively [38]. Variants that do not pass these QC criteria are then
filtered out. Besides such filtering using preset criteria, low quality variants
may also be identified for removal using machine learning approaches such
as ForestQC [39] and VQSR (part of GATK).
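As a quick worked example of the Ti/Tv metric, the short Python sketch below counts transitions and transversions in a toy list of SNVs given as (reference, alternate) base pairs; real pipelines would of course derive these counts from a VCF file.

    # Worked example: Ti/Tv from a toy list of SNVs given as (ref, alt) pairs.
    TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

    def ti_tv_ratio(snvs):
        ti = sum(1 for ref, alt in snvs if (ref, alt) in TRANSITIONS)
        tv = len(snvs) - ti
        return ti / tv if tv else float("inf")

    # Three transitions and one transversion give Ti/Tv = 3.0
    print(ti_tv_ratio([("A", "G"), ("C", "T"), ("G", "A"), ("A", "C")]))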
As different variant callers employ different approaches, the variants they identify usually only partially overlap. It is advisable, therefore, to examine the specifics of an experiment closely when deciding on the most appropriate variant caller(s). If more than one caller is used, their outputs should be compared to see how the calls intersect, since the use of convergent variants is an effective way to reduce the rate of miscalled variants. Alternatively,
ensemble methods such as VariantMetaCaller [40] and BAYSIC [41] can also
be used.
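The sketch below shows one simple way to obtain such convergent calls, assuming each caller's output has already been reduced to a set of (chromosome, position, ref, alt) tuples; the variant records themselves are invented for illustration.

    # Illustrative only: convergent calls shared by two hypothetical callers.
    caller_a = {("chr1", 1001, "C", "T"), ("chr1", 2050, "G", "A"), ("chr2", 300, "A", "AT")}
    caller_b = {("chr1", 1001, "C", "T"), ("chr2", 300, "A", "AT"), ("chr3", 77, "T", "G")}

    convergent = caller_a & caller_b   # called by both methods: higher confidence
    discordant = caller_a ^ caller_b   # called by only one: candidates for review

    print(sorted(convergent))
    print(sorted(discordant))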
To compare results from multiple variant callers side by side, precision and recall are key metrics often used to measure caller performance. Precision reflects the ability to avoid false positives, while recall reflects the ability to avoid false negatives, i.e., to not miss true variants (Figure 10.5). Mathematically, precision, recall, and their harmonic mean (the F1 score) are calculated as

Precision = TP / (TP + FP), where TP + FP is the total of all predicted positives;

Recall = TP / (TP + FN), where TP + FN is the total of all actual positives; and

F1 = 2 × (Precision × Recall) / (Precision + Recall)
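For example, with made-up counts from a contingency table like Figure 10.5, these metrics can be computed directly:

    # Made-up counts for illustration: 9,500 true positives, 250 false
    # positives, and 500 false negatives from a hypothetical call set.
    tp, fp, fn = 9500, 250, 500

    precision = tp / (tp + fp)   # fraction of reported variants that are real
    recall = tp / (tp + fn)      # fraction of real variants that are reported
    f1 = 2 * precision * recall / (precision + recall)

    print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")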
FIGURE 10.6
Four approaches for SV detection. (From Escaramís, G. & Docampo, E. A decade of structural
variants: description, history and methods to detect structural variation. Briefings in Functional
Genomics, 2015, 14 (5): 305–14, by permission of Oxford University Press.)
FIGURE 10.7
General steps of calling SVs using paired-end reads. (Used with permission from Whelan,
Christopher, “Detecting and Analyzing Genomic Structural Variation Using Distributed
Computing” (2014). Scholar Archive. Paper 3482.)
clusters are filtered based on statistical assessment so that only clusters that
are covered by multiple read pairs are reported as SVs. The boundaries of
possible break points in the region are also identified in this step (indicated
by the shaded area in Figure 10.7, panel d). Among currently available SV
detection algorithms, PEMer (or Paired-End Mapper) [44], BreakDancer [45],
SVDetect [46], and 1-2-3-SV [47] apply this paired-reads based approach.
Pindel [48] provides an example of the SR approach. It first searches for
read pairs in which one read aligns to the reference genome but the other
does not. Based on the assumption that the second read contains a break-
point, it uses the aligned read as anchor to scan the surrounding regions for
split mapping of the second read. While it can locate breakpoints at single
base resolution, this approach is computationally expensive because of the
challenge associated with aligning read sub-sequences to different genomic
regions with gaps in between. Cortex [49] and AsmVar [50] are examples of
the DS approach. In this approach, the genome is first assembled from reads,
and subsequently SVs are called through alignment and statistical analysis of
the de novo genome assembly against the reference.
To improve detection accuracy, many currently available SV detection
algorithms use a combination of these approaches. For example, DELLY [51],
Meerkat [52], SoftSV [53], and Wham [54] combine PR and SR, while GASV/
GASVPro [55, 56], Genome STRiP [57], and inGAP-sv [58] combine RD and
PR. HYDRA [59] is an example of combining RD and AS. As examples of
combining three approaches, MANTA [60], GRIDSS [61], SvABA [62], and
CREST [63] combine PR, SR, and AS. LUMPY [64] and TIDDIT [65], on the
other hand, combine RD, PR, and SR.
10.3.2 Long-Read-Based SV Calling
SV detection based on short reads suffers from high miscalling rates because of the limitations of short-read sequencing. Long-read technologies such as PacBio and Oxford Nanopore sequencing overcome the inherent limitations caused by short read length. Mechanistically, long-read-based SV callers
are mostly built on the use of the SR and/or AS approaches. These callers
include pbsv [66], Sniffles [67], Phased Assembly Variant (PAV) [68], MELT
[69], NanoVar [70], NanoSV [71], PALMER [72], SVIM [73], and Picky [74].
Some callers, such as Dysgu [75], are developed to use both long and short
reads. Besides long reads generated from long-read sequencers, synthetic
long reads obtained using technologies such as linked-read sequencing [76–
79] can also be used for SV detection by deploying tools such as Long Ranger,
an open-source pipeline developed by 10× Genomics [80].
10.3.3 CNV Detection
CNVs, caused by duplications, insertions, or deletions, are an important
subtype of structural variation. Among the four basic approaches outlined
in Section 10.3.1, CNV detection algorithms are often based on RD. These
algorithms are based on the assumption that the number of reads obtained
from a region is proportional to its copy number in the genome. If a gen-
omic segment is repeated multiple times, for example, a significantly higher
number of reads will be observed from the segment compared to other
10.3.4 Integrated SV Analysis
The different software tools introduced above are based on different algorithmic designs and, as a result, show varying performance for detecting par-
ticular types (or aspects) of SVs [94, 95]. In order to improve call performance
for the full range of SVs, there have been efforts to take an integrated approach
towards comprehensive SV calling using the different but often complemen-
tary tools. SVMerge, being one of these efforts, integrates SV calling results
from different callers [96]. It first feeds BAM files into a number of SV callers
including those introduced above to generate BED files, and then the SV calls
in the BED files are merged. After computational validation and breakpoint
refinement by local de novo alignment, a final list of SVs is generated. Other
efforts that take a similarly integrated approach include Parliament2 [97],
FusorSV [98], SURVIVOR [99], MetaSV [100], and CNVer [101].
References
1. Acuna-Hidalgo R, Veltman JA, Hoischen A. New insights into the gener-
ation and role of de novo mutations in health and disease. Genome Biol 2016,
17(1):241.
2. Miller MB, Reed HC, Walsh CA. Brain Somatic Mutation in Aging and
Alzheimer’s Disease. Annu Rev Genomics Hum Genet 2021, 22:239–256.
3. Martincorena I, Campbell PJ. Somatic mutation in cancer and normal cells.
Science 2015, 349(6255):1483–1489.
4. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky
A, Garimella K, Altshuler D, Gabriel S, Daly M et al. The Genome Analysis
Toolkit: a MapReduce framework for analyzing next-generation DNA sequen-
cing data. Genome Res 2010, 20(9):1297–1303.
5. Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, Whitwham
A, Keane T, McCarthy SA, Davies RM et al. Twelve years of SAMtools and
BCFtools. GigaScience 2021, 10(2).
6. Kim S, Scheffler K, Halpern AL, Bekritsky MA, Noh E, Kallberg M, Chen X,
Kim Y, Beyter D, Krusche P et al. Strelka2: fast and accurate calling of germline
and somatic variants. Nat Methods 2018, 15(8):591–594.
24. Sahraeian SME, Liu R, Lau B, Podesta K, Mohiyuddin M, Lam HYK. Deep
convolutional neural networks for accurate somatic mutation detection. Nat
Commun 2019, 10(1):1041.
25. Roth A, Ding J, Morin R, Crisan A, Ha G, Giuliany R, Bashashati A, Hirst M,
Turashvili G, Oloumi A et al. JointSNVMix: a probabilistic model for accurate
detection of somatic mutations in normal/tumour paired next-generation
sequencing data. Bioinformatics 2012, 28(7):907–913.
26. Narzisi G, Corvelo A, Arora K, Bergmann EA, Shah M, Musunuri R, Emde
AK, Robine N, Vacic V, Zody MC. Genome-wide somatic variant calling using
localized colored de Bruijn graphs. Commun Biol 2018, 1:20.
27. Cai L, Yuan W, Zhang Z, He L, Chou KC. In-depth comparison of somatic
point mutation callers based on different tumor next-generation sequencing
depth data. Sci Rep 2016, 6:36540.
28. Kroigard AB, Thomassen M, Laenkholm AV, Kruse TA, Larsen MJ. Evaluation
of nine somatic variant callers for detection of somatic mutations in exome
and targeted deep sequencing data. PLoS One 2016, 11(3):e0151664.
29. Chen Z, Yuan Y, Chen X, Chen J, Lin S, Li X, Du H. Systematic comparison of
somatic variant calling performance among different sequencing depth and
mutation frequency. Sci Rep 2020, 10(1):3501.
30. Zhao S, Agafonov O, Azab A, Stokowy T, Hovig E. Accuracy and efficiency
of germline variant calling pipelines for human genome data. Sci Rep 2020,
10(1):20222.
31. GATK Best Practices Workflow for RNAseq Short Variant Discovery (SNPs
+Indels) (https://gatk.broadinstitute.org/hc/en-us/articles/360035531192-
RNAseq-short-variant-discovery-SNPs-Indels-)
32. Piskol R, Ramaswami G, Li JB. Reliable identification of genomic variants
from RNA-seq data. Am J Hum Genet 2013, 93(4):641–651.
33. Tang X, Baheti S, Shameer K, Thompson KJ, Wills Q, Niu N, Holcomb IN,
Boutet SC, Ramakrishnan R, Kachergus JM et al. The eSNV-detect: a computa-
tional system to identify expressed single nucleotide variants from transcrip-
tome sequencing data. Nucleic Acids Res 2014, 42(22):e172.
34. Goya R, Sun MG, Morin RD, Leung G, Ha G, Wiegand KC, Senz J, Crisan A,
Marra MA, Hirst M et al. SNVMix: predicting single nucleotide variants from
next-generation sequencing of tumors. Bioinformatics 2010, 26(6):730–736.
35. Oikkonen L, Lise S. Making the most of RNA-seq: Pre-processing sequen-
cing data with Opossum for reliable SNP variant detection. Wellcome Open Res
2017, 2:6.
36. Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker
RE, Lunter G, Marth GT, Sherry ST et al. The variant call format and VCFtools.
Bioinformatics 2011, 27(15):2156–2158.
37. Knaus BJ, Grunwald NJ. vcfr: a package to manipulate and visualize variant
call format data in R. Mol Ecol Resour 2017, 17(1):44–53.
38. Freudenberg-Hua Y, Freudenberg J, Kluck N, Cichon S, Propping P, Nothen
MM. Single nucleotide variation analysis in 65 candidate genes for CNS
disorders in a representative sample of the European population. Genome Res
2003, 13(10):2271–2276.
39. Li J, Jew B, Zhan L, Hwang S, Coppola G, Freimer NB, Sul JH. ForestQC: Quality
control on genetic variants from next-generation sequencing data using
random forest. PLoS Comput Biol 2019, 15(12):e1007556.
56. Sindi SS, Onal S, Peng LC, Wu HT, Raphael BJ. An integrative probabilistic
model for identification of structural variation in sequencing data. Genome Biol
2012, 13(3):R22.
57. Handsaker RE, Van Doren V, Berman JR, Genovese G, Kashin S, Boettger LM,
McCarroll SA. Large multiallelic copy number variations in humans. Nat
Genet 2015, 47(3):296–303.
58. Qi J, Zhao F. inGAP-sv: a novel scheme to identify and visualize structural
variation from paired end mapping data. Nucleic Acids Res 2011, 39(Web
Server issue):W567–575.
59. Quinlan AR, Clark RA, Sokolova S, Leibowitz ML, Zhang Y, Hurles ME,
Mell JC, Hall IM. Genome-wide mapping and assembly of structural variant
breakpoints in the mouse genome. Genome Res 2010, 20(5):623–635.
60. Chen X, Schulz-Trieglaff O, Shaw R, Barnes B, Schlesinger F, Kallberg M, Cox
AJ, Kruglyak S, Saunders CT. Manta: rapid detection of structural variants
and indels for germline and cancer sequencing applications. Bioinformatics
2016, 32(8):1220–1222.
61. Cameron DL, Schroder J, Penington JS, Do H, Molania R, Dobrovic A,
Speed TP, Papenfuss AT. GRIDSS: sensitive and specific genomic rearrange-
ment detection using positional de Bruijn graph assembly. Genome Res 2017,
27(12):2050–2060.
62. Wala JA, Bandopadhayay P, Greenwald NF, O’Rourke R, Sharpe T, Stewart
C, Schumacher S, Li Y, Weischenfeldt J, Yao X et al. SvABA: genome-wide
detection of structural variants and indels by local assembly. Genome Res 2018,
28(4):581–591.
63. Wang J, Mullighan CG, Easton J, Roberts S, Heatley SL, Ma J, Rusch MC, Chen
K, Harris CC, Ding L et al. CREST maps somatic structural variation in cancer
genomes with base-pair resolution. Nat Methods 2011, 8(8):652–654.
64. Layer RM, Chiang C, Quinlan AR, Hall IM. LUMPY: a probabilistic frame-
work for structural variant discovery. Genome Biol 2014, 15(6):R84.
65. Eisfeldt J, Vezzi F, Olason P, Nilsson D, Lindstrand A. TIDDIT, an efficient and
comprehensive structural variant caller for massive parallel sequencing data.
F1000Research 2017, 6:664.
66. pbsv (https://github.com/PacificBiosciences/pbsv)
67. Sedlazeck FJ, Rescheneder P, Smolka M, Fang H, Nattestad M, von Haeseler A,
Schatz MC. Accurate detection of complex structural variations using single-
molecule sequencing. Nat Methods 2018, 15(6):461–468.
68. Ebert P, Audano PA, Zhu Q, Rodriguez-Martin B, Porubsky D, Bonder MJ,
Sulovari A, Ebler J, Zhou W, Serra Mari R et al. Haplotype-resolved diverse
human genomes and integrated analysis of structural variation. Science 2021,
372(6537):eabf7117.
69. Gardner EJ, Lam VK, Harris DN, Chuang NT, Scott EC, Pittard WS, Mills
RE, 1000 Genomes Project Consortium, Devine SE. The Mobile Element Locator Tool
(MELT): population-scale mobile element discovery and biology. Genome Res
2017, 27(11):1916–1929.
70. Tham CY, Tirado-Magallanes R, Goh Y, Fullwood MJ, Koh BTH, Wang W, Ng
CH, Chng WJ, Thiery A, Tenen DG et al. NanoVar: accurate characterization of
patients’ genomic structural variants using low-depth nanopore sequencing.
Genome Biol 2020, 21(1):56.
71. Cretu Stancu M, van Roosmalen MJ, Renkens I, Nieboer MM, Middelkamp
S, de Ligt J, Pregno G, Giachino D, Mandrile G, Espejo Valle-Inclan J
et al. Mapping and phasing of structural variation in patient genomes using
nanopore sequencing. Nat Commun 2017, 8(1):1326.
72. Zhou W, Emery SB, Flasch DA, Wang Y, Kwan KY, Kidd JM, Moran JV,
Mills RE. Identification and characterization of occult human-specific LINE-
1 insertions using long-read sequencing technology. Nucleic Acids Res 2020,
48(3):1146–1163.
73. Heller D, Vingron M. SVIM: structural variant identification using mapped
long reads. Bioinformatics 2019, 35(17):2907–2915.
74. Gong L, Wong CH, Cheng WC, Tjong H, Menghi F, Ngan CY, Liu ET, Wei
CL. Picky comprehensively detects high-resolution structural variants in
nanopore long reads. Nat Methods 2018, 15(6):455–460.
75. Cleal K, Baird DM. Dysgu: efficient structural variant calling using short or
long reads. Nucleic Acids Res 2022, 50(9):e53.
76. Zheng GX, Lau BT, Schnall-Levin M, Jarosz M, Bell JM, Hindson CM,
Kyriazopoulou-Panagiotopoulou S, Masquelier DA, Merrill L, Terry JM et al.
Haplotyping germline and cancer genomes with high-throughput linked-read
sequencing. Nat Biotechnol 2016, 34(3):303–311.
77. Zhang F, Christiansen L, Thomas J, Pokholok D, Jackson R, Morrell N, Zhao Y,
Wiley M, Welch E, Jaeger E et al. Haplotype phasing of whole human genomes
using bead-based barcode partitioning in a single tube. Nat Biotechnol 2017,
35(9):852–857.
78. Wang O, Chin R, Cheng X, Wu MKY, Mao Q, Tang J, Sun Y, Anderson E, Lam
HK, Chen D et al. Efficient and unique cobarcoding of second-generation
sequencing reads from long DNA molecules enabling cost-effective and
accurate sequencing, haplotyping, and de novo assembly. Genome Res 2019,
29(5):798–808.
79. Chen Z, Pham L, Wu TC, Mo G, Xia Y, Chang PL, Porter D, Phan T, Che H, Tran
H et al. Ultralow-input single-tube linked-read library method enables short-read
second-generation sequencing systems to routinely generate highly accurate and
economical long-range sequencing information. Genome Res 2020, 30(6):898–909.
80. Long Ranger (https://github.com/10XGenomics/longranger)
81. Abyzov A, Urban AE, Snyder M, Gerstein M. CNVnator: an approach to dis-
cover, genotype, and characterize typical and atypical CNVs from family and
population genome sequencing. Genome Res 2011, 21(6):974–984.
82. Xie C, Tammi MT. CNV-seq, a new method to detect copy number variation
using high-throughput sequencing. BMC bioinformatics 2009, 10:80.
83. Talevich E, Shain AH, Botton T, Bastian BC. CNVkit: Genome-wide copy
number detection and visualization from targeted DNA sequencing. PLoS
Comput Biol 2016, 12(4):e1004873.
84. Klambauer G, Schwarzbauer K, Mayr A, Clevert DA, Mitterecker A,
Bodenhofer U, Hochreiter S. cn.MOPS: mixture of Poissons for discovering
copy number variations in next-generation sequencing data with a low false
discovery rate. Nucleic Acids Res 2012, 40(9):e69.
85. Ivakhno S, Royce T, Cox AJ, Evers DJ, Cheetham RK, Tavare S. CNAseg--a
novel framework for identification of copy number changes in cancer from
second-generation sequencing data. Bioinformatics 2010, 26(24):3051–3058.
Whole Genome/Exome Sequencing 235
86. Zhu M, Need AC, Han Y, Ge D, Maia JM, Zhu Q, Heinzen EL, Cirulli ET,
Pelak K, He M et al. Using ERDS to infer copy-number variants in high-
coverage genomes. Am J Hum Genet 2012, 91(3):408–421.
87. Yoon S, Xuan Z, Makarov V, Ye K, Sebat J. Sensitive and accurate detection
of copy number variants using read depth of coverage. Genome Res 2009,
19(9):1586–1592.
88. Boeva V, Popova T, Bleakley K, Chiche P, Cappo J, Schleiermacher G,
Janoueix-Lerosey I, Delattre O, Barillot E. Control-FREEC: a tool for assessing
copy number and allelic content using next-generation sequencing data.
Bioinformatics 2012, 28(3):423–425.
89. Alkan C, Kidd JM, Marques-Bonet T, Aksay G, Antonacci F, Hormozdiari F,
Kitzman JO, Baker C, Malig M, Mutlu O et al. Personalized copy number and
segmental duplication maps using next-generation sequencing. Nat Genet
2009, 41(10):1061–1067.
90. Chiang DY, Getz G, Jaffe DB, O’Kelly MJ, Zhao X, Carter SL, Russ C,
Nusbaum C, Meyerson M, Lander ES. High-resolution mapping of copy-
number alterations with massively parallel sequencing. Nat Methods 2009,
6(1):99–103.
91. Miller CA, Hampton O, Coarfa C, Milosavljevic A. ReadDepth: a parallel R
package for detecting copy number alterations from short sequencing reads.
PLoS One 2011, 6(1):e16327.
92. Roller E, Ivakhno S, Lee S, Royce T, Tanner S. Canvas: versatile and scalable
detection of copy number variants. Bioinformatics 2016, 32(15):2375–2377.
93. Dharanipragada P, Vogeti S, Parekh N. iCopyDAV: Integrated platform for
copy number variations-Detection, annotation and visualization. PLoS One
2018, 13(4):e0195334.
94. Cameron DL, Di Stefano L, Papenfuss AT. Comprehensive evaluation and
characterisation of short read general-purpose structural variant calling soft-
ware. Nat Commun 2019, 10(1):3240.
95. Kosugi S, Momozawa Y, Liu X, Terao C, Kubo M, Kamatani Y. Comprehensive
evaluation of structural variation detection algorithms for whole genome
sequencing. Genome Biol 2019, 20(1):117.
96. Wong K, Keane TM, Stalker J, Adams DJ. Enhanced structural variant and
breakpoint detection using SVMerge by integration of multiple detection
methods and local assembly. Genome Biol 2010, 11(12):R128.
97. Zarate S, Carroll A, Mahmoud M, Krasheninina O, Jun G, Salerno WJ, Schatz
MC, Boerwinkle E, Gibbs RA, Sedlazeck FJ. Parliament2: Accurate structural
variant calling at scale. GigaScience 2020, 9(12):giaa145.
98. Becker T, Lee WP, Leone J, Zhu Q, Zhang C, Liu S, Sargent J, Shanker K, Mil-
Homens A, Cerveira E et al. FusorSV: an algorithm for optimally combining data
from multiple structural variation detection methods. Genome Biol 2018, 19(1):38.
99. Jeffares DC, Jolly C, Hoti M, Speed D, Shaw L, Rallis C, Balloux F, Dessimoz
C, Bahler J, Sedlazeck FJ. Transient structural variations have strong effects
on quantitative traits and reproductive isolation in fission yeast. Nat Commun
2017, 8:14061.
100. Mohiyuddin M, Mu JC, Li J, Bani Asadi N, Gerstein MB, Abyzov A, Wong
WH, Lam HY. MetaSV: an accurate and integrative structural-variant caller
for next generation sequencing. Bioinformatics 2015, 31(16):2741–2744.
11
Clinical Sequencing and Detection of Actionable Variants
Next-generation sequencing has not only altered the landscape of life
science research; its impact on clinical diagnosis, prognosis, and interven-
tion selection has also become increasingly evident. The launch of precision
or personalized health initiatives worldwide is a testament to the power of
NGS in improving human health, and also a key driver for integrating NGS
into medical practice. From the rapid development of clinical sequencing, it
is apparent that medicine guided by personal genome information is the future
of medical practice. Compared to research-oriented NGS, clinical sequencing
is subject to more regulation, as is required for other clinical tests of patient
samples, to ensure accurate and reliable results. In the United States, clin-
ical sequencing is mostly regulated by the Food and Drug Administration
(FDA) and the Centers for Medicare & Medicaid Services through the Clinical
Laboratory Improvement Amendments (CLIA). Many countries around the
world and international organizations such as ISO have similar regulations.
Diagnosis, prognosis, and treatment of oncologic and pediatric diseases are
two exemplary areas that have seen great benefits from clinical sequencing.
As cancer is a disease of the genome, NGS is well suited to unravel tumor
heterogeneity and to classify tumors into different types or subtypes based on
the genomic variants they possess [1]. Sequencing of various oncological
gene panels, whole exomes, and increasingly whole genomes has become
increasingly commonplace and provides much-needed guidance for clin-
ical action. Tumor mutation burden, an overall index of the total number
of nonsynonymous mutations in a genome measured by NGS, serves as a
good indicator of immunotherapy efficacy [2]. For pediatric patients, espe-
cially those in the neonatal intensive care unit (NICU), speedy diagnosis and
treatment are essential, which require rapid sequencing and data processing.
The use of NGS in the NICU setting is a test of its speed, accuracy, and overall
utility in meeting clinical needs. The development of rapid genome sequen-
cing pipelines, including bioinformatics, has been shown to decrease infant
morbidity and at the same time lead to cost savings [3].
Often different from a research setting, the immediate goal of clinical sequen-
cing is to identify the disease-causing variant(s) on which a treatment plan
can be based. The major challenge in achieving this goal is identifying the
causal variant(s) for the primary indication from thousands or even millions
acid extraction [8], can help decrease the frequency of such false-positive
discoveries.
Other surgical biopsy materials, including fine needle aspiration and core
needle biopsy, are also regularly used to sample discrete site(s) of diseased
tissue. A potential pitfall here is that they may not be representative of the
remaining, unsampled parts of the tissue. Some diseases, with cancer being
the best known, are characterized by clonality and cellular heterogeneity.
For these diseases, sampling with tissue biopsies may not reveal the entire
set of mutations. Because of their invasiveness, tissue biopsies are also not
suitable for repeated sample collection at regular intervals to track disease
progression and monitor treatment outcome.
Liquid biopsy represents a more recently developed sample type that is
less invasive and therefore better suited for real-time, longitudinal clinical
tracking and monitoring. Instead of sampling diseased tissue directly, liquid
biopsy collects cells or DNA that are shed from diseased tissue into the
blood or another bodily fluid (such as urine). For example, circulating tumor
cells (CTCs) or circulating cell-free DNA (cfDNA) are increasingly used as
input. Because the amount of cfDNA in plasma or other bodily fluids is rather
low, and circulating tumor DNA (ctDNA) constitutes only a minor fraction of
it (<0.1–10%), it is crucial to use a cfDNA extraction method that provides
high efficiency and recovery [9, 10]. While cancer patients usually have more
cfDNA than healthy individuals [11], the extraction yield generally ranges
from below 10 ng to 100 ng of cfDNA per mL of plasma (usually below
20 ng). The fragment size of cfDNA is mostly within the range of 160–200 bp
[12]. Side-by-side comparisons have demonstrated high levels of concordance
in mutation detection between cfDNA and matched tissue biopsies [11, 13].
Because of its low invasiveness and high accuracy, liquid biopsy has been
used for multiple clinical applications, including detection of minimal
residual disease in patients who are in remission, and early screening of
healthy individuals before a disease manifests itself [14].
Prior to sample collection, patients need to be counseled and informed con-
sent must be obtained. Besides the affected individual (proband), often
the proband’s parents and/or other family members may also need to be
sequenced. This is especially required to determine whether a mutation is
passed on from parents or formed de novo, for which samples are collected
from the proband and their biological parents for trio sequencing. During
counseling, the purpose of performing genetic testing and the types of results
anticipated are conveyed to the patient and their family members. In addition,
the patient and family members are also informed of the test’s limitations and
potential risks. For example, the test may not reveal a genetic link, the inter-
pretation may involve uncertainty, or the findings may be distressing instead
of reassuring. Further, interpretation of the results may change over time,
and the test results may have implications for other untested family members
and their lives.
with the matched control DNA sequenced to ≥30× [18]. The sequencing depth
required is largely determined by the accuracy of sequencing (e.g., an error
rate of 0.1–1% on Illumina sequencers; refer to Chapter 4 for more details),
and by other molecular steps such as PCR amplification during library
preparation. To reach an even lower LOD (e.g., VAF <1%), strategies to improve
NGS accuracy and reduce PCR errors have been deployed, including signal-to-
noise correction methodologies and single-molecule consensus sequencing
schemes [19]. One such strategy, Duplex Sequencing, is built on molecular
barcoding. Molecular barcoding uses so-called unique molecular identifiers
(UMIs) to label single molecules prior to PCR amplification. From sequences
generated from PCR duplicates that carry the same UMI, i.e., those derived
from the same molecule, a consensus sequence is derived that corrects errors
introduced during amplification (except in the first cycle) as well as random
sequencing errors. In Duplex Sequencing, the two strands of the original DNA
duplex generate two separate consensus sequences, and comparing them
yields a duplex consensus sequence, further removing errors introduced
during the first PCR cycle. With Duplex Sequencing, somatic mutations that
occur at a frequency of 10⁻⁵ or lower can be detected with high confidence [20].
Because it relies on generating single-molecule consensus sequences by
sequencing multiple amplified copies of the same original molecule in a
strand-specific fashion and then collapsing them, this technique requires
many more reads than conventional NGS to reach such low LOD levels.
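To make the UMI-family and duplex-consensus logic concrete, below is a minimal Python sketch (not taken from any particular Duplex Sequencing implementation) that groups toy reads by UMI and strand, collapses each single-strand family by majority vote, and masks positions where the two strand consensuses disagree. The input tuples, sequences, and function names are all illustrative, and reads from the bottom strand are assumed to have already been reverse-complemented into a common orientation.

from collections import Counter, defaultdict

def consensus(seqs):
    # Majority-vote consensus across equal-length reads of one UMI family.
    return "".join(Counter(bases).most_common(1)[0][0] for bases in zip(*seqs))

def collapse_umi_families(reads):
    # reads: iterable of (umi, strand, sequence) tuples (illustrative format).
    # Returns one single-strand consensus per (umi, strand) family.
    families = defaultdict(list)
    for umi, strand, seq in reads:
        families[(umi, strand)].append(seq)
    return {key: consensus(seqs) for key, seqs in families.items()}

def duplex_consensus(top, bottom):
    # Keep a base only where the two strand consensuses of the same original
    # duplex agree; otherwise mask it with 'N'.
    return "".join(a if a == b else "N" for a, b in zip(top, bottom))

# Toy example: three PCR duplicates per strand of one original molecule.
reads = [
    ("AACGT", "+", "ACGTACGT"), ("AACGT", "+", "ACGTACGT"), ("AACGT", "+", "ACGAACGT"),
    ("AACGT", "-", "ACGTACGT"), ("AACGT", "-", "ACGTACCT"), ("AACGT", "-", "ACGTACGT"),
]
families = collapse_umi_families(reads)
print(duplex_consensus(families[("AACGT", "+")], families[("AACGT", "-")]))  # ACGTACGT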
For QC/QA during library preparation and sequencing, sample and
data quality need to be checked at multiple steps. Prior to proceeding to
library preparation, input DNA quality and quantity need to be assessed
against pre-defined criteria, which may vary depending on the intended
detection targets of the sequencing assay. For example, for detection of
structural variants, which needs long-range genomic information, FFPE
samples are not recommended even with the use of the remedial measures
mentioned above. Once an appropriate library preparation workflow is
chosen, it needs to be standardized to ensure consistent performance,
and multiple QC/QA steps should be specified to monitor the workflow
at key junctures, culminating in checks of library yield and fragment size
range. To help assess workflow performance, reference DNA samples with
known variants can be used as positive controls. Such reference samples
include well-characterized reference DNA from the Genome in a Bottle
(GIAB) consortium supported by the U.S. National Institute of Standards
and Technology [21], engineered DNA that contains clinically relevant
synthetic variants at pre-defined allele frequencies (available from com-
mercial sources), and clinical samples that have been analyzed by another
CLIA-accredited NGS lab. Sequencing run quality metrics, such as the per-
centage of reads over Q30 and the overall error rate, need to be monitored
and must pass pre-defined quality thresholds.
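As an illustration of one such run-level metric, the short Python sketch below computes the fraction of basecalls at or above Q30 directly from a FASTQ file, assuming the standard four-line FASTQ layout and Phred+33 quality encoding; the file name in the usage comment is hypothetical.

import gzip

def fraction_q30(fastq_path, offset=33, threshold=30):
    # Fraction of basecalls at or above Q30 in a plain or gzipped FASTQ file.
    opener = gzip.open if fastq_path.endswith(".gz") else open
    total = passing = 0
    with opener(fastq_path, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 3:  # every fourth line holds the quality string
                quals = [ord(c) - offset for c in line.rstrip("\n")]
                total += len(quals)
                passing += sum(q >= threshold for q in quals)
    return passing / total if total else 0.0

# print(f"%Q30 = {100 * fraction_q30('run1_R1.fastq.gz'):.1f}")  # hypothetical file name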
11.3 Variant Filtering
The process of going from a long list of called variants to a clinical testing
report (Figure 11.1) is the major focus of this chapter. Among the large number
of called variants, most are benign and do not have an impact on human
health. For example, a typical WGS run on an individual's germline DNA
usually identifies 5 million or more variants, of which only around 30,000 are
located in protein-coding regions. Of these variants, ~10,000 represent missense
amino acid substitutions, aberrant splicing sites, or small indels [28]. These variants
are further narrowed down to produce a short list of clinically relevant and
actionable variants, to assist clinicians in disease diagnosis, prognosis, and
treatment. This multi-step process requires a multitude of tools and databases
to screen the called variants based on their frequency, functional consequence,
known linkage to human disease(s), and match of clinical phenotype, as well
as mode of inheritance.
FIGURE 11.1
Clinical sequencing general data analytic workflow starting from variant calls: variant filtering (by frequency, functional impact, known evidence, phenotype match, and inheritance mode), followed by variant ranking and prioritization, variant pathogenicity classification, expert review, and variant validation.
Besides these in silico variant filtering steps, wet lab
strategies such as pedigree sequencing can also help to reliably and signifi-
cantly reduce the number of potential candidate variants.
Presented below are major steps of the variant filtering workflow. Prior to
performing these steps, preliminary variant screening is needed to filter out
variants that do not pass a predefined variant call quality threshold. While
the following filtering steps are usually performed on all variants that pass
the threshold, these steps can also be applied to a pre-selected list of genes
that are known to be associated with the phenotype/disease of the patient,
for the purpose of minimizing incidental findings and reducing analytic
burden. Each of the filtering steps detailed below focuses on one aspect of a
variant that is relevant to its potential role in causing the disease or phenotype.
11.3.1 Frequency of Occurrence
Many called variants are too common to be consistent with the low incidence
of a genetic disease. To identify disease-causing variant(s), such common
variants need to be filtered out. While the threshold for occurrence frequency
can be set at different levels, an MAF (minor allele frequency) of less than 1%
is often used. To determine the occurrence frequency of a called variant in
the general population, large databases of human genetic variations, such as
gnomAD [29], the 1000 Genomes Project (1KGP) database [30], TOPMed [31],
UK10K [32], and NHLBI Exome Sequencing Project (ESP) [33], are often used
(Table 11.1). These databases contain mostly SNVs and short indels.
Some of these databases, such as gnomAD and 1KGP, also contain common
structural variants, but most SVs are catalogued by specific databases including
the Database of Genomic Variants (DGV) [34], Database of Chromosomal
Imbalance and Phenotype in Humans using Ensembl Resources (DECIPHER)
[35], and dbVar [36]. It should be noted that variant allele frequency is often
population specific. For example, the minor allele of an SNV in the EML6 gene,
rs17046386 (A>G), is not rare in African populations, but rare in non-African
populations. Therefore, the ancestral background of the affected individual
should be taken into consideration in this step.
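A minimal sketch of this frequency filter is given below. It assumes each called variant has already been annotated with a population allele frequency (the field name gnomad_af is just a placeholder for whatever the annotation tool emits); variants absent from the population databases are retained as potentially rare.

MAF_CUTOFF = 0.01  # commonly used threshold; adjust per assay and population

def rare_variants(variants, af_field="gnomad_af"):
    # variants: list of dicts carrying an annotated population allele frequency.
    # Variants lacking a frequency (absent from the database) are retained.
    kept = []
    for v in variants:
        af = v.get(af_field)
        if af is None or af < MAF_CUTOFF:
            kept.append(v)
    return kept

variants = [
    {"id": "chr1:12345A>G", "gnomad_af": 0.23},    # common -> filtered out
    {"id": "chr2:67890C>T", "gnomad_af": 0.0004},  # rare   -> kept
    {"id": "chr3:13579G>A"},                       # novel  -> kept
]
print([v["id"] for v in rare_variants(variants)])  # ['chr2:67890C>T', 'chr3:13579G>A']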
11.3.2 Functional Consequence
Variants that change amino acid residues in the active site of a protein may
significantly affect its function. Variants located at other conserved base
positions, such as those that affect gene transcript splicing or transcription
initiation, may also exert significant effects on the gene product. On the other
hand, the functional significance of variants that fall into intergenic regions is
often hard to assess. To sort variants based on their genomic locations, e.g.,
often hard to assess. To sort variants based on their genomic locations, e.g.,
those in protein-coding, regulatory (e.g., intron, splicing site, 5’ or 3’ UTR,
promoter, etc.), or intergenic non-coding regions, variant annotation tools
such as ANNOVAR, VEP, or VariantAnnotation [37] can be used. Intergenic
or non-coding variants are usually filtered out, unless they are predicted to
have regulatory functions such as affecting gene splicing. To predict potential
variant pathogenicity caused by altered splicing, SpliceAI [38], MaxEntScan
[39], and NNSplice [40] are among the best performing methods, based on
currently available benchmarking studies [41–44]. For amino acid-altering
variants, a variety of tools are available to predict their potential impacts to
help determine whether they should be filtered out. Such tools, based on their
underlying algorithm, can be divided into three groups: function prediction,
11.3.5 Mode of Inheritance
Family medical history, if available, can also greatly aid the variant filtering
process. For a Mendelian disorder, the five basic modes of inheritance are
autosomal dominant, autosomal recessive, sex-linked dominant, sex-linked
recessive, and mitochondrial. Traditional pedigree analysis can reveal the
mode of inheritance of such a disorder. From the variants called from a
proband and their pedigree (most commonly via trio sequencing), those
that do not conform to the expected inheritance pattern are filtered out.
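The sketch below illustrates two such inheritance filters on toy trio genotypes: retaining candidate de novo variants (present in the proband but absent in both parents) and candidate autosomal recessive variants (homozygous in the proband, heterozygous in both parents). The genotype encoding and variant records are illustrative only, and a real implementation would also consider genotype quality and the sex chromosomes.

def fits_de_novo(v):
    # v: dict of genotypes, e.g. {"proband": "0/1", "mother": "0/0", "father": "0/0"}
    return v["proband"] != "0/0" and v["mother"] == "0/0" and v["father"] == "0/0"

def fits_autosomal_recessive(v):
    return v["proband"] == "1/1" and v["mother"] == "0/1" and v["father"] == "0/1"

trio_calls = [
    {"id": "varA", "proband": "0/1", "mother": "0/0", "father": "0/0"},  # de novo candidate
    {"id": "varB", "proband": "1/1", "mother": "0/1", "father": "0/1"},  # recessive candidate
    {"id": "varC", "proband": "0/1", "mother": "0/1", "father": "0/0"},  # inherited from mother
]
print([v["id"] for v in trio_calls if fits_de_novo(v) or fits_autosomal_recessive(v)])
# ['varA', 'varB']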
TABLE 11.1
Major Tools and Databases for Filtering Variants Using Various Criteria (columns: Name, Full Name, Description, Reference).
mutational hot spot affecting a key functional domain of the coded protein, to
those presumed to be de novo; and (4) Supporting (PP), which covers evidence
ranging from co-segregation with disease in multiple members of the affected
family in a gene known to cause the disease, to a variant being called pathogenic
by a reputable source without available evidence.
For benign variants, the evidence of benign impact is divided into three
levels: (1) Stand-Alone (coded as BA1), which covers variants with an allele
frequency above 5% in ESP, 1KGP, or gnomAD; (2) Strong (BS), which covers
evidence ranging from an allele frequency higher than expected for the disease
to a lack of segregation in the affected family; and (3) Supporting (BP), which
covers evidence ranging from missense variants in a gene where truncating
variants are the primary mechanism of disease, to synonymous variants with
no evolutionary conservation and no predicted effect on splicing.
For detailed definitions and interpretation of the various levels of evidence,
and the rules for classifying variants into one of the five tiers based on different
combinations of these types of evidence (summarized in Figure 11.2), readers
can refer to the original ACMG/AMP publication ([89], particularly Tables 3,
4, and 5) and subsequent revisions. To follow the ACMG/AMP guidelines,
all available evidence on the pathogenicity or benignity of a variant needs
to be considered. Such evidence includes evidence gathered from the current
case, data in public databases (such as those listed in Table 11.1) and the sci-
entific literature, and/or the clinical sequencing lab's internal data. By
reviewing the aggregated evidence and then applying the ACMG/AMP
guidelines, variants can be classified. Open-source or commercial tools,
such as InterVar (Clinical Interpretation of Genetic Variants, open source)
and its web version wInterVar [90], or QIAGEN Clinical Insight (QCI)
Interpret (commercial), can help classify variants by automating the
application of many of the criteria in the ACMG/AMP guidelines.
CardioClassifier [91] and CardioVAI [92] are examples of decision support
tools developed for particular diseases; both aim to classify variants in genes
related to cardiac disease. Prior to reporting, output from decision support
tools needs to be examined, and modified if necessary, by testing lab personnel.
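To illustrate in principle how such automated rule application works, the Python sketch below tallies evidence codes and applies only a small, illustrative subset of the published ACMG/AMP combining rules; supporting-level (PP/BP) combinations and conflict resolution are omitted, so it is not a substitute for the full rule set in the original guidelines or for tools such as InterVar.

from collections import Counter

def classify(evidence_codes):
    # evidence_codes: e.g. ["PVS1", "PS3", "PM2"]; only a few of the published
    # combining rules are implemented here, for illustration only.
    counts = Counter(code[:3] if code.startswith("PVS") else code[:2]
                     for code in evidence_codes)
    pvs, ps, pm = counts["PVS"], counts["PS"], counts["PM"]
    ba, bs = counts["BA"], counts["BS"]
    if ba >= 1 or bs >= 2:
        return "Benign"
    if (pvs >= 1 and ps >= 1) or ps >= 2:
        return "Pathogenic"
    if (pvs >= 1 and pm >= 1) or (ps == 1 and 1 <= pm <= 2):
        return "Likely pathogenic"
    return "Uncertain significance"

print(classify(["PVS1", "PS3"]))  # Pathogenic
print(classify(["PM2", "PP3"]))   # Uncertain significance
print(classify(["BA1"]))          # Benign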
FIGURE 11.2
ACMG/AMP rules on how to classify variants based on combinations of pathogenicity/
benignity evidence at different levels. (From SE Brnich, EA Rivera-Munoz, JS Berg, Quantifying
the potential of functional evidence to reclassify variants of uncertain significance in the
categorical and Bayesian interpretation frameworks, Human Mutation 2018, 39(11):1531–1541.
With permission.)
benign variants. To categorize a variant into one of these four tiers, four levels
of evidence are evaluated: (1) Level A, variants that can serve as biomarkers to
predict response or resistance to approved therapies, or that are included in
professional guidelines for a specific cancer type; (2) Level B, those that can
are also available to assist with the variant reporting process. While they
cannot entirely replace clinical geneticists and pathologists, these AI-based
decision support tools can help filter out most non-reportable variants to
allow experts to focus on a more manageable number of potentially report-
able variants, making this key analytic process more efficient.
11.6.2 Expert Review
While the variant reporting process can be automated with promising results
using the tools introduced above, review of reportable pathogenic or likely
pathogenic germline variants, or Tier I or II somatic variants, by molecular
pathologists and oncologists is still needed to examine the entire evidence
matrix for their pathogenic/oncogenic role prior to reporting. Current med-
ical literature and database entries should be checked for their links with
known diseases or implicated biological pathways. This step should also be
contextualized in consideration of the patient’s phenotype and family his-
tory. Such expert manual review is needed to prevent non-pathogenic/non-
oncogenic variants from being reported as false positives.
11.6.4 Variant Validation
While traditionally germline SNVs or small indels on a clinical report need to
be validated using an orthogonal technology, with Sanger sequencing being
the gold standard, a number of studies have shown that this is not always
necessary and that NGS is often more accurate than Sanger sequencing [102, 103].
CNVs are typically validated using orthogonal techniques such as multiplex
ligation-dependent probe amplification (MLPA) or array comparative genomic
hybridization (aCGH), but validation using an NGS CNV pipeline has also
been reported [104]. Somatic variants detected in cancerous samples, on
the other hand, should be validated using an orthogonal method such as
Sanger sequencing, digital PCR, or pyrosequencing, with the latter two
especially suited for variants detected at low frequencies. It should be noted
again that, for both germline and somatic variants, the preliminary screening
step mentioned earlier to filter out low-quality variant calls is important for
minimizing the rate of false-positive variants. In addition, to ensure the call
accuracy of the variants on the final report, their raw reads should be manually
examined by visually checking the pileup of reads that align to their genomic
locations using a visualization tool such as the Integrative Genomics
Viewer (IGV). This simple step can help further reduce false positives caused
by factors such as insufficient sequencing coverage. The use of these QC
measures, as well as sound medical judgment and good clinical practice, are
needed not only for labs that develop their own custom steps, but also for
those that opt to use a commercially available platform. In the latter case,
although the service provider may offer a pre-validated platform, the testing
lab still needs to go through familiarization and optimization, and validate
that the platform meets the designed analytical goals of the test. To help val-
idate a new or existing pipeline and evaluate its performance, the reference
standard samples used for validating the lab workflow (see Section 11.1.2),
including the GIAB reference DNA and bio-engineered DNA that contains
synthetic variants at predefined frequencies, are equally valuable here since
they provide the ground truth. In addition, bioinformatically generated ref-
erence datasets created with programs such as BAMSurgeon [111] may also be used.
For the validation, all analytic steps in the pipeline need to be clearly
defined with required hardware and software specified for each step. Such
specifics include hardware configuration, operating system, name and
version number of software and their dependencies, data storage and trans-
mission system, and network connection protocol. In addition, other analytic
details such as parameters used in each software, reference genome used for
alignment, and databases accessed for annotation and filtering, should also
be specified. Any read-altering operations, such as trimming, should be
fully evaluated to determine whether they are appropriate or need to be
revised or dropped. Quality metrics for each of the analytic steps, such as
mean on-target read coverage and the percentage of target genomic regions
with coverage above a threshold (for read alignment), and the depth of
coverage at each called variant (for variant calling), should be compared to
pre-defined performance criteria. Applied variant filters should be evaluated
carefully to make sure that true positives are not filtered out.
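As one example of how such alignment-level metrics can be computed, the sketch below summarizes mean on-target coverage and the percentage of target positions at or above a depth cutoff from a per-base depth table, such as the three-column output of samtools depth restricted to the target regions; the file name and the 20x cutoff are illustrative.

def coverage_metrics(depth_tsv, min_depth=20):
    # depth_tsv: per-base depth table with tab-separated columns chrom, position,
    # depth (e.g. produced by 'samtools depth -a -b targets.bed aln.bam').
    n = total = at_or_above = 0
    with open(depth_tsv) as fh:
        for line in fh:
            depth = int(line.rstrip("\n").split("\t")[2])
            n += 1
            total += depth
            at_or_above += depth >= min_depth
    mean_cov = total / n if n else 0.0
    pct_covered = 100 * at_or_above / n if n else 0.0
    return mean_cov, pct_covered

# mean_cov, pct20x = coverage_metrics("sample.targets.depth.tsv", min_depth=20)
# print(f"mean on-target coverage: {mean_cov:.1f}x; positions >=20x: {pct20x:.1f}%")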
If the pipeline uses internally developed software tools and scripts, the com-
puter code should be deposited in a source code repository, such as GitHub,
BitBucket, or SourceForge. If using externally developed software, the source
code should also be documented if accessible. Also as part of the validation
process, the strategies established to back up data and maintain the integ-
rity of raw and analyzed files during transfer should be evaluated. On the
issue of legal compliance, the pipeline should follow all applicable laws at
the national and local levels. In the United States, the applicable laws include
the Health Insurance Portability and Accountability Act (HIPAA) and other
national/state/local laws that pertain to clinical genetic testing. According
to these laws, patient genetic information needs to be protected like other
patient information including patient identity and other health records, and
should be secured throughout the analytic process. When using a commer-
cial system, it is the responsibility of the testing lab to verify such compliance
issues. To avoid accidental mixing of patient data, based on the AMP/CAP
guidelines, the identity of a patient sample must be preserved throughout the
analytic process using at least four unique identifiers to encompass sample,
patient, run, and test location [110]. As all analytic pipelines have limitations,
such limitations for a validated pipeline should be clearly documented
References
1. Patel LR, Nykter M, Chen K, Zhang W. Cancer genome sequen-
cing: understanding malignancy as a disease of the genome, its conformation,
and its evolution. Cancer Lett 2013, 340(2):152–160.
2. Steuer CE, Ramalingam SS. Tumor mutation burden: leading immunotherapy
to the era of precision medicine? J Clin Oncol 2018, 36(7):631–632.
3. Farnaes L, Hildreth A, Sweeney NM, Clark MM, Chowdhury S, Nahas
S, Cakici JA, Benson W, Kaplan RH, Kronick R et al. Rapid whole-genome
sequencing decreases infant morbidity and cost of hospitalization. NPJ Genom
Med 2018, 3:10.
4. Jones S, Anagnostou V, Lytle K, Parpart-Li S, Nesselbush M, Riley DR,
Shukla M, Chesnick B, Kadan M, Papp E et al. Personalized genomic ana-
lyses for cancer mutation discovery and interpretation. Sci Transl Med 2015,
7(283):283ra253.
5. Srinivasan M, Sedmak D, Jewell S. Effect of fixatives and tissue pro-
cessing on the content and integrity of nucleic acids. Am J Pathol 2002,
161(6):1961–1971.
6. Hedegaard J, Thorsen K, Lund MK, Hein AM, Hamilton-Dutoit SJ, Vang S,
Nordentoft I, Birkenkamp-Demtroder K, Kruhoffer M, Hager H et al. Next-
generation sequencing of RNA and DNA isolated from paired fresh-frozen
and formalin-fixed paraffin-embedded samples of human cancer and normal
tissue. PLoS One 2014, 9(5):e98187.
7. Do H, Dobrovic A. Sequence artifacts in DNA from formalin-fixed
tissues: causes and strategies for minimization. Clin Chem 2015, 61(1):64–71.
8. McDonough SJ, Bhagwate A, Sun ZF, Wang C, Zschunke M, Gorman JA, Kopp
KJ, Cunningham JM. Use of FFPE-derived DNA in next generation sequen-
cing: DNA extraction methods. PLoS One 2019, 14(4).
9. Oreskovic A, Brault ND, Panpradist N, Lai JJ, Lutz BR. Analytical comparison
of methods for extraction of short cell-free DNA from urine. J Mol Diagn 2019,
21(6):1067–1078.
10. Diefenbach RJ, Lee JH, Kefford RF, Rizos H. Evaluation of commercial kits for
purification of circulating free DNA. Cancer Genet 2018, 228–229:21–27.
11. Alborelli I, Generali D, Jermann P, Cappelletti MR, Ferrero G, Scaggiante B,
Bortul M, Zanconati F, Nicolet S, Haegele J et al. Cell-free DNA analysis in
healthy individuals by next-generation sequencing: a proof of concept and
technical validation study. Cell Death Dis 2019, 10(7):534.
12. Jiang P, Chan CW, Chan KC, Cheng SH, Wong J, Wong VW, Wong GL,
Chan SL, Mok TS, Chan HL et al. Lengthening and shortening of plasma
DNA in hepatocellular carcinoma patients. Proc Natl Acad Sci U S A 2015,
112(11):E1317–1325.
13. Wyatt AW, Annala M, Aggarwal R, Beja K, Feng F, Youngren J, Foye A, Lloyd P,
Nykter M, Beer TM et al. Concordance of circulating tumor DNA and matched
metastatic tissue biopsy in prostate cancer. J Natl Cancer Inst 2017, 109(12).
14. Chen M, Zhao H. Next-generation sequencing in liquid biopsy: cancer
screening and early detection. Hum Genomics 2019, 13(1):34.
15. Jones AG, Small CM, Paczolt KA, Ratterman NL. A practical guide to methods
of parentage analysis. Mol Ecol Resour 2010, 10(1):6–30.
16. Zhang L, Dong X, Lee M, Maslov AY, Wang T, Vijg J. Single-cell whole-
genome sequencing reveals the functional landscape of somatic mutations
in B lymphocytes across the human lifespan. Proc Natl Acad Sci U S A 2019,
116(18):9014–9019.
17. Petrackova A, Vasinek M, Sedlarikova L, Dyskova T, Schneiderova P, Novosad
T, Papajik T, Kriegova E. Standardization of sequencing coverage depth in
NGS: recommendation for detection of clonal and subclonal mutations in
cancer diagnostics. Front Oncol 2019, 9:851.
18. Meggendorfer M, Jobanputra V, Wrzeszczynski KO, Roepman P, de Bruijn E,
Cuppen E, Buttner R, Caldas C, Grimmond S, Mullighan CG et al. Analytical
demands to use whole-genome sequencing in precision oncology. Semin
Cancer Biol 2021, 84:16–22.
19. Salk JJ, Schmitt MW, Loeb LA. Enhancing the accuracy of next-generation
sequencing for detecting rare and subclonal mutations. Nat Rev Genet 2018,
19(5):269–285.
20. Schmitt MW, Kennedy SR, Salk JJ, Fox EJ, Hiatt JB, Loeb LA. Detection of
ultra-rare mutations by next-generation sequencing. Proc Natl Acad Sci U S A
2012, 109(36):14508–14513.
21. Zook JM, McDaniel J, Olson ND, Wagner J, Parikh H, Heaton H, Irvine SA, Trigg
L, Truty R, McLean CY et al. An open resource for accurately benchmarking
small variant and reference calls. Nat Biotechnol 2019, 37(5):561–566.
22. Miller NA, Farrow EG, Gibson M, Willig LK, Twist G, Yoo B, Marrs T, Corder
S, Krivohlavek L, Walter A et al. A 26-hour system of highly sensitive whole
genome sequencing for emergency management of genetic diseases. Genome
Med 2015, 7(1):100.
23. Mestek-Boukhibar L, Clement E, Jones WD, Drury S, Ocaka L, Gagunashvili
A, Le Quesne Stabej P, Bacchelli C, Jani N, Rahman S et al. Rapid Paediatric
Sequencing (RaPS): comprehensive real-life workflow for rapid diagnosis of
critically ill children. J Med Genet 2018, 55(11):721–728.
24. Kendig KI, Baheti S, Bockol MA, Drucker TM, Hart SN, Heldenbrand JR,
Hernaez M, Hudson ME, Kalmbach MT, Klee EW et al. Sentieon DNASeq
variant calling workflow demonstrates strong computational performance
and accuracy. Front Genet 2019, 10:736.
25. Loka TP, Tausch SH, Renard BY. Reliable variant calling during runtime of
Illumina sequencing. Sci Rep 2019, 9(1):16502.
26. Stranneheim H, Engvall M, Naess K, Lesko N, Larsson P, Dahlberg M, Andeer
R, Wredenberg A, Freyer C, Barbaro M et al. Rapid pulsed whole genome
sequencing for comprehensive acute diagnostics of inborn errors of metab-
olism. BMC Genomics 2014, 15:1090.
27. Clark MM, Hildreth A, Batalov S, Ding Y, Chowdhury S, Watkins K, Ellsworth
K, Camp B, Kint CI, Yacoubian C et al. Diagnosis of genetic diseases in seriously
103. Beck TF, Mullikin JC, Program NCS, Biesecker LG. Systematic evaluation of
Sanger validation of next-generation sequencing variants. Clin Chem 2016,
62(4):647–654.
104. Kerkhof J, Schenkel LC, Reilly J, McRobbie S, Aref-Eshghi E, Stuart A, Rupar
CA, Adams P, Hegele RA, Lin H et al. Clinical validation of copy number
variant detection from targeted next-generation sequencing panels. J Mol
Diagn 2017, 19(6):905–920.
105. Miller DT, Lee K, Chung WK, Gordon AS, Herman GE, Klein TE, Stewart
DR, Amendola LM, Adelman K, Bale SJ et al. ACMG SF v3.0 list for reporting
of secondary findings in clinical exome and genome sequencing: a policy
statement of the American College of Medical Genetics and Genomics
(ACMG). Genet Med 2021, 23(8):1381–1390.
106. Miller DT, Lee K, Gordon AS, Amendola LM, Adelman K, Bale SJ, Chung
WK, Gollob MH, Harrison SM, Herman GE et al. Recommendations for
reporting of secondary findings in clinical exome and genome sequencing,
2021 update: a policy statement of the American College of Medical Genetics
and Genomics (ACMG). Genet Med 2021, 23(8):1391–1398.
107. Appelbaum PS, Parens E, Berger SM, Chung WK, Burke W. Is there a
duty to reinterpret genetic data? The ethical dimensions. Genet Med 2020,
22(3):633–639.
108. Clayton EW, Appelbaum PS, Chung WK, Marchant GE, Roberts JL, Evans BJ.
Does the law require reinterpretation and return of revised genomic results?
Genet Med 2021, 23(5):833–836.
109. Deignan JL, Chung WK, Kearney HM, Monaghan KG, Rehder CW, Chao
EC, Committee ALQA. Points to consider in the reevaluation and reanalysis
of genomic test results: a statement of the American College of Medical
Genetics and Genomics (ACMG). Genet Med 2019, 21(6):1267–1270.
110. Roy S, Coldren C, Karunamurthy A, Kip NS, Klee EW, Lincoln SE, Leon
A, Pullambhatla M, Temple-Smolkin RL, Voelkerding KV et al. Standards
and Guidelines for Validating Next-Generation Sequencing Bioinformatics
Pipelines: A Joint Recommendation of the Association for Molecular
Pathology and the College of American Pathologists. J Mol Diagn 2018,
20(1):4–27.
111. Ewing AD, Houlahan KE, Hu Y, Ellrott K, Caloian C, Yamaguchi TN, Bare JC,
P’ng C, Waggott D, Sabelnykova VY et al. Combining tumor genome simu-
lation with crowdsourcing to benchmark somatic single-nucleotide-variant
detection. Nat Methods 2015, 12(7):623–630.
12
De Novo Genome Assembly with
Long and/or Short Reads
12.2 Assembly of Contigs
12.2.1 Sequence Data Preprocessing, Error Correction,
and Assessment of Genome Characteristics
The de novo assembly of a genome from NGS reads is a multi-step process
(Figure 12.1). As the first step, sequence data quality needs to be inspected.
Data QC steps described in Chapter 5 can be performed here to examine per-
base error rate, quality score distribution, read size distribution, contamin-
ation of adaptor sequences, etc. Low-quality reads need to be filtered out,
and portions of reads that contain low-quality basecalls (usually the 3’ end),
ambiguities (reported as Ns), or adaptor sequences should be trimmed off.
As part of data preprocessing, paired-end reads whose sequences partially
overlap need to be merged to generate longer reads. Read merging can also
correct errors when discrepancies at some base positions are observed, in
which case the higher-quality basecall is used. The merging process can be
handled by tools such as FLASH2 [16], PEAR [17], fastq-join [18], PANDAseq
[19], and VSEARCH [20].
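The simplified Python sketch below illustrates the underlying idea of read merging: the mate is reverse-complemented, the longest acceptable suffix-prefix overlap is located, and at mismatching overlap positions the basecall with the higher quality score is kept. The tools listed above use far more rigorous overlap statistics; all names and thresholds here are illustrative, and quality scores are passed as lists of integers.

def revcomp(seq):
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]

def merge_pair(r1, q1, r2, q2, min_overlap=10):
    # Naive merge of a read pair whose 3' ends overlap; returns (sequence, qualities)
    # or None if no acceptable overlap is found.
    r2, q2 = revcomp(r2), q2[::-1]
    for olen in range(min(len(r1), len(r2)), min_overlap - 1, -1):
        a, qa = r1[-olen:], q1[-olen:]
        b, qb = r2[:olen], q2[:olen]
        mismatches = sum(x != y for x, y in zip(a, b))
        if mismatches <= max(1, olen // 10):  # crude mismatch tolerance
            bases = [x if qx >= qy else y for x, y, qx, qy in zip(a, b, qa, qb)]
            quals = [max(qx, qy) for qx, qy in zip(qa, qb)]
            return r1[:-olen] + "".join(bases) + r2[olen:], list(q1[:-olen]) + quals + list(q2[olen:])
    return None

# merge_pair("ACGTACGTCC", [30] * 10, "TTTTGGACGT", [30] * 10, min_overlap=5)
# merges the pair over a 6 bp overlap into "ACGTACGTCCAAAA".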
Sequencing error correction is an important step for de novo read assembly,
more so than for most other NGS applications, because the assembly process
is much more sensitive to such errors. The data QC measures mentioned
above cannot completely remove sequencing errors, as high basecall quality
scores alone cannot guarantee that a read is free of sequencing errors.
If left uncorrected, the errors will lead to prolonged computational time,
FIGURE 12.1
General workflow for de novo genome assembly: contig assembly (greedy, OLC, and de Bruijn graph approaches), scaffold construction, gap closure, and genome assembly quality evaluation.
FIGURE 12.2
The coverage profile (density vs. k-mer coverage) of true k-mers and k-mers containing sequencing errors. (Adapted from Kelley D.R. et al. (2010) Quake: quality-aware detection and correction of sequencing errors, Genome Biology, 11:R116. Used under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0). © 2010 Kelley et al.)
Meryl can also be input into GenomeScope [32] to estimate genome size, level
of duplication, and heterozygosity. Besides k-mer filtering, suffix tree/array
and multiple sequence alignment (MSA) approaches are also often used to
correct sequencing errors. Fiona, for example, is based on the suffix tree/
array approach, while Coral is based on the MSA approach, which aligns
reads that share common k-mers and makes error corrections based on the
alignment results and consensus sequences.
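The k-mer coverage idea behind such error detection (see Figure 12.2) can be sketched in a few lines of Python: k-mers are counted across all reads, and those observed fewer times than a coverage cutoff are flagged as likely error k-mers. In practice the cutoff is chosen from the valley of the k-mer coverage histogram rather than fixed, as it is in this illustration.

from collections import Counter

def kmer_counts(reads, k=21):
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def split_by_cutoff(counts, cutoff=3):
    # k-mers seen fewer than 'cutoff' times are treated as likely error k-mers.
    trusted = {kmer for kmer, c in counts.items() if c >= cutoff}
    suspect = {kmer for kmer, c in counts.items() if c < cutoff}
    return trusted, suspect

reads = ["ATTACGTCGA", "ATTACGTCGA", "ATTACGACGA"]   # the last read carries an error
trusted, suspect = split_by_cutoff(kmer_counts(reads, k=5), cutoff=2)
print(sorted(suspect))  # low-coverage k-mers introduced by the erroneous read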
For long reads from PacBio or ONT, which have higher error rates, sequencing
error correction is even more necessary and can be performed with tools such
as FMLRC [33], PBcR [34], LoRDEC [35], LSC [36], Nanocorr [37], proovread
[38], and DeepConsensus [39]. Many long-read assemblers (to be detailed
next) contain their own error correction modules, including Canu [40], Hifiasm
[41], FALCON [42], MARVEL [43], MECAT [44], and NECAT [14]. In general,
long-read error correction methods can be divided into two groups: one
uses redundant information in the long reads alone for error correction, while
the other uses a hybrid approach that leverages more accurate short reads to
help correct errors in the long reads. Long-read-only correction methods in
the first group include LoRMA [45] and FLAS [46]. The error correction
methods used by Canu, MARVEL, and MECAT are also in this group, since
they are based on consensus correction using sampling redundancy (overlaps)
within the long reads. As an example, to make corrections Canu builds
overlaps between long reads using a probabilistic sequence overlapping
algorithm called the MinHash alignment process (MHAP) [47]. The over-
lapping results are then used to identify regions that need correction based
on sequence consensus, followed by estimation of corrected read lengths and
generation of corrected reads. Besides error correction, MARVEL also has a
"patching" module, which aims to repair apparent large-scale errors (e.g.,
regions that contain many errors) based on comparisons between reads.
Examples of the hybrid approach, which uses more accurate short reads to
correct error-prone long reads, include FMLRC, PBcR, LoRDEC, LSC, Nanocorr,
and proovread. Based on how they work, these methods can be further divided
into those based on assembly and those based on alignment. Assembly-based
correction methods, such as FMLRC and LoRDEC, first use the short reads to
perform an assembly, and then align the long reads to the assemblies to make
corrections. Alignment-based methods, including PBcR, LSC, Nanocorr, and
proovread, align short reads to long reads, and sequencing errors in the long
reads are corrected based on the alignment results.
FIGURE 12.3
Comparison of the OLC (A) and the de Bruijn graph (B) approaches for global de novo genome
assembly. In the OLC example, six sequence reads (R1–R6) are shown for the illustrated genomic
region, with each read being 10 bp in length and the overlap between them set at ≥ 5 bp. The
reads are laid out in order based on how they overlap. The OLC graph is shown at the bottom,
with many nodes having more than one incoming or outgoing connection. In the de Bruijn
graph example, the reads are cut into a series of k-mers (k=5). In total there are 16 such k-mers,
many of which occur in more than one read. The k-mers are arranged sequentially based on how
they overlap, and the de Bruijn graph built from this approach is shown at the bottom. Different
from those in the OLC graph, the majority of the nodes in this graph have only one incoming
and one outgoing connection. (From Li Z. et al. Comparison of the two major classes of assembly
algorithms: overlap–layout–consensus and de-bruijn-graph. Briefings in Functional Genomics,
2012, 11(1): 25–37, by permission of Oxford University Press.)
no longer use this approach, as it cannot take advantage of the global rela-
tionship offered by paired-end and mate-pair reads.
The OLC and the de Bruijn graph approaches are global by design, and both
assemble reads into contigs using read overlap information, based on the
Lander-Waterman model. They approach the task, however, in different
ways (Figure 12.3). The OLC approach, as the name suggests, involves
three steps: (1) detecting potential overlaps between all reads; (2) laying
out all reads with their overlaps in a graph; and (3) constructing a consensus
sequence superstring. The first step is computationally intensive, and its
run time increases quadratically with the total number of reads. The graph
created in the second step consists of vertices (or nodes) representing reads,
and edges between them representing their overlaps. The construction of a
consensus sequence superstring is equivalent to finding a path in the graph
that visits each node exactly once, which is known as a
Hamiltonian path in graph theory. While there are a small number of OLC-
based short-read assemblers available such as Edena [52] and Fermi [53],
the OLC approach is more suitable and mostly used to assemble long reads
generated from PacBio and ONT sequencers. In fact, most long read de novo
assembly methods are based on OLC, including Canu, FALCON, Hifiasm,
MECAT, NECAT, miniasm [54], HINGE [55], Peregrine [56], Shasta [57],
Raven [58], NextDenovo [59], and SMARTdenovo [60]. In the case of Canu,
after the aforementioned error correction process, unsupported sequences
are trimmed off to prepare corrected reads for assembly. In the assembly
stage, reads are scanned one last time for errors, and then used to construct
an overlap graph, before consensus contig sequences and an assembly graph
are output. HiCanu is a modified version of Canu developed for PacBio
HiFi (CCS) reads [61].
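The overlap step of the OLC approach can be illustrated with the naive Python sketch below, which records suffix-prefix overlaps of at least a minimum length between toy reads as edges of an overlap graph. The layout and consensus steps of a real assembler are omitted, and the quadratic all-versus-all comparison shown here is exactly the cost that practical tools reduce with indexing or MinHash-style sketching.

def suffix_prefix_overlap(a, b, min_len=5):
    # Length of the longest suffix of 'a' matching a prefix of 'b' (>= min_len), else 0.
    for olen in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-olen:] == b[:olen]:
            return olen
    return 0

def overlap_graph(reads, min_len=5):
    # Edges (i, j, overlap_length) between reads; exact matches only, no error tolerance.
    edges = []
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                olen = suffix_prefix_overlap(a, b, min_len)
                if olen:
                    edges.append((i, j, olen))
    return edges

reads = ["ATTACGTCGA", "GTCGATTGCA", "TTGCAGGACT"]   # toy 10 bp reads
print(overlap_graph(reads))  # [(0, 1, 5), (1, 2, 5)]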
Solving the Hamiltonian path problem in the OLC approach is NP-hard. To
reduce the high computing demand imposed by the OLC approach, a simpli-
fied representation called the string graph has been employed to merge and
reduce redundant vertices and edges, along with identification and removal
of false vertices and edges [62]. The implementation of a string indexing data
structure (the FM-index) has improved the performance of assemblers such
as SGA and ReadJoiner [63]. FALCON also first builds a string graph, which
contains bubbles representing structural variation between haplotypes,
through a process called HGAP (hierarchical genome assembly process). The
subsequent "Unzip" process creates a haplotype-resolved assembly graph,
making FALCON a diploid-aware assembler. Other long-read assemblers,
such as miniasm, Hifiasm, NECAT, and NextDenovo, also use string graphs
in their assembly processes. Hifiasm, for example, produces a fully phased
assembly for each haplotype from a phased string graph built from PacBio
HiFi reads.
Compared to the OLC approach, the de Bruijn graph-based approach takes
an alternative, computationally more tractable route. This approach does not
involve a step to find all possible overlaps between reads. Instead, the reads
are first cut into k-mers. For instance, the sequence read ATTACGTCGA can
be cut into the series of k-mers ATT, TTA, TAC, ACG, CGT, GTC, TCG, and
CGA when k = 3. These k-mers are then used as vertices in the de Bruijn
graph. An edge connecting two nodes represents an overlap of k − 1 bases
between them; e.g., the edge that connects ATT and TTA corresponds to
ATTA. Using the de Bruijn graph, the assembly process is equivalent to
finding a shortest path that traverses each edge at least once, which is known
as the Chinese Postman Problem in graph theory. An Eulerian path, if it
exists, represents the solution to this problem. Computationally, finding an
Eulerian path is much easier than finding a Hamiltonian path in the OLC
approach. The major drawback of this approach, however, is that it is highly
sensitive to sequencing errors. Therefore, to use assemblers in this category,
error correction is mandatory. Assemblers that use this approach include
ABySS [3], ALLPATHS-LG, Euler-SR [29], SOAPdenovo/SOAPdenovo2 [4, 64],
SPAdes [65], and Velvet [2]. Because of their higher sequencing error rates,
long reads are not usually assembled with this approach. There are, however,
a small number of long-read assemblers that employ variants of the de Bruijn graph.
For example, Flye [66] uses the A-Bruijn variant, which tolerates the higher
12.2.3 Polishing
Besides error correction prior to genome assembly, assembly quality can be
further improved after draft assembly is created through a process called
polishing. In general terms, the polishing process uses information from
alignment of reads to the draft assembly as input, examines how reads map
to the draft assembly at each location, and then decides whether assembly
sequences at certain locations need to be modified. To perform this process,
there are a selection of tools available, including Pilon [71], Medaka [72],
Racon [73], Nanopolish [74], MarginPolish & HELEN [57], NextPolish [75],
POLCA [76], and NeuralPolish [77]. These polishers typically use either
short or long reads to polish a draft assembly. For example, Pilon takes as
input an assembled genome in FASTA format, and alignment of short reads
to the assembly in BAM format. After searching the alignment for incon-
sistencies, assembly sequences are modified to provide improvements to
the input assembly through reduction of mismatches and indels, as well
as gap filling and misassembly identification. Medaka, the first neural
network- based polisher, on the other hand, uses ONT long reads for
polishing through creation of consensus sequences. As input it requires a
draft assembly in FASTA format and basecalls in either FASTA or FASTQ
format. Prior to creating consensus sequences, the alignment of the reads
to the input assembly is carried out by minimap2. Nanopolish uses a third
approach. Instead of using called bases, it takes raw sequencing signals
recorded from an ONT sequencer as input, and applies an HMM-based
signal-to-assembly analysis to generate improved consensus sequences for
the draft assembly.
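The core idea shared by these polishers, using read support at each assembly position to revise the draft sequence, can be illustrated with the toy Python sketch below, which replaces each draft base with the majority basecall among the reads covering it. Real polishers work from BAM alignments and also handle indels, quality weighting, and signal-level evidence, all of which are ignored here; the gapless alignment tuples are purely illustrative.

from collections import Counter

def polish(draft, alignments):
    # alignments: list of (start, read_seq) pairs giving gapless placements of reads
    # on the draft. Each draft position is replaced by the majority basecall among
    # the reads covering it; uncovered positions keep the draft base.
    piles = [Counter() for _ in draft]
    for start, seq in alignments:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(draft):
                piles[pos][base] += 1
    return "".join(p.most_common(1)[0][0] if p else d for p, d in zip(piles, draft))

draft = "ACGTTCGA"
alignments = [(0, "ACGAT"), (2, "GATCGA"), (3, "ATCGA")]
print(polish(draft, alignments))  # ACGATCGA (the fourth base is corrected from T to A)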
FIGURE 12.4
Assembling contigs into a scaffold. (In the figure, paired-end reads derived from DNA fragments link Contig 1 and Contig 2 into a scaffold; the gap between the two contigs has an approximate length but no known sequence.)
FIGURE 12.5
Gap closing with the TGS-GapCloser workflow. Panel (A) displays the overall pipeline. Panel
(B) shows the steps of identification of gap regions and long-read candidates that map to the
gap regions, and error correction of long-read candidates using short reads. Panel (C) details
how the gaps are filled with error-corrected long reads. As input, TGS-GapCloser can accept
long reads generated from any platform or other preassembled contigs to fill gaps in a draft
assembly. The unknown nucleotides marked as N’s between two neighboring contigs in the
scaffolds are identified as gap regions. Long reads are then mapped to the gap regions using
minimap2 to acquire candidate fragments. The subsequent error correction on the long-read
candidates is carried out with Pilon or Racon. The corrected new long read candidates are then
realigned to the gaps, and the read with the best match for a gap is used to fill the gap. To increase
computational efficiency, overlapped candidates are clipped and merged. (Adapted from Xu
M. et al. (2020) TGS-GapCloser: A fast and accurate gap closer for large genomes with low
coverage of error-prone long reads. GigaScience, 9, 1–11. Used under the terms of the Creative
Commons Attribution License (http://creativecommons.org/licenses/by/4.0). © 2020 Xu et al.)
length, but the most commonly used statistics are N50 and L50. To calcu-
late the N50, all contigs (or scaffolds) are first ranked based on length from
the largest to the smallest. Their lengths are then summed up from the
largest contig (or scaffold) downward. N50 refers to the size of the contig
(or scaffold) at which the summed length becomes greater than or equal to
50% of the total assembly size. L50, on the other hand, refers to the smallest
number of contigs (or scaffolds) that together add up to at least 50% of the
total assembly size.
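Both statistics are straightforward to compute from a list of contig or scaffold lengths, as the short Python sketch below shows.

def n50_l50(lengths):
    # N50 and L50 from a list of contig (or scaffold) lengths.
    lengths = sorted(lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for i, length in enumerate(lengths, start=1):
        running += length
        if running >= half:
            return length, i  # (N50, L50)

print(n50_l50([80, 70, 50, 40, 30, 20]))  # total 290 -> (70, 2)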
The total assembly size, however, does not measure the completeness of
the assembly. To determine completeness, the original DNA reads are aligned
to the assembly and the percentage of reads aligned is calculated. Other
sequence data from the same species, such as RNA-seq data, may also be used
for the alignment and a rough estimation of completeness. To measure
accuracy, the assembly can be compared to a high-quality reference
genome of the species, if such a reference is available. This comparison can
be carried out on two aspects of the assembly: base accuracy and alignment
accuracy. Base accuracy determines whether the right base is called in the
assembly at a given position, while alignment accuracy examines the
probability of placing a sequence at the right position and in the right
orientation. In many cases, however, a reference genome is not available;
producing one is the very goal of the assembly process. For these cases, a
measure of internal consistency, obtained by aligning the original reads to
the assembly and checking for evenness and congruence of coverage across
the assembly, provides an indicator of assembly quality. Comparison of the
assembly with independently acquired sequences from the same species,
such as gene or cDNA sequences, can also be used to estimate assembly
accuracy. For software to evaluate assembly quality, often-used tools include
BUSCO [96] and QUAST [97], which help perform the above measurements
and compare different contig and scaffold assembly algorithms and settings.
the assembled sequences are usually fragmented and exist in the suboptimal
form of large numbers of contigs. Among the contigs, there are also cer-
tain (sometimes high) levels of falsely assembled contigs, due to chimeric
joining. In addition, the gapped regions between the assembled contigs may
not be filled completely. To overcome some of these limitations and increase
assembly quality, the use of a reference genome, even from a remotely related
species, can be very helpful. This reference- assisted assembly approach
works especially well when scaffolding information from paired reads is not
available or exhausted. With the quickly increasing number of sequenced
genomes, improving assembly quality with this reference-assisted approach
becomes more feasible. Some tools have recently been developed to pro-
vide this functionality, either as dedicated packages such as RaGOO [98],
Chromosomer [99], and RACA [100], or as components of existing assemblers
including ALLPATHS-LG, IDBA-Hybrid, and Velvet.
With the constant advancements achieved in long read sequencing,
the landscape of de novo genome assembly has been shifting. The publica-
tion of a gapless human genome assembly by the Telomere-to-Telomere
(T2T) Consortium has demonstrated the utility of long reads in de novo
genome assembly [101]. Building on this success, more and better-designed
assemblers are bound to be continuously developed to take full advantage
of what new sequencing technologies have to offer. Undoubtedly such future
developments will further overcome the limitations and challenges facing
the community today.
References
1. Schatz MC, Delcher AL, Salzberg SL. Assembly of large genomes using
second-generation sequencing. Genome Res 2010, 20(9):1165–1173.
2. Zerbino DR, Birney E. Velvet: algorithms for de novo short read assembly
using de Bruijn graphs. Genome Res 2008, 18(5):821–829.
3. Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJ, Birol I. ABySS: a parallel
assembler for short read sequence data. Genome Res 2009, 19(6):1117–1123.
4. Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K
et al. De novo assembly of human genomes with massively parallel short read
sequencing. Genome Res 2010, 20(2):265–272.
5. English AC, Richards S, Han Y, Wang M, Vee V, Qu J, Qin X, Muzny DM,
Reid JG, Worley KC et al. Mind the gap: upgrading genomes with Pacific
Biosciences RS long-read sequencing technology. PLoS One 2012, 7(11):e47768.
6. Rayamajhi N, Cheng CC, Catchen JM. Evaluating Illumina-, Nanopore-, and
PacBio-based genome assembly strategies with the bald notothen, Trematomus
borchgrevinki. G3 2022, 12(11):jkac192.
7. van Heesch S, Kloosterman WP, Lansu N, Ruzius FP, Levandowsky E, Lee CC,
Zhou S, Goldstein S, Schwartz DC, Harkins TT et al. Improving mammalian
42. Chin CS, Peluso P, Sedlazeck FJ, Nattestad M, Concepcion GT, Clum A, Dunn
C, O’Malley R, Figueroa-Balderas R, Morales-Cruz A et al. Phased diploid
genome assembly with single-molecule real-time sequencing. Nat Methods
2016, 13(12):1050–1054.
43. Nowoshilow S, Schloissnig S, Fei JF, Dahl A, Pang AWC, Pippel M, Winkler S,
Hastie AR, Young G, Roscito JG et al. The axolotl genome and the evolution of
key tissue formation regulators. Nature 2018, 554(7690):50–55.
44. Xiao CL, Chen Y, Xie SQ, Chen KN, Wang Y, Han Y, Luo F, Xie Z. MECAT: fast
mapping, error correction, and de novo assembly for single-molecule sequen-
cing reads. Nat Methods 2017, 14(11):1072–1074.
45. Salmela L, Walve R, Rivals E, Ukkonen E. Accurate self-correction of errors in
long reads using de Bruijn graphs. Bioinformatics 2017, 33(6):799–806.
46. Bao E, Xie F, Song C, Song D. FLAS: fast and high-throughput algorithm for
PacBio long-read self-correction. Bioinformatics 2019, 35(20):3953–3960.
47. Berlin K, Koren S, Chin CS, Drake JP, Landolin JM, Phillippy AM. Assembling
large genomes with single-molecule sequencing and locality-sensitive
hashing. Nat Biotechnol 2015, 33(6):623–630.
48. Lander ES, Waterman MS. Genomic mapping by fingerprinting random
clones: a mathematical analysis. Genomics 1988, 2(3):231–239.
49. Warren RL, Sutton GG, Jones SJ, Holt RA. Assembling millions of short DNA
sequences using SSAKE. Bioinformatics 2007, 23(4):500–501.
50. Dohm JC, Lottaz C, Borodina T, Himmelbauer H. SHARCGS, a fast and highly
accurate short-read assembly algorithm for de novo genomic sequencing.
Genome Res 2007, 17(11):1697–1706.
51. Jeck WR, Reinhardt JA, Baltrus DA, Hickenbotham MT, Magrini V, Mardis ER,
Dangl JL, Jones CD. Extending assembly of short DNA sequences to handle
error. Bioinformatics 2007, 23(21):2942–2944.
52. Hernandez D, Francois P, Farinelli L, Osteras M, Schrenzel J. De novo bacterial
genome sequencing: millions of very short reads assembled on a desktop com-
puter. Genome Res 2008, 18(5):802–809.
53. Li H. Exploring single-sample SNP and INDEL calling with whole-genome de
novo assembly. Bioinformatics 2012, 28(14):1838–1844.
54. Li H. Minimap and miniasm: fast mapping and de novo assembly for noisy
long sequences. Bioinformatics 2016, 32(14):2103–2110.
55. Kamath GM, Shomorony I, Xia F, Courtade TA, Tse DN. HINGE: long-
read assembly achieves optimal repeat resolution. Genome Res 2017,
27(5):747–756.
56. Chin C-S, Khalak A. Human Genome Assembly in 100 Minutes. bioRxiv 2019,
doi: https://doi.org/10.1101/705616
57. Shafin K, Pesout T, Lorig-Roach R, Haukness M, Olsen HE, Bosworth C,
Armstrong J, Tigyi K, Maurer N, Koren S et al. Nanopore sequencing and the
Shasta toolkit enable efficient de novo assembly of eleven human genomes.
Nat Biotechnol 2020, 38(9):1044–1053.
58. Vaser R, Šikić M. Time-and memory-efficient genome assembly with Raven.
Nat Comput Sci 2021, 1(5):332–336.
59. NextDenovo (https://github.com/Nextomics/NextDenovo)
60. Liu H, Wu S, Li A, Ruan J. SMARTdenovo: a de novo assembler using long
noisy reads. Gigabyte 2021:1–9.
61. Nurk S, Walenz BP, Rhie A, Vollger MR, Logsdon GA, Grothe R, Miga KH,
Eichler EE, Phillippy AM, Koren S. HiCanu: accurate assembly of segmental
duplications, satellites, and allelic variants from high-fidelity long reads.
Genome Res 2020, 30(9):1291–1305.
62. Myers EW. Toward simplifying and accurately formulating fragment
assembly. J Comput Biol 1995, 2(2):275–290.
63. Gonnella G, Kurtz S. Readjoiner: a fast and memory efficient string graph-based
sequence assembler. BMC Bioinformatics 2012, 13:82.
64. Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J, He G, Chen Y, Pan Q, Liu Y et al.
SOAPdenovo2: an empirically improved memory-efficient short-read de novo
assembler. GigaScience 2012, 1(1):18.
65. Nurk S, Bankevich A, Antipov D, Gurevich AA, Korobeynikov A, Lapidus A,
Prjibelski AD, Pyshkin A, Sirotkin A, Sirotkin Y et al. Assembling single-cell
genomes and mini-metagenomes from chimeric MDA products. J Comput Biol
2013, 20(10):714–737.
66. Kolmogorov M, Yuan J, Lin Y, Pevzner PA. Assembly of long, error-prone
reads using repeat graphs. Nat Biotechnol 2019, 37(5):540–546.
67. Ruan J, Li H. Fast and accurate long-read assembly with wtdbg2. Nat Methods
2020, 17(2):155–158.
68. Zimin AV, Marcais G, Puiu D, Roberts M, Salzberg SL, Yorke JA. The MaSuRCA
genome assembler. Bioinformatics 2013, 29(21):2669–2677.
69. Ye C, Hill CM, Wu S, Ruan J, Ma ZS. DBG2OLC: Efficient Assembly of Large
Genomes Using Long Erroneous Reads of the Third Generation Sequencing
Technologies. Sci Rep 2016, 6:31900.
70. Di Genova A, Buena-Atienza E, Ossowski S, Sagot MF. Efficient hybrid de
novo assembly of human genomes with WENGAN. Nat Biotechnol 2021,
39(4):422–430.
71. Walker BJ, Abeel T, Shea T, Priest M, Abouelliel A, Sakthikumar S, Cuomo CA,
Zeng Q, Wortman J, Young SK et al. Pilon: an integrated tool for comprehen-
sive microbial variant detection and genome assembly improvement. PLoS
One 2014, 9(11):e112963.
72. Medaka (https://github.com/nanoporetech/medaka)
73. Vaser R, Sovic I, Nagarajan N, Sikic M. Fast and accurate de novo genome
assembly from long uncorrected reads. Genome Res 2017, 27(5):737–746.
74. Loman NJ, Quick J, Simpson JT. A complete bacterial genome assembled
de novo using only nanopore sequencing data. Nat Methods 2015,
12(8):733–735.
75. Hu J, Fan J, Sun Z, Liu S. NextPolish: a fast and efficient genome polishing tool
for long-read assembly. Bioinformatics 2020, 36(7):2253–2255.
76. Zimin AV, Salzberg SL. The genome polishing tool POLCA makes fast
and accurate corrections in genome assemblies. PLoS Comput Biol 2020,
16(6):e1007981.
77. Huang N, Nie F, Ni P, Luo F, Gao X, Wang J. NeuralPolish: a novel Nanopore
polishing method based on alignment matrix construction and orthogonal Bi-
GRU Networks. Bioinformatics 2021, 37(19):3120–3127.
78. Boetzer M, Pirovano W. SSPACE-LongRead: scaffolding bacterial draft
genomes using long read sequence information. BMC Bioinformatics 2014,
15:211.
79. Warren RL, Yang C, Vandervalk BP, Behsaz B, Lagman A, Jones SJ, Birol I.
LINKS: Scalable, alignment-free scaffolding of draft genomes with long reads.
GigaScience 2015, 4:35.
80. Gao S, Bertrand D, Chia BK, Nagarajan N. OPERA-LG: efficient and exact
scaffolding of large, repeat-rich eukaryotic genomes with performance guar-
antees. Genome Biol 2016, 17:102.
81. Qin M, Wu S, Li A, Zhao F, Feng H, Ding L, Ruan J. LRScaf: improving draft
genomes using long noisy reads. BMC Genomics 2019, 20(1):955.
82. SMIS (Single Molecular Integrative Scaffolding) (www.sanger.ac.uk/tool/
smis/)
83. Nguyen SH, Cao MD, Coin LJM. Real-time resolution of short-read assembly
graph using ONT long reads. PLoS Comput Biol 2021, 17(1):e1008586.
84. Putnam NH, O’Connell BL, Stites JC, Rice BJ, Blanchette M, Calef R, Troll
CJ, Fields A, Hartley PD, Sugnet CW et al. Chromosome-scale shotgun
assembly using an in vitro method for long-range linkage. Genome Res 2016,
26(3):342–350.
85. Dudchenko O, Batra SS, Omer AD, Nyquist SK, Hoeger M, Durand NC,
Shamim MS, Machol I, Lander ES, Aiden AP et al. De novo assembly of the
Aedes aegypti genome using Hi-C yields chromosome-length scaffolds.
Science 2017, 356(6333):92–95.
86. Ghurye J, Rhie A, Walenz BP, Schmitt A, Selvaraj S, Pop M, Phillippy AM,
Koren S. Integrating Hi-C links with assembly graphs for chromosome-scale
assembly. PLoS Comput Biol 2019, 15(8):e1007273.
87. Yeo S, Coombe L, Warren RL, Chu J, Birol I. ARCS: scaffolding genome drafts
with linked reads. Bioinformatics 2018, 34(5):725–731.
88. Kuleshov V, Snyder MP, Batzoglou S. Genome assembly from synthetic long
read clouds. Bioinformatics 2016, 32(12):i216–i224.
89. Guo L, Xu M, Wang W, Gu S, Zhao X, Chen F, Wang O, Xu X, Seim I, Fan G
et al. SLR-superscaffolder: a de novo scaffolding tool for synthetic long reads
using a top-to-bottom scheme. BMC Bioinformatics 2021, 22(1):158.
90. Kosugi S, Hirakawa H, Tabata S. GMcloser: closing gaps in assemblies accur-
ately with a likelihood-based selection of contig or long-read alignments.
Bioinformatics 2015, 31(23):3733–3741.
91. Paulino D, Warren RL, Vandervalk BP, Raymond A, Jackman SD, Birol I.
Sealer: a scalable gap-closing application for finishing draft genomes. BMC
Bioinformatics 2015, 16:230.
92. Xu GC, Xu TJ, Zhu R, Zhang Y, Li SQ, Wang HW, Li JT. LR_Gapcloser: a tiling
path-based gap closer that uses long reads to complete genome assembly.
GigaScience 2019, 8(1):giy157.
93. Xu M, Guo L, Gu S, Wang O, Zhang R, Peters BA, Fan G, Liu X, Xu X,
Deng L et al. TGS-GapCloser: A fast and accurate gap closer for large
genomes with low coverage of error-prone long reads. GigaScience 2020,
9(9):giaa094.
94. Zimin AV, Salzberg SL. The SAMBA tool uses long reads to improve the con-
tiguity of genome assemblies. PLoS Comput Biol 2022, 18(2):e1009860.
95. Schmeing S, Robinson MD. Gapless provides combined scaffolding, gap
filling and assembly correction with long reads. bioRxiv 2022, doi: https://doi.
org/10.1101/2022.03.08.483466
96. Simao FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM.
BUSCO: assessing genome assembly and annotation completeness with
single-copy orthologs. Bioinformatics 2015, 31(19):3210–3212.
97. Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool
for genome assemblies. Bioinformatics 2013, 29(8):1072–1075.
98. Alonge M, Soyk S, Ramakrishnan S, Wang X, Goodwin S, Sedlazeck FJ,
Lippman ZB, Schatz MC. RaGOO: fast and accurate reference-guided
scaffolding of draft genomes. Genome Biol 2019, 20(1):224.
99. Tamazian G, Dobrynin P, Krasheninnikova K, Komissarov A, Koepfli KP,
O’Brien SJ. Chromosomer: a reference-based genome arrangement tool for
producing draft chromosome sequences. GigaScience 2016, 5(1):38.
100. Kim J, Larkin DM, Cai Q, Asan, Zhang Y, Ge RL, Auvil L, Capitanu B, Zhang
G, Lewin HA et al. Reference-assisted chromosome assembly. Proc Natl Acad
Sci U S A 2013, 110(5):1785–1790.
101. Nurk S, Koren S, Rhie A, Rautiainen M, Bzikadze AV, Mikheenko A, Vollger
MR, Altemose N, Uralsky L, Gershman A et al. The complete sequence of a
human genome. Science 2022, 376(6588):44–53.
13
Mapping Protein-DNA Interactions
with ChIP-Seq
13.1 Principle of ChIP-Seq
Without the involvement of DNA-interacting proteins, the information
coded in DNA cannot be accessed, transcribed, or maintained. Besides a
large number of transcription factors and coactivators, key DNA-interacting
proteins include histones, DNA and RNA polymerases, and enzymes for DNA
repair and modification (e.g., methylation). Through their DNA-interacting
domains, such as helix-turn-helix, zinc finger, and leucine zipper domains,
these proteins interact with their DNA targets by hydrogen bonding, hydro-
phobic interactions, or base stacking. Because the intimate relationship
between DNA and these proteins plays an important role in the functioning
of the genome, studying how proteins and DNA interact and where DNA-
interacting proteins bind across the genome provides key insights into the
many roles these proteins play in various aspects of genomic function,
including information exposure, transcription, and maintenance.
ChIP-seq is an NGS-based technology to locate binding sites of a DNA-
interacting protein in the genome. A typical application of ChIP-seq is to
study transcription factor binding profiles in the genome under different
conditions, such as developmental stages or pathological conditions.
To achieve this, the protein of interest is first cross-linked covalently to DNA
in cells with a chemical agent, usually formaldehyde (Figure 13.1). Then the
cells are disrupted, and subsequently sonicated or enzymatically digested to
shear chromatin into fragments that contain 100–300 bp DNA, followed by
enrichment of the target protein with its bound DNA by immunoprecipitation
using an antibody specific for the protein. Subsequently, the enriched
protein-DNA complex is dissociated by reversing the cross-links previously
formed between them, and the released DNA fragments are subjected to
NGS. One key experimental factor in the ChIP-seq process is the quality of
the antibody used in the enrichment step, as the use of a poor-quality anti-
body can lead to high experimental noise due to non-specific precipitation
of DNA fragments.
Based on the size of the region that they bind, DNA-interacting proteins can
be divided into three groups: (1) punctate-binding: these proteins, usually
FIGURE 13.1
The basic steps of ChIP-seq. (From AM Szalkowski, CD Schmid, Rapid innovation in ChIP-seq
peak-calling algorithms is outdistancing benchmarking efforts, Briefings in Bioinformatics 2011,
12(6):626–33. With permission.)
13.2 Experimental Design
13.2.1 Experimental Control
Appropriate control for a ChIP-seq experiment is key to accounting for
artifacts or biases that might be introduced into the experimental process.
These artifacts and biases may include potential antibody cross-reactivity
with non-specific protein factors, higher signal from open chromatin regions
(since they are more easily fragmented than closed regions [1]), and uneven
sequencing of captured genomic regions due to variations in base composition.
Two major types of controls are usually set up for ChIP-seq signal adjustment.
One is input control, i.e., chromatin extracted from cells or tissues,
which are subjected to the same cross-linking and fragmentation procedure
but without the immunoprecipitation process. The other is “mock” control,
which is processed by the same procedure including immunoprecipitation;
the immunoprecipitation, however, is carried out using an irrelevant anti-
body (e.g., IgG). While it may seem to be the better control of the two,
the “mock” control often produces much less DNA for sequencing than real
experimental ChIP samples. Although sequencing can be carried out on
amplified DNA in this circumstance, the amplification process adds
additional artifacts and biases to the sequencing data, which justifies the use of
input DNA as the experimental control in many cases.
13.2.2 Library Preparation
To prepare ChIP DNA for library prep, 1–10 million cells are typically needed
[2]. Within this range, studies of broad-binding protein factors require
fewer cells than those of punctate-binding proteins. In terms of the amount
of immunoprecipitated DNA required for library prep, while it is often
suggested to start with 5–10 ng ChIP DNA, it is common to obtain less DNA,
which often still generates high-quality libraries for sequencing. For library
prep, the steps involved include end repair, A-tailing, incorporation of 3’ and
5’ adapter sequences, size selection, and PCR amplification for final library
generation. The number of cycles used in the PCR amplification step is
important, and overamplification should be avoided as it can affect fragment
representation and library complexity. If obtaining large numbers of cells
is challenging or not feasible, alternative methods such as CUT&RUN
(Cleavage Under Targets and Release Using Nuclease), which can generate
high-quality data from as few as 100 cells due to its low background [3], can
be employed.
13.2.4 Replication
To examine the reproducibility of a ChIP-seq experiment, and also reduce
FDR, replicate samples should be used. If a protein of interest binds to
regions of the genome with high affinity, the bound regions should be iden-
tified in replicate samples. Regions that are not identified in replicates are
most likely due to experimental noise. Pearson correlation coefficient
(PCC) between biological replicates serves as a measurement of experi-
mental reproducibility, while irreproducible discovery rate (IDR) is another
such metric. The calculation and usage of PCC and IDR are detailed later in
this chapter.
FIGURE 13.2
Basic ChIP-seq data analysis workflow: experimental design (controls, sequencing depth,
replication), data QC, read mapping, peak calling, peak visualization, and functional analysis
(peaks assigned to nearby genes).
FIGURE 13.3
(Tracks shown: RNA polymerase II ChIP-seq, input DNA, interferon-γ–stimulated STAT1
ChIP-seq, interferon-γ–stimulated input DNA, and mappable bases (1 kb); genes in the displayed
region include SLC25A17 and ST13.)
Background noise and signal profiles in a ChIP-seq experiment. Shown here is the density of
mapped reads in one region of human chromosome 22 for RNA polymerase II and the
transcription factor STAT1 (tracks 1 and 3, counting from the top), respectively. Genes coded by
the two DNA strands in this region are displayed at the bottom. Tracks 2 and 4 show the
distribution of mapped reads for the respective input DNA controls for the two proteins. It
should be noted that some of the peaks in the protein tracks are also present in their input
controls. Track 5 displays the fraction of uniquely mappable bases. (Adapted by permission from
Macmillan Publishers Ltd: Nature Biotechnology, PeakSeq enables systematic scoring of ChIP-seq
experiments relative to controls, J Rozowsky, G Euskirchen, RK Auerbach, ZD Zhang, T Gibson,
R Bjornson, N Carriero, et al., copyright 2009.)
location, with some having strong signals while others show more modest signals.
The degree of enrichment at each location is not necessarily a reflection of
its biological importance, as locations with more modest enrichment may be
just as important as those at the top of the enrichment list.
After mapping, reproducibility between replicate samples and overall simi-
larity between different samples can be examined using established tools. For
example, the multiBamSummary tool in deepTools2 can be used to check
read coverage across the entire genome in “bins” mode from two or more
input BAM files, and another tool called plotCorrelation can take the output
to compute and visualize sample correlations. The plotFingerprint tool in
the same toolset can be used to visualize cumulative read enrichment as an
indication of target DNA enrichment efficiency. This tool is most informative
for punctate-binding proteins such as transcription factors. Besides sample
correlations and the other aforementioned QC measures such as PBC, add-
itional QC analyses can also be performed. For example, visualization of the
distribution of mapped reads in the genome, using the visualization tools
introduced in Chapter 5, can offer further clues on data quality. This is
especially true when specific binding regions are already known for
the protein of interest. In comparison to those from input control samples,
sequence reads from ChIP samples should show strong clustering in these
regions.
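For illustration, the correlation and fingerprint ideas behind these deepTools modules can be sketched in a few lines of Python. The per-bin counts below are simulated rather than read from real BAM files, and all numbers are hypothetical; this is a conceptual sketch, not the deepTools implementation.

```python
import numpy as np

# Simulated per-bin read counts for two ChIP replicates (in practice these
# would come from binning mapped reads, e.g., with multiBamSummary).
rng = np.random.default_rng(0)
rep1 = rng.poisson(lam=20, size=10_000).astype(float)
rep1[rng.integers(0, rep1.size, size=100)] += 200.0   # a few enriched bins
rep2 = np.clip(rep1 + rng.normal(scale=5.0, size=rep1.size), 0, None)

# Replicate similarity: Pearson correlation of log-transformed bin counts
# (similar in spirit to the correlations plotCorrelation reports).
pcc = np.corrcoef(np.log1p(rep1), np.log1p(rep2))[0, 1]
print(f"Pearson correlation between replicates: {pcc:.3f}")

# Fingerprint-style summary: cumulative fraction of reads in bins ranked from
# lowest to highest coverage; strong enrichment concentrates reads in few bins.
ranked = np.sort(rep1)
cum_frac = np.cumsum(ranked) / ranked.sum()
top1_frac = 1 - cum_frac[int(0.99 * len(cum_frac)) - 1]
print(f"Fraction of reads in the top 1% of bins: {top1_frac:.3f}")
```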
13.3.2 Peak Calling
Peak calling, the process of finding regions of the genome to which the pro-
tein of interest binds, is a key step in ChIP-seq data analysis. It is basically
achieved through locating regions where reads are mapped at levels signifi-
cantly above the background. As this process is also applicable to ATAC-seq
(assay for transposase-accessible chromatin using sequencing) and DNase-
seq (DNase I hypersensitive sites sequencing), both of which aim to locate
transcriptionally accessible regions of the genome, many of the methods
introduced below can also be used for analysis of ATAC-seq and DNase-
seq data. Among currently available peak calling methods, the simplest is
to count the total number of reads mapped along the genome, and call each
location with the number of mapped reads over a threshold as a peak. Due to
the inherent complexity in signal generation, including uneven background
noise and other confounding experimental factors, this approach is overly
simplistic. Among the experimental factors, the way the immunoprecipitated
DNA fragments are sequenced on most platforms has a direct influence on
how peaks are called. Since the reads are usually short, only one end or both
ends of each fragment are sequenced, depending on whether single-end or
paired-end reads are produced. To locate a target protein’s binding regions,
which are represented by the immunoprecipitated DNA fragments and
not just the generated reads, peak calling algorithms need to either shift or
extend the reads to cover the actual binding areas. For example, MACS2
shifts reads mapped to the two opposite strands toward the 3’ ends, based on
the average DNA fragment length, to cover the most likely protein binding
location [13]. SPP uses a similar strategy of shifting reads mapped to the two
strands relative to each other until the strand cross-correlation coefficient
reaches its highest level, at a shift that equals the average length of the DNA
fragments (Figure 13.4) [14]. PeakSeq uses an alternative approach to extend
the reads in the 3’ direction to reach the average length of DNA fragments in
the library [15].
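As a minimal illustration of the extension strategy, the Python sketch below takes single-end read start positions and strands, extends each read to an assumed average fragment length, and builds a per-base coverage pileup over a small region. The coordinates, read length, and fragment length are invented; real peak callers of course operate genome-wide on mapped reads.

```python
import numpy as np

REGION_LEN = 2_000          # size of the illustrated region (bp)
READ_LEN = 50               # sequenced read length
FRAG_LEN = 200              # assumed average fragment length of the library

# Hypothetical mapped reads: (leftmost mapped position, strand)
reads = [(480, "+"), (495, "+"), (510, "+"), (650, "-"), (665, "-"), (680, "-")]

coverage = np.zeros(REGION_LEN, dtype=int)
for start, strand in reads:
    if strand == "+":
        # extend from the read's 5' end towards higher coordinates
        frag_start, frag_end = start, start + FRAG_LEN
    else:
        # minus-strand fragment extends towards lower coordinates from the 5' end
        frag_start, frag_end = start + READ_LEN - FRAG_LEN, start + READ_LEN
    frag_start, frag_end = max(frag_start, 0), min(frag_end, REGION_LEN)
    coverage[frag_start:frag_end] += 1

summit = int(np.argmax(coverage))
print(f"Highest fragment coverage {coverage[summit]} at position {summit}")
```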
The reads shift approach and the strand cross-correlation profile shown in
Figure 13.4 can also be used to evaluate ChIP-seq data quality. When using
short reads (usually less than 100 bases) to analyze large target genomes,
which usually results in a significant number of reads being mapped to mul-
tiple genomic locations, a “phantom” peak also exists at a shift that equals
the read length (Figure 13.5). If a run is successful, the fragment-length ChIP
peak should be significantly higher than the read-length “phantom” peak,
as well as the background signal. The aforementioned ENCODE software
phantompeakqualtools provides two indices for the examination of strand
cross-correlation: NSC, the ratio of the cross-correlation coefficient at the
fragment-length peak over that of the background, and RSC, the ratio of the
background-adjusted cross-correlation coefficient at the fragment-length peak
over that at the “phantom” peak.
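The strand cross-correlation profile itself can be approximated with a simple calculation over per-base read-start densities on the two strands, as in the Python sketch below. The data are simulated, and the NSC/RSC comments follow the description above rather than the exact phantompeakqualtools implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
GENOME, READ_LEN, FRAG_LEN = 100_000, 50, 200

# Simulated 5' read-start densities: each binding site yields plus-strand starts
# ~FRAG_LEN/2 upstream and minus-strand starts ~FRAG_LEN/2 downstream of the site.
plus = rng.poisson(0.01, GENOME).astype(float)
minus = rng.poisson(0.01, GENOME).astype(float)
for site in rng.integers(1_000, GENOME - 1_000, size=200):
    plus[site - FRAG_LEN // 2: site - FRAG_LEN // 2 + 10] += 1
    minus[site + FRAG_LEN // 2 - 10: site + FRAG_LEN // 2] += 1

def cross_corr(shift):
    # Correlation between plus-strand starts and minus-strand starts moved
    # towards 5' by `shift` bases.
    return np.corrcoef(plus[:GENOME - shift], minus[shift:])[0, 1]

shifts = np.arange(1, 401)
profile = np.array([cross_corr(s) for s in shifts])
best = int(shifts[np.argmax(profile)])
print("Cross-correlation peaks at shift", best, "(near the fragment length)")
print("cc at fragment-length shift:", round(float(profile.max()), 3))
print("cc at read-length shift:", round(float(profile[READ_LEN - 1]), 3))
print("cc at a large shift (background):", round(float(profile[-1]), 3))
# NSC = fragment-length cc / background cc; RSC = (fragment-length cc - background)
# / (read-length "phantom" cc - background). This toy simulation has no
# multi-mapping reads, so it does not produce a genuine phantom peak.
```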
Shifting reads mapped to the positive and negative strands toward the
center, or extending reads to reach the average fragment length, in order to
count the number of aggregated reads at each base pair position is the first
sub-step to peak calling. As illustrated in Figure 13.6, peak calling involves
multiple substeps. First, a signal profile is created through smoothing of
aggregated read count across each chromosome. Subsequently, background
noise needs to be defined and the signals along the genome need to be adjusted
for the background. One simple approach is to subtract read counts in the
control sample, if available, from the signal across the genome, or to use the signal/
noise ratio. Subsequently, to call peaks from the background-adjusted ChIP-
seq signal, often-used approaches include using absolute signal strength,
signal enrichment in relation to background noise, or a combination of
both. To facilitate determination of signal enrichment, statistical signifi-
cance is often computed using Poisson or negative binomial distributions.
Empirical estimation of false discovery rate (FDR) can be carried out by first
calling peaks using control data (i.e., false positives), and subsequently cal-
culating the ratio of peaks called from the control to those called from the
ChIP sample. Alternatively, the Benjamini–Hochberg approach introduced in
Chapter 7 can also be applied to correct for multiple comparisons. After peak
calling, artifactual peaks need to be filtered out, including those that contain
only one or a few reads and are most likely due to PCR artifacts, or those
that involve significantly imbalanced numbers of reads on the two strands
(see Figure 13.6).
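To make the statistical step concrete, the Python sketch below scores a few candidate windows with a Poisson model (observed ChIP read count against a background expectation taken from the scaled input control) and applies the Benjamini–Hochberg correction. The counts are invented, and the library scaling between ChIP and input is assumed to have been done already.

```python
import numpy as np
from scipy.stats import poisson

# Invented read counts in candidate windows for a ChIP sample and its input
# control; the input is assumed to be already scaled to the ChIP library size.
chip_counts = np.array([55, 12, 80, 9, 40, 7])
input_counts = np.array([10, 10, 12, 8, 11, 9])

# Poisson upper-tail p-value: probability of observing >= chip_counts reads
# when the local background rate equals the (scaled) input count.
lam = np.maximum(input_counts, 1)
pvals = poisson.sf(chip_counts - 1, lam)

# Benjamini-Hochberg adjustment of the p-values.
m = len(pvals)
order = np.argsort(pvals)
ranked = pvals[order] * m / (np.arange(m) + 1)
adjusted = np.minimum(np.minimum.accumulate(ranked[::-1])[::-1], 1.0)
qvals = np.empty(m)
qvals[order] = adjusted

for c, i, p, q in zip(chip_counts, input_counts, pvals, qvals):
    print(f"ChIP={c:3d} input={i:3d} p={p:.2e} q={q:.2e}")
```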
FIGURE 13.4
Positive- and negative-strand tag counts around a protein-binding site (plotted against position
on chromosome 1, with crosslinking and fragmentation indicated), and tag density as a function
of relative strand shift.
FIGURE 13.5
The “phantom” peak and its use in determining ChIP-seq data quality. The “phantom” peak
corresponds to the cross-correlation at the strand shift that equals the read length, while the
ChIP peak corresponds to the cross-correlation at the shift of the average DNA fragment length.
A successful run is characterized by the existence of a predominant ChIP peak and a much
weaker “phantom” peak. In marginally passed or failed runs, the former diminishes while the
latter becomes relatively much stronger. (Adapted from SG Landt, GK Marinov, A Kundaje,
P Kheradpour, F Pauli, S Batzoglou, BE Bernstein et al., ChIP-seq guidelines and practices of the
ENCODE and modENCODE consortia, Genome Research 2012, 22(9):1813–1831. Used under the
terms of the Creative Commons License (Attribution-NonCommercial 3.0 Unported License) as
described at http://creativecommons.org/licenses/by-nc/3.0/. ©2012 Cold Spring Harbor
Laboratory Press.)
FIGURE 13.6
Basic substeps of calling peaks from ChIP-seq data. The P(s) at bottom left signifies the probability
of observing a location covered by s mapped reads, and s_thresh marks the threshold for calling
a peak significant. Bottom right shows two types of possible artifactual peaks: single-strand
peaks and those based on mostly duplicate reads. (Adapted by permission from Macmillan
Publishers Ltd: Nature Methods, Computation for ChIP-seq and RNA-seq studies, S Pepke, B
Wold, A Mortazavi, copyright 2009.)
For implementation of this peak calling process, different peak callers use
different methods, which can lead to differences in final outcomes. Among
currently available peak callers, MACS2, HOMER (findPeaks module) [16],
SICER2 [17], JAMM [18], SPP, and PeakSeq are among some of the popular
ones (see Table 13.1). Among these methods, MACS2 uses the Poisson
distribution to model read distribution across the genome. To achieve robust peak
prediction, this modeling process considers the dynamic nature and effects
of local background noise as caused by biological factors such as local chro-
matin structure and genome copy number variations, and technical biases
introduced during library prep, sequencing, and mapping processes. Peaks
are called from enrichment of reads in a genomic region with statistical sig-
nificance calculated based on Poisson distribution. On FDR estimation, while
the original MACS uses an empirical approach through exchanging ChIP
and control samples, MACS2 applies the Benjamini–Hochberg method to
adjust p-values. The findPeaks module in HOMER also uses Poisson distri-
bution to identify putative peaks. To arrive at a list of high-quality peaks,
different filters are then applied to these putative peaks. These filters are
either based on the use of (1) control samples, i.e., peaks need to pass fold-
change and cumulative Poisson p-value thresholds in comparison to control
samples; (2) local read counts, i.e., the density of reads at a peak needs to be
significantly higher than that in the surrounding region; or (3) clonal signal,
to remove peaks with a high number of reads that map to only a small number
of unique positions.
TABLE 13.1
A Short List of ChIP-Seq Peak Calling Algorithms
Name Description Reference
SPP, an R package, calculates smoothed read enrichment
profile along the genome and identifies significantly enriched sites compared
to input control using methods such as WTD (Window Tag Density). The
WTD method considers sequence tag patterns immediately upstream and
downstream of the center of the binding location, thereby increasing predic-
tion accuracy. The peak calling employed in PeakSeq is a two-pass process. In
its first pass, PeakSeq uses background modeling to identify initial potential
binding regions. To adjust for background using control data, the reads
located in the initially identified potential binding regions are excluded
and the reads in the remainder of the genome in the ChIP-seq sample are
normalized to the control data by linear regression. In the second pass, target
peaks are called by scoring read-enriched target regions based on calculation
of fold enrichment in the ChIP-seq sample vs. the control, and the statistical
significance associated with each enriched target region is calculated from the binomial
distribution. More recently, newer algorithms based on the application of
deep learning approaches have begun to emerge, such as CNN-Peaks [19]
and LanceOtron [20].
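For reference, a typical MACS2 run pairs a ChIP BAM file with its input control, as in the sketch below (wrapped in Python for consistency with the other examples). The file names are placeholders, and the options shown should be checked against the documentation of the installed MACS2 version.

```python
import subprocess

# Placeholder file names; "-g hs" sets the human effective genome size and
# "-q 0.05" the FDR cutoff. Verify option names against your MACS2 version.
cmd = [
    "macs2", "callpeak",
    "-t", "chip_sample.bam",     # treatment (ChIP) alignments
    "-c", "input_control.bam",   # input control alignments
    "-f", "BAM",                 # input file format
    "-g", "hs",                  # effective genome size (human)
    "-n", "chip_sample",         # prefix for the output files
    "-q", "0.05",                # q-value (FDR) cutoff for peak calling
]
subprocess.run(cmd, check=True)
```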
To ensure the robustness of analysis results, it is recommended to use more
than one method for peak calling. While IDR is usually used to measure the
rate of irreproducible discoveries, i.e., peaks that are called in one replicate
sample but not in another, it can also be used to compare peak calling results
generated from different methods. The original use of IDR in assessing rep-
licate reproducibility is based on the rationale that peaks of high significance
are more consistently ranked across replicates and therefore have better
reproducibility than those of low significance. As shown in Figure 13.7, to
compare a pair of ranked lists of peaks identified in two replicates, IDR is
plotted against the total number of ranked peaks. Since IDR computation
relies on both high significance (more reproducible) and low significance
(less reproducible) peaks, peak calling stringency needs to be relaxed to
allow generation of both high and low confidence calls. The transition in
this plot from reliable signal to noise is an index of overall experimental
reproducibility. Because IDR is independent of any particular peak-calling
method, it can be applied to compare the performance of different peak
calling methods on a particular dataset and therefore help pick the most
appropriate method(s) (Figure 13.8). IDR can also be used to evaluate repro-
ducibility across experiments and labs.
FIGURE 13.7
Use of irreproducible discovery rate (IDR) in assessing replicate reproducibility. Panel (A) shows
the distribution of the significance scores of the peaks identified in two replicate experiments. The
IDR method computes the probability of being irreproducible for each peak, and classifies them
as being reproducible (black) or irreproducible (red). Panel (B) displays the IDR at different
rank thresholds when the peaks are sorted by the original significance score. (From T Bailey, P
Krajewski, I Ladunga, C Lefebvre, Q Li, T Liu, P Madrigal, C Taslim, J Zhang, Practical guidelines
for the comprehensive analysis of ChIP-seq data, PLoS Computational Biology 2013, 9:e1003326.
Used under the terms of the Creative Commons Attribution License (http://creativecommons.
org/licenses/by/3.0/). © 2013 Bailey et al.)
FIGURE 13.8
Evaluation of the performance of six peak callers (MACS, CisGenome, spp, SISSRs, Useq, and
QuEST) using IDR, plotted against the number of peaks called. (Adapted by permission from
Macmillan Publishers Ltd: Nature Methods, Systematic evaluation of factors influencing ChIP-seq
fidelity, Y Chen, N Negre, Q Li, JO Mieczkowska, M Slattery, T Liu, Y Zhang et al., copyright 2012.)
step. For example, to check for consistency of called peaks between replicate
samples, the aforementioned multiBamSummary tool in deepTools2 can
be used in “BED-file” mode using BED files from peak callers as input, and
the output passed on to plotCorrelation to generate visualization of samples
based on their correlation coefficients.
ChIPQC reports a number of quality metrics related to called peaks,
including peak signal strength, enrichment of reads in peaks, and relative
13.3.4 Peak Visualization
Visualizing peaks in their genomic context allows identification of overlap-
ping or nearby functional elements, and thereby facilitates peak annotation
and data interpretation. Many peak callers generate BED files containing
peak chromosomal locations, along with WIG and bedGraph track files,
all of which can be uploaded to a genome browser for peak visualization.
Examination of peak regions in a genome browser and comparison with
other data/annotation tracks allow identification of associated genomic
features, such as promoters, enhancers, and other regulatory regions.
BEDTools can also be applied to explore relationships between peaks
and other genomic landmarks, such as nearby protein-coding or non-coding
genes.
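As one example of such an exploration, BEDTools can report the nearest gene for every peak. The sketch below wraps the command in Python, with placeholder file names and the assumption that both BED files are coordinate-sorted consistently.

```python
import subprocess

# Placeholder inputs: peaks.bed from a peak caller and genes.bed with gene (or
# TSS) coordinates; both files should be coordinate-sorted consistently.
with open("peaks_nearest_gene.txt", "w") as out:
    subprocess.run(
        ["bedtools", "closest",
         "-a", "peaks.bed",    # query intervals (peaks)
         "-b", "genes.bed",    # annotation intervals (genes)
         "-d"],                # also report the distance to the nearest feature
        stdout=out, check=True)
```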
this approach, the normalized peak signal is computed as the original peak
read count scaled by the sum of read counts of all peaks, i.e.,

Z_{i,j} = \frac{X_{i,j}}{\sum_{j=1}^{N} X_{i,j}}

where Z_{i,j} and X_{i,j} are the normalized and original peak signals for sample i
and peak j, and N is the total number of called peaks.
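A minimal NumPy version of this scaling, using a small made-up matrix of peak read counts with samples in rows and peaks in columns, is shown below.

```python
import numpy as np

# Invented read counts: rows are samples, columns are called peaks.
X = np.array([[120.0, 30.0, 50.0],
              [240.0, 80.0, 80.0]])

# Z[i, j] = X[i, j] / sum_j X[i, j]: each peak's signal becomes the fraction of
# that sample's total in-peak reads, removing library-size differences.
Z = X / X.sum(axis=1, keepdims=True)
print(Z)
```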
Normalization methods that were previously developed for microarray
data have also been adapted for ChIP-seq data. MAnorm2, as an example,
uses a hierarchical normalization process that is similar to the MA plot
approach used for microarray data. Based on the assumption that signals
from common peaks shared among all samples remain unchanged globally,
MAnorm2 applies a linear transformation to the raw read counts so that both
the arithmetic mean of the M values (differences in signals between samples)
and the covariance between the M and A values (A being the average signal
between two samples) become zero [30]. ChIPnorm uses a
modified version of quantile normalization [31]. A locally weighted regres-
sion (LOESS) normalization approach for ChIP-seq data [32] is similar to the
LOESS procedure applied to cDNA microarray data normalization. All these
approaches assume that the overall binding profile of the target protein does
not vary across different conditions. This assumption does not hold, however,
under conditions when the overall level or activity of the protein under study
changes due to experimental perturbation. Under such conditions, normal-
ization approaches based on spike-in of an exogenous reference epigenome
at a constant amount [33], similar to the use of spike-in RNA controls in bulk
RNA-seq, can be used.
Besides all the normalization approaches introduced above, good experi-
mental design and consistent experimental procedure can minimize data
variability in different samples and groups, thereby alleviating the burden
on subsequent normalization. For example, processing all samples side by side
using the same experimental procedure and parameters, such as the same
antibody, by the same operator, will minimize sample-to-sample variability.
When an experiment is conducted in this way, the simpler normalization
approach based on total library read count can be sufficient.
Since ChIP-seq-based quantitative analysis of differential binding is
similar to RNA-seq-based differential expression analysis, packages
such as edgeR and DESeq2 can be applied here. Table 13.2 lists some of the
packages that are designed for ChIP-seq differential binding analysis. As
listed, these packages can be divided into two categories: one composed
of methods that depend on peak calling from an external application,
and the other of methods that handle peak calling internally or do away with
peak calling altogether. For methods that require internally or externally
called peaks, robust peak calling is essential to produce quality results.
TABLE 13.2
Packages Developed for ChIP-Seq Differential Binding Analysis
Name Description Reference
For those that do not rely on peak calling, differential binding analysis aims to
find significant ChIP-seq signal changes between conditions throughout the
entire genome. These latter methods can be further divided into two subcat-
egories. One subcategory, exemplified by diffReps [28], csaw [34], and PePr
[35], uses sliding windows, the size of which is typically selected based on the
footprint of the target protein. The other subcategory uses complex segmentation
techniques, such as hidden Markov models (HMMs), to first segment the
genome into bins and then infer the hidden state of each bin in order to detect
differential protein-binding sites. ChIPDiff [36] and THOR [37] are examples
of this approach.
To test for differential binding, statistical models based on Poisson or nega-
tive binomial distribution are often used, again similar to RNA-seq DE ana-
lysis. In fact, methods such as DiffBind [38] and DBChip [39] directly inherit
statistical models from edgeR and DESeq2. In terms of detection targets,
while some methods are specifically designed for punctate-binding protein
factors (such as DBChip) and some others for more broad marks (such as
SICER2 and RSEG), most of the methods can be used for binding regions
of different sizes. In addition, in terms of handling replicate samples, some
can work with experiments that do not use replicate samples, while others
require replicates. As mentioned earlier, the use of replication is suggested;
it usually leads to increased detection precision and a much reduced
number of differential binding regions, but at the expense of
reduced sensitivity. To help select appropriate methods for a particular appli-
cation, existing benchmarking studies provide decision trees based on their
comparative testing results [40, 41]. It should also be noted that like those
devised for RNA-seq-based differential expression analysis, these packages
are designed based on certain assumptions and therefore the user needs to be
aware of these assumptions and ensure they are fulfilled prior to using them.
For example, MAnorm2 is based on the assumption that there is no global
change in binding at peak regions between conditions.
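For the peak-calling-dependent route, the input to edgeR- or DESeq2-style models is a peaks-by-samples matrix of read counts. The sketch below builds such a matrix with pysam over a small set of consensus peaks; the file names and coordinates are placeholders, pysam is assumed to be installed, and the BAM files are assumed to be coordinate-sorted and indexed.

```python
import pysam  # assumed available; BAM files must be coordinate-sorted and indexed

# Placeholder consensus peaks (chrom, start, end) and per-sample BAM files.
peaks = [("chr1", 1_000_000, 1_000_400), ("chr1", 2_500_000, 2_500_300)]
bams = {"ctrl_rep1": "ctrl_rep1.bam", "treat_rep1": "treat_rep1.bam"}

# Count reads overlapping each peak in each sample.
counts = {}
for sample, path in bams.items():
    with pysam.AlignmentFile(path, "rb") as bam:
        counts[sample] = [bam.count(chrom, start, end) for chrom, start, end in peaks]

# The resulting peaks-by-samples table can be exported and fed to a
# negative-binomial framework such as edgeR or DESeq2.
for i, (chrom, start, end) in enumerate(peaks):
    row = "\t".join(str(counts[s][i]) for s in bams)
    print(f"{chrom}:{start}-{end}\t{row}")
```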
13.5 Functional Analysis
Often the data gathered from a ChIP-seq study is used to understand gene
expression regulation and associated biological functions. To conduct func-
tional analysis, peaks are first assigned to nearby genes using tools such
as ChIPseeker [46], GREAT (Genomic Regions Enrichment of Annotations
Tool) [47], and ChIPpeakAnno [48]. While it is debatable on what genes a
peak should be assigned to, a straightforward approach is to assign it to the
closest gene transcription start site. Once peaks are assigned to target genes,
an integrated analysis of ChIP-seq and gene expression data (more on this
in Section 13.7) can be carried out. Furthermore, Gene Ontology, biological
pathway, gene network, or gene set enrichment analyses can be conducted
using similar approaches as described in Chapter 7. Prior to carrying out
these gene functional analyses, one should also bear in mind that the peak-to-
gene assignment process is biased by gene size, as the presence of peak(s) has
a positive correlation with the length of a gene. In addition, the distribution
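As a minimal illustration of the closest-TSS assignment mentioned above (performed with far richer annotation by tools such as ChIPseeker and GREAT), the Python sketch below matches peak summits to the nearest transcription start site on one chromosome; all coordinates and gene names are invented.

```python
import numpy as np

# Invented TSS annotation for one chromosome: (position, gene) pairs, sorted.
tss = sorted([(11_000, "GENE_A"), (54_000, "GENE_B"), (90_500, "GENE_C")])
tss_pos = np.array([pos for pos, _ in tss])

# Invented peak summits on the same chromosome.
summits = [10_200, 60_700, 91_000]

# For each summit, locate its insertion point in the sorted TSS positions and
# compare the flanking TSSs to pick the nearest one.
for summit, idx in zip(summits, np.searchsorted(tss_pos, summits)):
    candidates = [j for j in (idx - 1, idx) if 0 <= j < len(tss_pos)]
    nearest = min(candidates, key=lambda j: abs(int(tss_pos[j]) - summit))
    print(summit, "->", tss[nearest][1],
          "distance", abs(int(tss_pos[nearest]) - summit))
```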
13.6 Motif Analysis
One of the goals of ChIP-seq data analysis is to identify DNA-binding motifs
for the protein of interest. A DNA-binding motif is usually represented by a
consensus sequence, or more accurately, a position-specific frequency matrix.
Figure 13.9A shows an example of such a DNA-binding motif, the one bound
by the previously introduced transcription factor NRF2 (see Chapter 2). To iden-
tify motifs from ChIP-seq data, all peak sequences need to be assembled and
fed into multiple motif discovery tools. Some of the commonly used motif
discovery tools are Cistrome [50], Gibbs motif sampler (part of CisGenome),
HOMER (findMotifs module), MEME-ChIP [51] as part of the MEME suite
[52], rGADEM [53], and RSAT peak-motifs [54]. The motif discovery phase
usually ends up with one or more motifs, with one being the binding site of
the target protein and the others being those of its partners.
FIGURE 13.9
The consensus DNA-binding motif of the transcription factor NRF2. Panel (A) shows the currently
known NRF2-binding motif, while panel (B) displays the result of a de novo motif analysis using
NRF2 ChIP-seq data. (From BN Chorley, MR Campbell, X Wang, M Karaca, D Sambandan, F
Bangura, P Xue, J Pi, SR Kleeberger, DA Bell, Identification of novel NRF2-regulated genes by
ChIP-seq: influence on retinoid X receptor alpha, Nucleic Acids Research 2012, 40(15):7416–7429.
With permission.)
The discovered motif(s)
can be compared with currently known motifs catalogued in databases such
as JASPAR [55] to detect similarities, find relationships with other motifs,
or locate other proteins that might bind at or near the peak region as part
of a protein complex. Tools for motif comparison include STAMP [56] and
Tomtom [57]. Motif enrichment analysis can also be carried out to find out
if known motifs are enriched in the peak regions using tools such as AME
[58], CentriMo [59], and SEA [60]. Finally, motif scanning and mapping by
tools like FIMO [61] allow visualization of the discovered motif(s) in the
ChIP-seq peak areas. Some of these tools have been integrated into motif ana-
lysis pipelines, such as the MEME Suite [62], which includes MEME-ChIP,
Tomtom, CentriMo, SEA, AME, and FIMO.
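To make the matrix representation concrete, the Python sketch below scans a short DNA sequence with a toy position frequency matrix converted to log-odds scores, which is conceptually what scanners such as FIMO do (with proper background models and statistics). The matrix values and the sequence are invented.

```python
import numpy as np

BASES = "ACGT"

# Toy position frequency matrix (rows = A, C, G, T; columns = motif positions),
# loosely shaped like a short TGACTCA-style element; values are invented.
pfm = np.array([
    #  T     G     A     C     T     C     A
    [0.05, 0.05, 0.85, 0.05, 0.05, 0.05, 0.85],  # A
    [0.05, 0.05, 0.05, 0.85, 0.05, 0.85, 0.05],  # C
    [0.05, 0.85, 0.05, 0.05, 0.05, 0.05, 0.05],  # G
    [0.85, 0.05, 0.05, 0.05, 0.85, 0.05, 0.05],  # T
])
background = 0.25
pwm = np.log2(pfm / background)   # log-odds position weight matrix

def score(window):
    # Sum the log-odds contribution of each base at each motif position.
    return sum(pwm[BASES.index(base), i] for i, base in enumerate(window))

sequence = "GGCTTGACTCATTACGT"
motif_len = pwm.shape[1]
hits = [(i, score(sequence[i:i + motif_len]))
        for i in range(len(sequence) - motif_len + 1)]
best_pos, best_score = max(hits, key=lambda h: h[1])
print(f"Best match at position {best_pos}: "
      f"{sequence[best_pos:best_pos + motif_len]} (score {best_score:.2f})")
```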
References
1. Chen Y, Negre N, Li Q, Mieczkowska JO, Slattery M, Liu T, Zhang Y, Kim TK,
He HH, Zieba J et al. Systematic evaluation of factors influencing ChIP-seq
fidelity. Nat Methods 2012, 9(6):609–614.
2. Visa N, Jordan-Pla A. ChIP and ChIP-Related Techniques: Expanding the
fields of application and improving ChIP performance. Methods Mol Biol 2018,
1689:1–7.
3. Skene PJ, Henikoff JG, Henikoff S. Targeted in situ genome-wide profiling
with high efficiency for low cell numbers. Nat Protoc 2018, 13(5):1006–1019.
4. Meyer CA, Liu XS. Identifying and mitigating bias in next-generation sequen-
cing methods for chromatin biology. Nat Rev Genet 2014, 15(11):709–721.
5. Jordán-Pla A, Visa N. Considerations on experimental design and data ana-
lysis of chromatin immunoprecipitation experiments. Methods Mol Biol 2018,
1689:9–28.
6. Daley T, Smith AD. Predicting the molecular complexity of sequencing
libraries. Nat Methods 2013, 10(4):325–327.
7. ENCODE Software Tools (www.encodeproject.org/software/)
8. Irreproducible Discovery Rate (IDR) (www.encodeproject.org/softw
are/idr/)
9. Diaz A, Nellore A, Song JS. CHANCE: comprehensive software for quality
control and validation of ChIP-seq data. Genome Biol 2012, 13(10):R98.
10. Ramirez F, Ryan DP, Gruning B, Bhardwaj V, Kilpert F, Richter AS, Heyne
S, Dundar F, Manke T. deepTools2: a next generation web server for deep-
sequencing data analysis. Nucleic Acids Res 2016, 44(W1):W160–165.
11. Nakato R, Shirahige K. Sensitive and robust assessment of ChIP-seq read dis-
tribution using a strand-shift profile. Bioinformatics 2018, 34(14):2356–2363.
12. Landt SG, Marinov GK, Kundaje A, Kheradpour P, Pauli F, Batzoglou S,
Bernstein BE, Bickel P, Brown JB, Cayting P et al. ChIP-seq guidelines and
practices of the ENCODE and modENCODE consortia. Genome Res 2012,
22(9):1813–1831.
13. Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nusbaum
C, Myers RM, Brown M, Li W et al. Model-based analysis of ChIP-Seq (MACS).
Genome Biol 2008, 9(9):R137.
14. Kharchenko PV, Tolstorukov MY, Park PJ. Design and analysis of
ChIP-seq experiments for DNA-binding proteins. Nat Biotechnol 2008,
26(12):1351–1359.
15. Rozowsky J, Euskirchen G, Auerbach RK, Zhang ZD, Gibson T, Bjornson R,
Carriero N, Snyder M, Gerstein MB. PeakSeq enables systematic scoring of
ChIP-seq experiments relative to controls. Nat Biotechnol 2009, 27(1):66–75.
16. Heinz S, Benner C, Spann N, Bertolino E, Lin YC, Laslo P, Cheng JX, Murre
C, Singh H, Glass CK. Simple combinations of lineage-determining transcrip-
tion factors prime cis-regulatory elements required for macrophage and B cell
identities. Mol Cell 2010, 38(4):576–589.
17. Zang C, Schones DE, Zeng C, Cui K, Zhao K, Peng W. A clustering approach
for identification of enriched domains from histone modification ChIP-Seq
data. Bioinformatics 2009, 25(15):1952–1958.
18. Ibrahim MM, Lacadie SA, Ohler U. JAMM: a peak finder for joint analysis of
NGS replicates. Bioinformatics 2015, 31(1):48–55.
19. Oh D, Strattan JS, Hur JK, Bento J, Urban AE, Song G, Cherry JM. CNN-
Peaks: ChIP-Seq peak detection pipeline using convolutional neural networks
that imitate human visual inspection. Sci Rep 2020, 10(1):7933.
20. Hentges LD, Sergeant MJ, Cole CB, Downes DJ, Hughes JR, Taylor S.
LanceOtron: a deep learning peak caller for genome sequencing experiments.
Bioinformatics 2022, 38(18):4255–4263.
21. Ji H, Jiang H, Ma W, Johnson DS, Myers RM, Wong WH. An integrated soft-
ware system for analyzing ChIP-chip and ChIP-seq data. Nat Biotechnol 2008,
26(11):1293–1300.
22. Jothi R, Cuddapah S, Barski A, Cui K, Zhao K. Genome-wide identification of
in vivo protein-DNA binding sites from ChIP-Seq data. Nucleic Acids Res 2008,
36(16):5221–5231.
23. Feng X, Grossman R, Stein L. PeakRanger: a cloud-enabled peak caller for
ChIP-seq data. BMC Bioinformatics 2011, 12:139.
24. Rashid NU, Giresi PG, Ibrahim JG, Sun W, Lieb JD. ZINBA integrates local
covariates with DNA-seq data to identify broad and narrow regions of enrich-
ment, even within amplified genomic regions. Genome Biol 2011, 12(7):R67.
25. Carroll TS, Liang Z, Salama R, Stark R, de Santiago I. Impact of artifact removal
on ChIP quality metrics in ChIP-seq and ChIP-exo data. Front Genet 2014, 5:75.
26. Amemiya HM, Kundaje A, Boyle AP. The ENCODE Blacklist: Identification of
Problematic Regions of the Genome. Sci Rep 2019, 9(1):9354.
27. Chen X, Xu H, Yuan P, Fang F, Huss M, Vega VB, Wong E, Orlov YL, Zhang W,
Jiang J et al. Integration of external signaling pathways with the core transcrip-
tional network in embryonic stem cells. Cell 2008, 133(6):1106–1117.
28. Shen L, Shao NY, Liu X, Maze I, Feng J, Nestler EJ. diffReps: detecting dif-
ferential chromatin modification sites from ChIP-seq data with biological
replicates. PLoS One 2013, 8(6):e65598.
29. Manser P, Reimers M. A simple scaling normalization for comparing ChIP-Seq
samples. PeerJ PrePrints 2014, 1.
30. Tu S, Li M, Chen H, Tan F, Xu J, Waxman DJ, Zhang Y, Shao Z. MAnorm2
for quantitatively comparing groups of ChIP-seq samples. Genome Res 2021,
31(1):131–145.
31. Nair NU, Sahu AD, Bucher P, Moret BM. ChIPnorm: a statistical method for
normalizing and identifying differential regions in histone modification ChIP-
seq libraries. PLoS One 2012, 7(8):e39573.
32. Taslim C, Wu J, Yan P, Singer G, Parvin J, Huang T, Lin S, Huang K. Comparative
study on ChIP-seq data: normalization and binding pattern characterization.
Bioinformatics 2009, 25(18):2334–2340.
33. Orlando DA, Chen MW, Brown VE, Solanki S, Choi YJ, Olson ER, Fritz CC,
Bradner JE, Guenther MG. Quantitative ChIP-Seq normalization reveals
global modulation of the epigenome. Cell Rep 2014, 9(3):1163–1170.
34. Lun AT, Smyth GK. csaw: a Bioconductor package for differential binding ana-
lysis of ChIP-seq data using sliding windows. Nucleic Acids Res 2016, 44(5):e45.
35. Zhang Y, Lin YH, Johnson TD, Rozek LS, Sartor MA. PePr: a peak-calling pri-
oritization pipeline to identify consistent or differential peaks from replicated
ChIP-Seq data. Bioinformatics 2014, 30(18):2568–2575.
36. Xu H, Wei CL, Lin F, Sung WK. An HMM approach to genome-wide iden-
tification of differential histone modification sites from ChIP- seq data.
Bioinformatics 2008, 24(20):2344–2349.
37. Allhoff M, Sere K, J FP, Zenke M, I GC. Differential peak calling of ChIP-seq
signals with replicates with THOR. Nucleic Acids Res 2016, 44(20):e153.
38. Stark R, Brown G. DiffBind: differential binding analysis of ChIP-Seq peak
data. In R package version 2011, 100.
39. Liang K, Keles S. Detecting differential binding of transcription factors with
ChIP-seq. Bioinformatics 2012, 28(1):121–122.
40. Steinhauser S, Kurzawa N, Eils R, Herrmann C. A comprehensive comparison
of tools for differential ChIP-seq analysis. Brief Bioinform 2016, 17(6):953–966.
41. Eder T, Grebien F. Comprehensive assessment of differential ChIP-seq tools
guides optimal algorithm selection. Genome Biol 2022, 23(1):119.
42. Chen L, Wang C, Qin ZS, Wu H. A novel statistical method for quantitative
comparison of multiple ChIP-seq datasets. Bioinformatics 2015.
43. Taslim C, Huang T, Lin S. DIME: R-package for identifying differential
ChIP-seq based on an ensemble of mixture models. Bioinformatics 2011,
27(11):1569–1570.
44. Schweikert G, Kuo D. MMDiff2: statistical testing for ChIP-Seq data sets. R
package version 1.24.0, 2022.
45. Song Q, Smith AD. Identifying dispersed epigenomic domains from ChIP-Seq
data. Bioinformatics 2011, 27(6):870–871.
46. Yu G, Wang LG, He QY. ChIPseeker: an R/Bioconductor package for
ChIP peak annotation, comparison and visualization. Bioinformatics 2015,
31(14):2382–2383.
47. McLean CY, Bristor D, Hiller M, Clarke SL, Schaar BT, Lowe CB, Wenger
AM, Bejerano G. GREAT improves functional interpretation of cis-regulatory
regions. Nat Biotechnol 2010, 28(5):495–501.
48. Zhu LJ, Gazin C, Lawson ND, Pages H, Lin SM, Lapointe DS, Green MR.
ChIPpeakAnno: a Bioconductor package to annotate ChIP-seq and ChIP-chip
data. BMC Bioinformatics 2010, 11:237.
49. Welch RP, Lee C, Imbriano PM, Patil S, Weymouth TE, Smith RA, Scott LJ,
Sartor MA. ChIP-Enrich: gene set enrichment testing for ChIP-seq data.
Nucleic Acids Res 2014.
50. Liu T, Ortiz JA, Taing L, Meyer CA, Lee B, Zhang Y, Shin H, Wong SS, Ma J, Lei
Y et al. Cistrome: an integrative platform for transcriptional regulation studies.
Genome Biol 2011, 12(8):R83.
51. Machanick P, Bailey TL. MEME-ChIP: motif analysis of large DNA datasets.
Bioinformatics 2011, 27(12):1696–1697.
52. Bailey TL, Johnson J, Grant CE, Noble WS. The MEME Suite. Nucleic Acids Res
2015, 43(W1):W39–49.
53. Droit A, Gottardo R, Robertson G, Li L. rGADEM: de novo motif discovery.
R package version 2.44.0, 2022.
54. Thomas-Chollier M, Herrmann C, Defrance M, Sand O, Thieffry D, van
Helden J. RSAT peak-motifs: motif analysis in full-size ChIP-seq datasets.
Nucleic Acids Res 2012, 40(4):e31.
55. Castro-Mondragon JA, Riudavets-Puig R, Rauluseviciute I, Lemma RB,
Turchi L, Blanc-Mathieu R, Lucas J, Boddie P, Khan A, Manosalva Perez N
et al. JASPAR 2022: the 9th release of the open-access database of transcription
factor binding profiles. Nucleic Acids Res 2022, 50(D1):D165–D173.
56. Mahony S, Benos PV. STAMP: a web tool for exploring DNA-binding motif
similarities. Nucleic Acids Res 2007, 35(Web Server issue):W253–258.
57. Gupta S, Stamatoyannopoulos JA, Bailey TL, Noble WS. Quantifying simi-
larity between motifs. Genome Biol 2007, 8(2):R24.
58. McLeay RC, Bailey TL. Motif Enrichment Analysis: a unified framework and
an evaluation on ChIP data. BMC Bioinformatics 2010, 11:165.
59. Bailey TL, Machanick P. Inferring direct DNA binding from ChIP-seq. Nucleic
Acids Res 2012, 40(17):e128.
60. Bailey TL, Grant CE. SEA: Simple Enrichment Analysis of motifs. bioRxiv 2021,
doi: https://doi.org/10.1101/2021.08.23.457422.
61. Grant CE, Bailey TL, Noble WS. FIMO: scanning for occurrences of a given
motif. Bioinformatics 2011, 27(7):1017–1018.
62. The MEME Suite (http://meme.nbcr.net/meme/)
63. Ernst J, Kellis M. Discovery and characterization of chromatin states for sys-
tematic annotation of the human genome. Nat Biotechnol 2010, 28(8):817–825.
64. Klein HU, Schafer M, Porse BT, Hasemann MS, Ickstadt K, Dugas M.
Integrative analysis of histone ChIP-seq and transcription data using Bayesian
mixture models. Bioinformatics 2014, 30(8):1154–1162.
65. Schafer M, Klein HU, Schwender H. Integrative analysis of multiple gen-
omic variables using a hierarchical Bayesian model. Bioinformatics 2017,
33(20):3220–3227.
66. Wang S, Sun H, Ma J, Zang C, Wang C, Wang J, Tang Q, Meyer CA, Zhang
Y, Liu XS. Target analysis by integration of transcriptome and ChIP-seq data
with BETA. Nature protocols 2013, 8(12):2502–2515.
67. Shin H, Liu T, Manrai AK, Liu XS. CEAS: cis-regulatory element annotation
system. Bioinformatics 2009, 25(19):2605–2606.
14
Epigenomics by DNA Methylation
Sequencing
FIGURE 14.1
Major steps of bisulfite sequencing: 1) denaturation, 2) bisulfite treatment, and 3) PCR
amplification. Prior to bisulfite treatment, the two strands of DNA are first separated by
denaturation. The bisulfite treatment then converts unmethylated, but not methylated, cytosines
to uracils. The two strands from the treatment, BSW and BSC, are then subjected to PCR
amplification. This leads to the generation of four strands (BSW, BSWR, BSC, and BSCR), all of
which are distinct from the original Watson and Crick strands. (From Xi Y. and Li W. (2009)
BSMAP: whole genome bisulfite sequence MAPping program. BMC Bioinformatics, 10, 232. Used
under the terms of the Creative Commons Attribution License
(http://creativecommons.org/licenses/by/2.0). © 2009 Xi and Li.)
and that sequencing at higher depth may not be as cost effective as adding
more biological replicates in reaching higher detection power [3].
approaches such as ligation capture [6, 7], bisulfite padlock probes [8], or
liquid hybridization capture [9, 10].
14.1.3 Enrichment-Based Methyl-Seq
Different from the above bisulfite conversion-based methods, the methylated
DNA enrichment strategy captures methylated DNA for targeted sequencing.
One of the methods based on this strategy is MeDIP-seq, or methylated DNA
immunoprecipitation coupled with NGS. In this method, antibodies against
5mC are used to precipitate methylated single-stranded DNA fragments for
sequencing. Another commonly used method is MBD-seq, or methyl-CpG-
binding domain capture (MBDCap) followed by NGS. MBD-seq utilizes
proteins such as MBD2 or MECP2 that contain the methyl-CpG-binding
domain to enrich for methylated double-stranded DNA fragments. In one
type of MBDCap method called MIRA (Methylated-CpG Island Recovery
Assay), a protein complex of MBD2 and MBD3L1 (methyl-CpG-binding
domain protein 3-like-1) is used to achieve enhanced affinity to methylated
CpG regions. While MeDIP-seq and MBD-seq usually generate highly con-
cordant results, there are some differences between these two approaches.
MeDIP-seq can detect both CpG and non-CpG methylation, while MBD-seq
is focused on methylated CpG sites because of the binding affinity of MBD.
At methylated CpG sites, MeDIP tends to enrich at regions that have low
CpG density, while MBD-seq favors regions of relatively higher CpG content
[15, 16].
FIGURE 14.2
The chemical structures of cytosine, 5-methylcytosine (5mC), and 5-hydroxymethylcytosine
(5hmC).
14.2.2 Read Mapping
In order to identify methylated DNA sites, sequencing reads derived from
bisulfite or enzymatic conversion, or methylated DNA enrichment, need to
be first mapped to the reference genome. Mapping of reads generated from
the enrichment-based methods is rather straightforward, and like mapping
ChIP-seq reads, is usually conducted with general aligners, such as Bowtie,
BWA, or SOAP. Mapping of bisulfite or enzymatic conversion-based methyl-
seq reads, however, is less straightforward. This is because through the
FIGURE 14.3
Major steps of chemical or enzymatic conversion-based DNA methyl-seq data analysis: read
mapping, post-mapping QC, methylation/demethylation level quantification, visualization, and
differential methylation analysis.
TABLE 14.1
Read Mapping Tools for Chemical or Enzymatic Conversion-Based DNA Methylation
Sequencing
Three-Letter Aligners
Bismark: Deploys Bowtie 2 (or HISAT2) for alignment, and performs cytosine methylation
calling [34]
bwa-meth: Wraps BWA-MEM, provides local alignment for speed and accuracy even without
trimming [35]
BS-Seeker2/BS-Seeker3: Incorporates major aligners, such as Bowtie 2, to achieve gapped local
alignment. BS-Seeker3 further improves speed and accuracy, and offers post-alignment
analysis including QC and visualization [36, 37]
BSmooth: Uses Bowtie 2 (or Merman) for alignment. Also provides QC, smoothing-based
methylation quantification, and differential methylation detection [28]
Wild-Card Aligners
BSMAP/RRBSMAP: Combines hash table seeding incorporated in SOAP, and bitwise masking,
to achieve speed and accuracy [30]
GSNAP: Employs hash tables built for plus and minus strands using C-T/G-A substitutions [32]
Last: Builds on the traditional alignment strategy of seed-and-extend (like Blast), but with use
of adaptive seeds [31]
BatMeth2: Performs indel-sensitive mapping, DNA methylation quantification, differential
methylation detection, annotation, and visualization [41]
C(s). The masked bisulfite reads are then mapped again to the reference
genome.
Aligners such as Bismark [34], bwa-meth [35], and BS-Seeker2/BS-Seeker3
[36, 37] use the other (three-letter) approach. One advantage of this approach
is that fast mapping algorithms such as BWA-MEM or bowtie2 can be used.
For example, Bismark carries out alignment by first converting Cs in the
reads into Ts, and Gs into As (equivalent of the C-to-T conversion on the com-
plementary strand) (Figure 14.4). This conversion process is also performed
on the reference genome. The converted reads are then aligned, using Bowtie
or Bowtie2, to the converted reference genome in four parallel processes (also
refer to Figure 14.1), out of which a unique best alignment is determined
[alignment (1) in Figure 14.4]. Among the above wild-card and three-letter
methods, benchmark studies [38–40] found that bwa-meth, Bismark, and
BSMAP offer a good combination of accuracy, speed, and genomic region
coverage.
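The in-silico conversion at the core of the three-letter approach is simple string replacement, as in the Python sketch below for one invented read and reference fragment. Real aligners such as Bismark additionally track which converted genome and strand each alignment comes from so that methylation can be called afterwards.

```python
# Three-letter conversion used before alignment (a toy illustration): reads have
# their remaining Cs converted to Ts (and, for the complementary strand, Gs
# converted to As), and the same conversions are applied to the reference
# genome before running a standard aligner such as Bowtie 2.

reference = "TACGCGATCCGATACGT"        # invented reference fragment
bisulfite_read = "TATGCGATTCGATACGT"   # invented read: unmethylated Cs read as Ts

ref_c2t = reference.replace("C", "T")  # C-to-T converted reference
ref_g2a = reference.replace("G", "A")  # G-to-A converted reference (other strand)
read_c2t = bisulfite_read.replace("C", "T")

# A perfect match of the converted read against the converted reference mimics
# one of the parallel alignments an aligner like Bismark performs.
print(read_c2t == ref_c2t)             # True: the read aligns to this strand

# After alignment, comparing the original read to the original reference at C
# positions distinguishes methylated Cs (still C) from converted Cs (now T).
for ref_base, read_base in zip(reference, bisulfite_read):
    if ref_base == "C":
        status = "methylated" if read_base == "C" else "unmethylated"
        print(ref_base, "->", read_base, status)
```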
FIGURE 14.4
Genomic fragment sequence after bisulfite treatment, read conversion, and alignment to the
bisulfite-converted genomes (the forward strand C-to-T converted genome and the forward
strand G-to-A converted genome).
14.2.4 Visualization
Visualizing DNA methylation data serves at least two purposes. Firstly, the distribution pattern of DNA methylation may be discerned through visualization.
Secondly, visual examination of known DNA methylation regions and other
randomly selected regions also offers data validation and a quick estimate of
data quality. One method to visualize DNA methyl-seq data and associated
information, such as depth of coverage, is through the use of bedGraph files.
This standard format (Figure 14.5), compatible with most genome browsers
and tools including the Washington University EpiGenome Browser [50],
can be directly generated from many of the methylation quantification tools
such as Bismark and methylKit. Figure 14.6 shows an example of displaying
methylation level along with read depth in the genome.
Alternatively, DNA methylation quantification results can be saved in tab-delimited files and then converted to the bigBed or bigWig formats [51]. Both formats are compatible with most genome browsers and enable visualization of the methylation data.
track type=bedGraph
chr19 45408804 45408805 1.0
chr19 45408806 45408807 0.75
chr19 45408854 45408855 0.3
chr19 45408855 45408856 0.5
FIGURE 14.5
An example of the bedGraph file format. It includes a track definition line (the first line), followed by track data lines in a four-column format, i.e., chromosome, chromosome start position, chromosome end position, and data value.
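As a simple illustration of producing such a track, the following Python sketch writes per-CpG methylation levels to a bedGraph file; the coordinates, values, and output file name are hypothetical, and in practice such files are generated by quantification tools such as Bismark or methylKit.

# Minimal sketch: write per-CpG methylation levels (0-1) as a bedGraph track.
# The coordinates and values below are made up for illustration.

cpg_methylation = {
    ("chr19", 45408804): 1.0,
    ("chr19", 45408806): 0.75,
    ("chr19", 45408854): 0.3,
}

with open("methylation.bedGraph", "w") as out:
    out.write("track type=bedGraph\n")
    for (chrom, start), level in sorted(cpg_methylation.items()):
        # bedGraph uses 0-based, half-open intervals: (start, start + 1) covers one base.
        out.write(f"{chrom}\t{start}\t{start + 1}\t{level}\n")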
TABLE 14.2
Tools for Detection of Differentially Methylated Cytosines/Regions

Fisher's exact test: Applies the classical Fisher's exact test directly for DMC/DMR recognition (N/A)
methylKit: Uses logistic regression for groups with replicates, and Fisher's exact test if without replicates [45]
methylSig: Identifies DMCs/DMRs using likelihood ratio estimation based on a beta-binomial distribution model [61]
RnBeads: Combines statistical testing p-values, and priority ranking based on absolute and relative effect size [53, 54]
RADMeth: Uses a log-likelihood ratio test based on a beta-binomial regression model for differential methylation testing [62]
DMRFinder: Applies Wald and empirical Bayes tests for differential methylation detection based on beta-binomial modelling [63]
Metilene: A nonparametric method based on segmentation of the genome using a circular binary segmentation algorithm [64]
DSS: Uses a Bayesian hierarchical model to allow information sharing across different CpG sites, and a Wald test for DMC detection [65]
Many of these tools are replicate-aware and provide estimates of biological variation, thereby increasing detection power. For multiple testing correction, FDR is most often used, while other methods have also been reported, such as the sliding linear model (SLIM) method used by methylKit. Among the currently available DMC/DMR detection tools, benchmarking studies show that methylKit, Fisher's exact test, methylSig, and DMRFinder are among the top performers [58, 59].
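As a minimal illustration of the Fisher's exact test approach with FDR correction, the following Python sketch tests a few CpG sites; it assumes SciPy and statsmodels are available, uses made-up read counts, and does not model replicates or dispersion the way dedicated tools do.

# Sketch of DMC testing with Fisher's exact test plus Benjamini-Hochberg FDR
# correction (illustrative only). All counts below are made up.
from scipy.stats import fisher_exact
from statsmodels.stats.multitest import multipletests

# For each CpG site: (methylated, unmethylated) read counts in conditions A and B.
sites = {
    "chr19:45408804": ((18, 2), (9, 11)),
    "chr19:45408806": ((15, 5), (14, 6)),
    "chr19:45408854": ((3, 17), (12, 8)),
}

pvalues = []
for site, ((meth_a, unmeth_a), (meth_b, unmeth_b)) in sites.items():
    table = [[meth_a, unmeth_a], [meth_b, unmeth_b]]
    _, p = fisher_exact(table)
    pvalues.append(p)

# Benjamini-Hochberg correction across all tested sites.
reject, qvalues, _, _ = multipletests(pvalues, alpha=0.05, method="fdr_bh")
for (site, _), p, q, sig in zip(sites.items(), pvalues, qvalues, reject):
    print(f"{site}\tp={p:.3g}\tq={q:.3g}\tDMC={sig}")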
Data obtained from methylated DNA enrichment-based approaches follow a negative binomial distribution, like ChIP-seq and RNA-seq count data. They can therefore be analyzed to identify DMRs using algorithms developed for RNA-seq-based differential expression analysis; for example, tools such as edgeR and DESeq can be used directly. Some DNA methylation analysis tools, such as Repitools [60], call edgeR directly.
References
1. Seiler Vellame D, Castanho I, Dahir A, Mill J, Hannon E. Characterizing the
properties of bisulfite sequencing data: maximizing power and sensitivity to
identify between-group differences in DNA methylation. BMC Genomics 2021,
22(1):446.
2. Standards and Guidelines for Whole Genome Shotgun Bisulfite Sequencing
(www.roadmapepigenomics.org/protocols)
3. Ziller MJ, Hansen KD, Meissner A, Aryee MJ. Coverage recommendations
for methylation analysis by whole-genome bisulfite sequencing. Nat Methods
2015, 12(3):230–232.
4. Meissner A, Gnirke A, Bell GW, Ramsahoye B, Lander ES, Jaenisch R. Reduced
representation bisulfite sequencing for comparative high- resolution DNA
methylation analysis. Nucleic Acids Res 2005, 33(18):5868–5877.
5. Sun Z, Cunningham J, Slager S, Kocher JP. Base resolution methylome pro-
filing: considerations in platform selection, data preprocessing and analysis.
Epigenomics 2015, 7(5):813–828.
6. Nautiyal S, Carlton VE, Lu Y, Ireland JS, Flaucher D, Moorhead M, Gray JW,
Spellman P, Mindrinos M, Berg P et al. High-throughput method for analyzing
methylation of CpGs in targeted genomic regions. Proc Natl Acad Sci U S A
2010, 107(28):12587–12592.
7. Varley KE, Mitra RD. Bisulfite Patch PCR enables multiplexed sequen-
cing of promoter methylation across cancer samples. Genome Res 2010,
20(9):1279–1287.
8. Deng J, Shoemaker R, Xie B, Gore A, LeProust EM, Antosiewicz-Bourget J,
Egli D, Maherali N, Park IH, Yu J et al. Targeted bisulfite sequencing reveals
changes in DNA methylation associated with nuclear reprogramming. Nat
Biotechnol 2009, 27(4):353–360.
22. Flusberg BA, Webster DR, Lee JH, Travers KJ, Olivares EC, Clark TA, Korlach
J, Turner SW. Direct detection of DNA methylation during single-molecule,
real-time sequencing. Nat Methods 2010, 7(6):461–465.
23. Laszlo AH, Derrington IM, Brinkerhoff H, Langford KW, Nova IC, Samson
JM, Bartlett JJ, Pavlenok M, Gundlach JH. Detection and mapping of 5-
methylcytosine and 5-hydroxymethylcytosine with nanopore MspA. Proc Natl
Acad Sci U S A 2013, 110(47):18904–18909.
24. Schreiber J, Wescoe ZL, Abu-Shumays R, Vivian JT, Baatar B, Karplus K, Akeson
M. Error rates for nanopore discrimination among cytosine, methylcytosine,
and hydroxymethylcytosine along individual DNA strands. Proc Natl Acad Sci
U S A 2013, 110(47):18910–18915.
25. Tse OYO, Jiang P, Cheng SH, Peng W, Shang H, Wong J, Chan SL, Poon LCY,
Leung TY, Chan KCA et al. Genome- wide detection of cytosine methyla-
tion by single molecule real-time sequencing. Proc Natl Acad Sci U S A 2021,
118(5):e2019768118.
26. Rand AC, Jain M, Eizenga JM, Musselman-Brown A, Olsen HE, Akeson M,
Paten B. Mapping DNA methylation with high-throughput nanopore sequen-
cing. Nat Methods 2017, 14(4):411–413.
27. Trim Galore! (www.bioinformatics.babraham.ac.uk/projects/trim_galore/)
28. Hansen KD, Langmead B, Irizarry RA. BSmooth: from whole genome bisulfite
sequencing reads to differentially methylated regions. Genome Biol 2012,
13(10):R83.
29. Liang F, Tang B, Wang Y, Wang J, Yu C, Chen X, Zhu J, Yan J, Zhao W, Li
R. WBSA: web service for bisulfite sequencing data analysis. PLoS One 2014,
9(1):e86707.
30. Xi Y, Li W. BSMAP: whole genome bisulfite sequence MAPping program.
BMC Bioinformatics 2009, 10:232.
31. Frith MC, Mori R, Asai K. A mostly traditional approach improves alignment
of bisulfite-converted DNA. Nucleic Acids Res 2012, 40(13):e100.
32. Wu TD, Nacu S. Fast and SNP-tolerant detection of complex variants and spli-
cing in short reads. Bioinformatics 2010, 26(7):873–881.
33. Xi Y, Bock C, Muller F, Sun D, Meissner A, Li W. RRBSMAP: a fast, accurate
and user-friendly alignment tool for reduced representation bisulfite sequen-
cing. Bioinformatics 2012, 28(3):430–432.
34. Krueger F, Andrews SR. Bismark: a flexible aligner and methylation caller for
Bisulfite-Seq applications. Bioinformatics 2011, 27(11):1571–1572.
35. Pedersen BS, Eyring K, De S, Yang IV, Schwartz DA. Fast and accurate
alignment of long bisulfite-seq reads. arXiv preprint arXiv:14011129 2014.
36. Huang KYY, Huang YJ, Chen PY. BS-Seeker3: ultrafast pipeline for bisulfite
sequencing. BMC Bioinformatics 2018, 19(1):111.
37. Guo W, Fiziev P, Yan W, Cokus S, Sun X, Zhang MQ, Chen PY, Pellegrini M.
BS-Seeker2: a versatile aligning pipeline for bisulfite sequencing data. BMC
Genomics 2013, 14 :774.
38. Kunde-Ramamoorthy G, Coarfa C, Laritsky E, Kessler NJ, Harris RA, Xu M,
Chen R, Shen L, Milosavljevic A, Waterland RA. Comparison and quantitative
verification of mapping algorithms for whole-genome bisulfite sequencing.
Nucleic Acids Res 2014, 42(6):e43.
58. Piao Y, Xu W, Park KH, Ryu KH, Xiang R. Comprehensive evaluation of dif-
ferential methylation analysis methods for bisulfite sequencing data. Int J
Environ Res Public Health 2021, 18(15):7975.
59. Liu Y, Han Y, Zhou L, Pan X, Sun X, Liu Y, Liang M, Qin J, Lu Y, Liu P. A com-
prehensive evaluation of computational tools to identify differential methyla-
tion regions using RRBS data. Genomics 2020, 112(6):4567–4576.
60. Statham AL, Strbenac D, Coolen MW, Stirzaker C, Clark SJ, Robinson MD.
Repitools: an R package for the analysis of enrichment-based epigenomic
data. Bioinformatics 2010, 26(13):1662–1663.
61. Park Y, Figueroa ME, Rozek LS, Sartor MA. MethylSig: a whole genome DNA
methylation analysis pipeline. Bioinformatics 2014, 30(17):2414–2422.
62. Dolzhenko E, Smith AD. Using beta-binomial regression for high-precision
differential methylation analysis in multifactor whole- genome bisulfite
sequencing experiments. BMC Bioinformatics 2014, 15:215.
63. Gaspar JM, Hart RP. DMRfinder: efficiently identifying differentially
methylated regions from MethylC-seq data. BMC Bioinformatics 2017, 18(1):528.
64. Juhling F, Kretzmer H, Bernhart SH, Otto C, Stadler PF, Hoffmann S.
metilene: fast and sensitive calling of differentially methylated regions from
bisulfite sequencing data. Genome Res 2016, 26(2):256–262.
65. Feng H, Conneely KN, Wu H. A Bayesian hierarchical model to detect differ-
entially methylated loci from single nucleotide resolution sequencing data.
Nucleic Acids Res 2014, 42(8):e69.
66. Halachev K, Bast H, Albrecht F, Lengauer T, Bock C. EpiExplorer: live explor-
ation and global analysis of large epigenomic datasets. Genome Biol 2012,
13(10):R96.
67. McLean CY, Bristor D, Hiller M, Clarke SL, Schaar BT, Lowe CB, Wenger
AM, Bejerano G. GREAT improves functional interpretation of cis-regulatory
regions. Nat Biotechnol 2010, 28(5):495–501.
15
Whole Metagenome Sequencing for
Microbial Community Analysis
15.2 Sequencing Approaches
There are several key factors that need to be considered before sequencing starts, including sequencing depth, read length, and sequencing platform. The required depth of sequencing depends on the species richness and abundances in the samples, as well as the goal to be pursued. For example, a study that attempts to locate rare members in a highly diverse microbial community requires deeper sequencing than one that is focused only on the more abundant members of a less diverse environment. With regard to read length, longer reads are preferable to shorter reads in metagenomics for resolving the inherent sequence complexity. The read length from most current short-read sequencers can reach 150 bp from each end in paired-end sequencing.
FIGURE 15.1
Major steps of metagenome analysis: sample processing and sequencing, followed by either an assembly-free route (read binning, and read mapping to currently known gene sequences) or an assembly-dependent route (metagenome assembly, contig binning, and calling of genes/genomic elements).
some step(s) omitted. Since the publication of the first edition of this book, there has been a rapid increase in the number of tools available for metagenome analysis. Some of the currently available tools, such as those required for taxonomic profiling based on searches against multiple databases, require considerable computing resources and power.
For long reads, such as those generated from the PacBio and ONT platforms, assemblers such as metaFlye [12], Raven [13], Canu [14], and Hifiasm-meta [15] can be used. For
short reads, more assemblers are available, including SPAdes/metaSPAdes
[16, 17], MEGAHIT [18], IDBA-UD [19], MetaVelvet/MetaVelvet-SL [20, 21],
and Ray Meta [22]. Similar to single-genome assemblers, many of these short
read metagenome assemblers, such as metaSPAdes, MEGAHIT, IDBA-UD,
and Ray Meta, are based on the de Bruijn graph approach (see Chapter 12).
In addition, these methods use multi-k-mer sizes, instead of a fixed k-mer
size, in order to improve assemblies. The difference from the single-genome
assemblers, though, is that they attempt to identify subgraphs within a
mixed de Bruijn graph, each of which is expected to represent an individual
genome. For example, metaSPAdes first builds a large de Bruijn graph from
all metagenomic reads using SPAdes and then transforms it into an assembly
graph. Within the assembly graph, subgraphs that contain alternative paths
are identified, corresponding to large fragments from individual genomes.
Besides these assemblers that are designed for either long or short reads, other assemblers combine long and short reads in an effort to increase assembly quality. These hybrid assemblers include MaSuRCA [23], hybridSPAdes [24],
and OPERA-MS [25].
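To make the de Bruijn graph idea discussed above concrete, the following minimal Python sketch builds a toy graph from a few made-up reads at a single k-mer size; real metagenome assemblers additionally use multiple k-mer sizes, error correction, coverage information, and graph simplification.

# Toy de Bruijn graph construction for a single k-mer size (illustration only;
# assemblers such as metaSPAdes or MEGAHIT add multi-k iteration, error
# correction, and graph simplification on top of this basic structure).
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Return a mapping from each (k-1)-mer prefix to the set of (k-1)-mer suffixes."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

if __name__ == "__main__":
    reads = ["ACGTACGT", "CGTACGTT", "GTACGTTA"]  # made-up reads
    for node, neighbors in de_bruijn_graph(reads, k=4).items():
        print(node, "->", sorted(neighbors))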
After the assembly process, a metagenome assembly usually consists mostly of contigs of various sizes. To evaluate the assembly quality, traditional evalu-
ation metrics, such as N50, are not as informative and representative as in
evaluating single-genome assemblies. Instead, aggregate statistics such as
the total number of contigs, the percentage of reads mapped to them, and
the maximum, median, and average lengths of the contigs are often used.
Further inspection of the assembly quality includes looking for chimeric or
mis-assemblies. There are currently a number of tools available to assess MAG
quality, including CheckM [26], MetaQUAST [27], and BUSCO [28], all of
which rely on reference genomes. Reference-free tools include DeepMAsED
[29] and ALE (Assembly Likelihood Evaluation) [30].
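The aggregate statistics mentioned above are straightforward to compute; the following Python sketch derives them (plus N50, for comparison) from a list of contig lengths, which here are made up rather than read from an actual assembly.

# Sketch of aggregate assembly statistics computed from contig lengths.
# The lengths below are made up; in practice they would come from the
# assembler's FASTA output.
from statistics import mean, median

def n50(lengths):
    """Smallest contig length such that contigs of at least that length cover half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= total / 2:
            return length

contig_lengths = [120_500, 88_200, 45_100, 20_050, 9_800, 2_300, 1_150]

print("number of contigs:", len(contig_lengths))
print("max length:       ", max(contig_lengths))
print("median length:    ", median(contig_lengths))
print("mean length:      ", round(mean(contig_lengths)))
print("N50:              ", n50(contig_lengths))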
After contig assembly, if paired reads are available, metagenome scaffolds
can be built from the contigs. Many of the metagenome assemblers have
a module to carry out scaffolding. Besides these modules, dedicated
metagenome scaffolding tools like Bambus 2 [31] may be used to determine if
additional scaffolding is needed. Bambus 2 accepts contigs constructed with
most assemblers using reads from all sequencing platforms. In the process
of building scaffolds from contigs, ambiguous and inconsistent contigs may
also be identified. Besides scaffolding, another approach for the assembly of
MAGs is contig binning, which places contigs derived from the same genome
into the same bin. Contigs in the same bin are then reassembled into a MAG.
15.5.2 Sequence Binning
As indicated above, metagenomic sequence binning refers to the pro-
cess of clustering sequence fragments in a mixture into different “bins”
TABLE 15.1
Commonly Used Binning Algorithms

Genome Binning
MaxBin 2.0: Classifies contigs into different bins using an Expectation-Maximization algorithm on the basis of their tetranucleotide frequencies and coverages [33]
CONCOCT: Combines coverage and tetramer frequency for contig binning using GMM [36]
MetaBAT 2: Uses adaptive binning to group the most reliable contigs first (such as those of high length), and then gradually adds remaining contigs [32]
MetaWatt: Uses multivariate statistics of tetramer frequencies and differential coverage information for binning; also assesses binning quality using taxonomic annotation of contigs in each bin [37]
VAMB: A machine-learning based binner that encodes k-mer distribution and sequence co-abundance information using variational autoencoders for subsequent binning [38]
GroopM: Bins sequences by primarily leveraging differential coverage information [34]
MetaBinner: An ensemble binner that integrates component binning results generated with multiple features and initiations [39]
MetaWRAP: An ensemble binner to generate hybrid bin sets from other binners and select final bins based on CheckM results [40]

Taxonomy Binning
MEGAN: Aligns sequences against the NCBI-nr reference database and then performs taxonomic binning using the naïve LCA algorithm [43]
Kraken 2: Assigns taxonomic labels to sequences based on search of k-mers within the sequences against a database of indexed and sorted k-mers (or their minimizers) extracted from all genomes [45, 48]
PhyloPythiaS+: Achieves taxonomic binning by building a sample-specific support vector machine taxonomic classifier using the most relevant taxa and training sequences determined automatically [46]
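Several of the genome binners in Table 15.1 use tetranucleotide (4-mer) composition as a feature. The following minimal Python sketch computes a normalized tetranucleotide frequency vector for a single made-up contig, illustrating the kind of input that such binners combine with coverage information.

# Sketch: normalized tetranucleotide (4-mer) frequency vector for one contig,
# the composition feature used (together with coverage) by several binners.
from collections import Counter
from itertools import product

def tetranucleotide_frequencies(sequence: str) -> dict:
    counts = Counter(
        sequence[i:i + 4]
        for i in range(len(sequence) - 3)
        if set(sequence[i:i + 4]) <= set("ACGT")  # skip windows containing N, etc.
    )
    total = sum(counts.values()) or 1
    # A fixed ordering over all 256 possible tetramers makes vectors comparable across contigs.
    return {"".join(t): counts["".join(t)] / total for t in product("ACGT", repeat=4)}

if __name__ == "__main__":
    contig = "ACGTACGTAGGCTAGCTAGGATCGATCGTTACGCGT"  # made-up contig
    freqs = tetranucleotide_frequencies(contig)
    # Show only the tetramers that actually occur in this toy contig.
    print({k: round(v, 3) for k, v in freqs.items() if v > 0})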
to a taxon at a higher level (e.g., phylum or class), while another read that
aligns to a less conserved gene that is limited to a select group of organisms
is assigned to a lower-level taxon (such as genus or species). As it is based on the current annotation of catalogued sequences, this approach is not suited to discovering currently unknown species or taxa.
15.5.4 Taxonomic Profiling
One goal of metagenome analysis is to profile the taxonomic composition of a microbial community and the relative abundance of each taxon. This is related to but different from taxonomic binning, which aims to group metagenomic sequences into different bins. The CAMI II challenge found that the taxonomic profilers MetaPhlAn [61] and mOTUs [62] had the best overall performance; both are based on the use of phylogenetic gene markers rather than taxonomic binning of reads. Phylogenetic gene markers are composed of ubiquitous but phylogenetically diverse genes, with good examples being the rRNA genes (e.g., 16S), recA (DNA recombinase A), rpoB (RNA polymerase beta subunit), fusA (protein chain elongation factor), and gyrB (DNA gyrase subunit B).
system (IMG/M) [80], GhostKOALA [81], or MGnify [82]. Some tools, such as eggNOG-mapper [83], provide both online and stand-alone versions.
References
1. Lloyd KG, Steen AD, Ladau J, Yin J, Crosby L. Phylogenetically novel
uncultured microbial cells dominate earth microbiomes. mSystems 2018,
3(5):e00055-18.
2. Yarza P, Ludwig W, Euzeby J, Amann R, Schleifer KH, Glockner FO, Rossello-
Mora R. Update of the All-Species Living Tree Project based on 16S and 23S
rRNA sequence analyses. Syst Appl Microbiol 2010, 33(6):291–299.
3. Delmont TO, Robe P, Clark I, Simonet P, Vogel TM. Metagenomic comparison
of direct and indirect soil DNA extraction approaches. J Microbiol Methods
2011, 86(3):397–400.
4. McIver LJ, Abu-Ali G, Franzosa EA, Schwager R, Morgan XC, Waldron L,
Segata N, Huttenhower C. bioBakery: a meta’omic analysis environment.
Bioinformatics 2018, 34(7):1235–1237.
5. BMTagger (http://biowulf.nih.gov/apps/bmtagger.html)
6. Schmieder R, Edwards R. Fast identification and removal of sequence contam-
ination from genomic and metagenomic datasets. PLoS One 2011, 6(3):e17288.
7. Bushnell B. BBMap: a fast, accurate, splice-aware aligner. Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States); 2014.
8. Xu H, Luo X, Qian J, Pang X, Song J, Qian G, Chen J, Chen S. FastUniq: a
fast de novo duplicates removal tool for paired short reads. PLoS One 2012,
7(12):e52249.
9. Nayfach S, Shi ZJ, Seshadri R, Pollard KS, Kyrpides NC. New insights from
uncultivated genomes of the global human gut microbiome. Nature 2019,
568(7753):505–510.
10. Almeida A, Mitchell AL, Boland M, Forster SC, Gloor GB, Tarkowska A,
Lawley TD, Finn RD. A new genomic blueprint of the human gut microbiota.
Nature 2019, 568(7753):499–504.
28. Simao FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM.
BUSCO: assessing genome assembly and annotation completeness with
single-copy orthologs. Bioinformatics 2015, 31(19):3210–3212.
29. Mineeva O, Rojas- Carulla M, Ley RE, Scholkopf B, Youngblut ND.
DeepMAsED: evaluating the quality of metagenomic assemblies. Bioinformatics
2020, 36(10):3011–3017.
30. Clark SC, Egan R, Frazier PI, Wang Z. ALE: a generic assembly likelihood
evaluation framework for assessing the accuracy of genome and metagenome
assemblies. Bioinformatics 2013, 29(4):435–443.
31. Koren S, Treangen TJ, Pop M. Bambus 2: scaffolding metagenomes.
Bioinformatics 2011, 27(21):2964–2971.
32. Kang DD, Li F, Kirton E, Thomas A, Egan R, An H, Wang Z. MetaBAT 2: an
adaptive binning algorithm for robust and efficient genome reconstruction
from metagenome assemblies. PeerJ 2019, 7:e7359.
33. Wu YW, Simmons BA, Singer SW. MaxBin 2.0: an automated binning algo-
rithm to recover genomes from multiple metagenomic datasets. Bioinformatics
2016, 32(4):605–607.
34. Imelfort M, Parks D, Woodcroft BJ, Dennis P, Hugenholtz P, Tyson GW.
GroopM: an automated tool for the recovery of population genomes from
related metagenomes. PeerJ 2014, 2:e603.
35. Rosella (https://github.com/rhysnewell/rosella)
36. Alneberg J, Bjarnason BS, de Bruijn I, Schirmer M, Quick J, Ijaz UZ, Lahti
L, Loman NJ, Andersson AF, Quince C. Binning metagenomic contigs by
coverage and composition. Nat Methods 2014, 11(11):1144–1146.
37. Strous M, Kraft B, Bisdorf R, Tegetmeyer HE. The binning of metagenomic
contigs for microbial physiology of mixed cultures. Front Microbiol 2012, 3:410.
38. Nissen JN, Johansen J, Allesoe RL, Sonderby CK, Armenteros JJA, Gronbech
CH, Jensen LJ, Nielsen HB, Petersen TN, Winther O et al. Improved
metagenome binning and assembly using deep variational autoencoders. Nat
Biotechnol 2021, 39(5):555–560.
39. Wang Z, Huang P, You R, Sun F, Zhu S. MetaBinner: a high-performance and
stand-alone ensemble binning method to recover individual genomes from
complex microbial communities. Genome Biol 2023, 24(1):1.
40. Uritskiy GV, DiRuggiero J, Taylor J. MetaWRAP-a flexible pipeline for genome-
resolved metagenomic data analysis. Microbiome 2018, 6(1):158.
41. Sieber CMK, Probst AJ, Sharrar A, Thomas BC, Hess M, Tringe SG, Banfield JF.
Recovery of genomes from metagenomes via a dereplication, aggregation and
scoring strategy. Nat Microbiol 2018, 3(7):836–843.
42. Meyer F, Fritz A, Deng ZL, Koslicki D, Lesker TR, Gurevich A, Robertson
G, Alser M, Antipov D, Beghini F et al. Critical Assessment of Metagenome
Interpretation: the second round of challenges. Nat Methods 2022, 19(4):429–440.
43. Huson DH, Beier S, Flade I, Gorska A, El-Hadidi M, Mitra S, Ruscheweyh
HJ, Tappu R. MEGAN Community Edition –Interactive Exploration and
Analysis of Large-Scale Microbiome Sequencing Data. PLoS Comput Biol 2016,
12(6):e1004957.
44. Huson DH, Albrecht B, Bagci C, Bessarab I, Gorska A, Jolic D, Williams RBH.
MEGAN-LR: new algorithms allow accurate binning and easy interactive
exploration of metagenomic long reads and contigs. Biol Direct 2018, 13(1):6.
63. Menzel P, Ng KL, Krogh A. Fast and sensitive taxonomic classification for
metagenomics with Kaiju. Nat Commun 2016, 7:11257.
64. Kim D, Song L, Breitwieser FP, Salzberg SL. Centrifuge: rapid and sensitive
classification of metagenomic sequences. Genome Res 2016, 26(12):1721–1729.
65. Lu J, Breitwieser FP, Thielen P, Salzberg SL. Bracken: estimating species abun-
dance in metagenomics data. PeerJ Comput Sci 2017, 30:e104.
66. Chaumeil PA, Mussig AJ, Hugenholtz P, Parks DH. GTDB-Tk: a toolkit to
classify genomes with the Genome Taxonomy Database. Bioinformatics 2019,
36(6):1925–1927.
67. Parks DH, Chuvochina M, Waite DW, Rinke C, Skarshewski A, Chaumeil PA,
Hugenholtz P. A standardized bacterial taxonomy based on genome phyl-
ogeny substantially revises the tree of life. Nat Biotechnol 2018, 36(10):996–1004.
68. Dilthey AT, Jain C, Koren S, Phillippy AM. Strain- level metagenomic
assignment and compositional estimation for long reads with MetaMaps. Nat
Commun 2019, 10(1):3066.
69. Fan J, Huang S, Chorlton SD. BugSeq: a highly accurate cloud platform for
long-read metagenomic analyses. BMC Bioinformatics 2021, 22(1):160.
70. Portik DM, Brown CT, Pierce-Ward NT. Evaluation of taxonomic classifica-
tion and profiling methods for long-read shotgun metagenomic sequencing
datasets. BMC Bioinformatics 2022, 23(1):541.
71. UniProt C. UniProt: the universal protein knowledgebase in 2021. Nucleic
Acids Res 2021, 49(D1):D480–D489.
72. Blum M, Chang HY, Chuguransky S, Grego T, Kandasaamy S, Mitchell A,
Nuka G, Paysan-Lafosse T, Qureshi M, Raj S et al. The InterPro protein families
and domains database: 20 years on. Nucleic Acids Res 2021, 49(D1):D 344–D354.
73. Galperin MY, Wolf YI, Makarova KS, Vera Alvarez R, Landsman D, Koonin
EV. COG database update: focus on microbial diversity, model organisms, and
widespread pathogens. Nucleic Acids Res 2021, 49(D1):D274–D281.
74. Huerta- Cepas J, Szklarczyk D, Heller D, Hernandez- Plaza A, Forslund
SK, Cook H, Mende DR, Letunic I, Rattei T, Jensen LJ et al. eggNOG
5.0: a hierarchical, functionally and phylogenetically annotated orthology
resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res 2019,
47(D1):D309–D314.
75. Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics 2014,
30(14):2068–2069.
76. Shaffer M, Borton MA, McGivern BB, Zayed AA, La Rosa SL, Solden LM, Liu
P, Narrowe AB, Rodriguez-Ramos J, Bolduc B et al. DRAM for distilling micro-
bial metabolism to automate the curation of microbiome function. Nucleic
Acids Res 2020, 48(16):8883–8900.
77. Tanizawa Y, Fujisawa T, Nakamura Y. DFAST: a flexible prokaryotic genome
annotation pipeline for faster genome publication. Bioinformatics 2018,
34(6):1037–1039.
78. Tatusova T, DiCuccio M, Badretdin A, Chetvernin V, Nawrocki EP, Zaslavsky
L, Lomsadze A, Pruitt KD, Borodovsky M, Ostell J. NCBI prokaryotic genome
annotation pipeline. Nucleic Acids Res 2016, 44(14):6614–6624.
79. Keegan KP, Glass EM, Meyer F. MG- RAST, a Metagenomics Service for
Analysis of Microbial Community Structure and Function. Methods Mol Biol
2016, 1399:207–233.
94. Noecker C, Eng A, Srinivasan S, Theriot CM, Young VB, Jansson JK, Fredricks
DN, Borenstein E. Metabolic model-based integration of microbiome taxo-
nomic and metabolomic profiles elucidates mechanistic links between eco-
logical and metabolic variation. mSystems 2016, 1(1):e00013–e00015.
95. Mallick H, Franzosa EA, McLver LJ, Banerjee S, Sirota-Madi A, Kostic AD,
Clish CB, Vlamakis H, Xavier RJ, Huttenhower C. Predictive metabolomic
profiling of microbial communities using amplicon or metagenomic
sequences. Nat Commun 2019, 10(1):3136.
96. Paulson JN, Stine OC, Bravo HC, Pop M. Differential abundance analysis for
microbial marker-gene surveys. Nat Methods 2013, 10(12):1200–1202.
97. McMurdie PJ, Holmes S. Waste not, want not: why rarefying microbiome
data is inadmissible. PLoS Comput Biol 2014, 10(4):e1003531.
98. Segata N, Izard J, Waldron L, Gevers D, Miropolsky L, Garrett WS,
Huttenhower C. Metagenomic biomarker discovery and explanation.
Genome Biol 2011, 12(6):R60.
99. Parks DH, Tyson GW, Hugenholtz P, Beiko RG. STAMP: statistical analysis
of taxonomic and functional profiles. Bioinformatics 2014, 30(21):3123–3124.
100. Mandal S, Van Treuren W, White RA, Eggesbo M, Knight R, Peddada SD.
Analysis of composition of microbiomes: a novel method for studying
microbial composition. Microb Ecol Health Dis 2015, 26:27663.
101. Lin H, Peddada SD. Analysis of compositions of microbiomes with bias
correction. Nat Commun 2020, 11(1):3514.
102. Martin BD, Witten D, Willis AD. Modeling microbial abundances and
dysbiosis with beta-binomial regression. Ann Appl Stat 2020, 14(1):94–115.
103. Mallick H, Rahnavard A, McIver LJ, Ma S, Zhang Y, Nguyen LH, Tickle TL,
Weingart G, Ren B, Schwager EH et al. Multivariable association discovery in
population-scale meta-omics studies. PLoS Comput Biol 2021, 17(11):e1009442.
104. Bolyen E, Rideout JR, Dillon MR, Bokulich NA, Abnet CC, Al-Ghalith GA,
Alexander H, Alm EJ, Arumugam M, Asnicar F et al. Reproducible, inter-
active, scalable and extensible microbiome data science using QIIME 2. Nat
Biotechnol 2019, 37(8):852–857.
105. Shi W, Qi H, Sun Q, Fan G, Liu S, Wang J, Zhu B, Liu H, Zhao F, Wang X
et al. gcMeta: a Global Catalogue of Metagenomics platform to support the
archiving, standardization and analysis of microbiome data. Nucleic Acids
Res 2019, 47(D1):D637–D648.
Part IV
make inroads into building long reads from short reads, while long-read platforms have also started to provide short reads.
5) Reduction in sequencing and sample preparation time: Different platforms use different strategies, from chemistry updates and hardware upgrades to algorithmic improvements, to achieve quicker sequencing turnaround times. To cut library preparation time, constant improvements are being made to chemistries and protocols with increased amenability to automation.
6) Decreased requirement on the amount of starting material: Historically, NGS required large amounts of DNA or RNA as input. With the drive to accommodate more sample types, such as those that do not generate much DNA or RNA (e.g., liquid biopsies and single cells), the sensitivity of library preparation reagents and procedures has been significantly improved.
FIGURE 16.1
MGI/BGI nanoball sequencing and CoolMPS chemistry. A. Nanoball sequencing starts from
circularization of DNA target molecules. After rolling circle amplification, nanoballs are formed
from circularized targets, and subsequently deposited onto a silicon-based sequencing chip for
sequencing using the cPAS process. B. In the CoolMPS chemistry (initially called CoolNGS),
nucleotide-specific antibodies are used for detection of incorporated nucleotides. This detection
mechanism avoids DNA “scars” derived from labeling of nucleotides with fluorescent tags,
potentially leading to increased read length. (From Gao, G., Smith, D.I. Clinical Massively Parallel Sequencing. Clinical Chemistry, 2020, 66(1): 77–88, by permission from Oxford University Press.)
each labeled with a specific type of fluorescent dye for the detection step in each cycle (Figure 16.1). To increase the ratio of sequencing signal to background noise, multiple fluorescent dye molecules are attached to each antibody molecule. This chemistry has been shown to have the potential to produce longer and more accurate reads at lower cost [5].
Among emerging technologies, Element Biosciences offers a system based on Avidity chemistry. Although still based on the same basic SBS process, the Element system separates signal detection from nucleotide incorporation: the system does not collect sequencing signal from the nucleotide incorporation process. Instead, the signal is collected from the binding of nucleotides to the sequencing template. With Avidity chemistry [6], each nucleotide attaches to a fluorescence-emitting core, and each type of nucleotide (A, C, G, or T) attaches to its own core, which emits a specific fluorescence signal for detection. The most innovative aspect of this chemistry is that each core contains multiple fluorophores and connects to multiple copies
FIGURE 16.2
The increase in the number of single-cell RNA-seq algorithm-related publications from 2013 till
2021. (Data source: Google Scholar, using “single cell RNA-seq” AND (algorithm OR method OR
tool) as query term.)
rich information it can provide for analysis of gene regulatory programs [28].
Compared to tools developed for scRNA-seq, scATAC-seq data analysis tools are still fewer in number and not as well developed. Currently available tools include SnapATAC [29], cisTopic [30], SCALE [31], and Signac [32].
Single-cell genome/exome sequencing, with the goal of tracking som-
atic evolution and revealing genetic heterogeneity at the single-cell level,
has been made possible with whole genome amplification methods. These
methods include multiple displacement amplification (MDA) [33], multiple
annealing and loop-based amplification cycles (MALBAC) [34], degenerate
oligonucleotide-primed PCR (DOP-PCR) [35], or commercially available kits
such as PicoPLEX and RepliG [36]. As each diploid cell only has two sets of
chromosomes and therefore a very low amount of DNA (~6.5 pg in a typical
human cell), the coverage of whole genome amplification is typically uneven
across the genome due to stochastic effects, amplification errors, and locus-
specific amplification bias. Such issues have prevented scaling-up of single-
cell genomics to a level that can be comparable to single-cell transcriptomics.
Progress has been made to overcome these issues with the development of
emulsion MDA (or eMDA) [37], single droplet MDA (or sd-MDA) [38], and
direct library preparation (DLP) [39]. To call SNVs from single-cell genomics
data, currently a relatively short list of tools is available, including SCcaller
[40], SCAN-SNV [41], and Single Cell Genotyper [42]. SNV calling can also be performed from scRNA-seq data, or from coupled scDNA-seq and scRNA-seq data,
using tools such as SSrGE [43]. Tools for calling CNVs or structural variants
include AneuFinder [44] and Ginkgo [45].
Single-cell epigenomics offers another dimension for single-cell sequencing.
Strategies such as single-cell whole genome bisulfite sequencing (scWGBS)
[46], single-cell reduced representation bisulfite sequencing (scRRBS) [47],
and single-nucleus methylome sequencing version 2 (snmC-seq2) [48] have been used to detect DNA methylation as an epigenetic marker for cell typing. Tools such as scBS-map [49] can be used for read alignment, and Methylpy [50] for calling of unmethylated and methylated cytosines. Simultaneous inter-
rogation of both the epigenome and genome of single cells has also been
made possible with methods such as epi-gSCAR [51]. Compared to scRNA-
seq and scATAC-seq conducted on high-throughput platforms such as 10×
Chromium, the throughput of single-cell genome and epigenome sequencing
is still lower, although this may well change over time.
While single-cell sequencing offers unprecedented resolution, isolation of
single cells (see Chapter 8) typically leads to the loss of contextual informa-
tion about their original location in their native tissue microenvironment.
Investigation into such spatial information provides insights on regional
specificity and cross-region heterogeneity, e.g., when comparing a patho-
genic region with the surrounding normal region on the same slide. Spatial
transcriptomics, enabled by rapid technology development from both aca-
demia and industry [52], is increasingly used to provide this additional layer of information.
References
1. Drmanac R, Sparks AB, Callow MJ, Halpern AL, Burns NL, Kermani BG,
Carnevali P, Nazarenko I, Nilsen GB, Yeung G et al. Human genome sequen-
cing using unchained base reads on self-assembling DNA nanoarrays. Science
2010, 327(5961):78–81.
33. Dean FB, Nelson JR, Giesler TL, Lasken RS. Rapid amplification of plasmid
and phage DNA using Phi 29 DNA polymerase and multiply-primed rolling
circle amplification. Genome Res 2001, 11(6):1095–1099.
34. Zong C, Lu S, Chapman AR, Xie XS. Genome- wide detection of single-
nucleotide and copy-number variations of a single human cell. Science 2012,
338(6114):1622–1626.
35. Telenius H, Carter NP, Bebb CE, Nordenskjold M, Ponder BA, Tunnacliffe A.
Degenerate oligonucleotide-primed PCR: general amplification of target DNA
by a single degenerate primer. Genomics 1992, 13(3):718–725.
36. Imamura H, Monsieurs P, Jara M, Sanders M, Maes I, Vanaerschot M,
Berriman M, Cotton JA, Dujardin JC, Domagalska MA. Evaluation of whole
genome amplification and bioinformatic methods for the characterization of
Leishmania genomes at a single cell level. Sci Rep 2020, 10(1):15043.
37. Fu Y, Zhang F, Zhang X, Yin J, Du M, Jiang M, Liu L, Li J, Huang Y, Wang J.
High-throughput single-cell whole-genome amplification through centrifugal
emulsification and eMDA. Commun Biol 2019, 2:147.
38. Hosokawa M, Nishikawa Y, Kogawa M, Takeyama H. Massively parallel whole
genome amplification for single-cell sequencing using droplet microfluidics.
Sci Rep 2017, 7(1):5199.
39. Zahn H, Steif A, Laks E, Eirew P, VanInsberghe M, Shah SP, Aparicio S,
Hansen CL. Scalable whole-genome single-cell library preparation without
preamplification. Nat Methods 2017, 14(2):167–173.
40. Dong X, Zhang L, Milholland B, Lee M, Maslov AY, Wang T, Vijg J. Accurate
identification of single-nucleotide variants in whole-genome-amplified single
cells. Nat Methods 2017, 14(5):491–493.
41. Luquette LJ, Bohrson CL, Sherman MA, Park PJ. Identification of somatic
mutations in single cell DNA-seq using a spatial model of allelic imbalance.
Nat Commun 2019, 10(1):3908.
42. Roth A, McPherson A, Laks E, Biele J, Yap D, Wan A, Smith MA, Nielsen
CB, McAlpine JN, Aparicio S et al. Clonal genotype and population struc-
ture inference from single- cell tumor sequencing. Nat Methods 2016,
13(7):573–576.
43. Poirion O, Zhu X, Ching T, Garmire LX. Using single nucleotide variations
in single-cell RNA-seq to identify subpopulations and genotype-phenotype
linkage. Nat Commun 2018, 9(1):4892.
44. Bakker B, Taudt A, Belderbos ME, Porubsky D, Spierings DC, de Jong TV,
Halsema N, Kazemier HG, Hoekstra-Wakker K, Bradley A et al. Single-cell
sequencing reveals karyotype heterogeneity in murine and human malignan-
cies. Genome Biol 2016, 17(1):115.
45. Garvin T, Aboukhalil R, Kendall J, Baslan T, Atwal GS, Hicks J, Wigler M,
Schatz MC. Interactive analysis and assessment of single-cell copy-number
variations. Nat Methods 2015, 12(11):1058–1060.
46. Smallwood SA, Lee HJ, Angermueller C, Krueger F, Saadeh H, Peat J, Andrews
SR, Stegle O, Reik W, Kelsey G. Single-cell genome-wide bisulfite sequencing
for assessing epigenetic heterogeneity. Nat Methods 2014, 11(8):817–820.
47. Guo H, Zhu P, Guo F, Li X, Wu X, Fan X, Wen L, Tang F. Profiling DNA
methylome landscapes of mammalian cells with single- cell reduced-
representation bisulfite sequencing. Nat Protoc 2015, 10(5):645–659.
48. Luo C, Rivkin A, Zhou J, Sandoval JP, Kurihara L, Lucero J, Castanon R, Nery
JR, Pinto-Duarte A, Bui B et al. Robust single-cell DNA methylome profiling
with snmC-seq2. Nat Commun 2018, 9(1):3824.
49. Wu P, Gao Y, Guo W, Zhu P. Using local alignment to enhance single-cell
bisulfite sequencing data efficiency. Bioinformatics 2019, 35(18):3273–3278.
50. Schultz MD, He Y, Whitaker JW, Hariharan M, Mukamel EA, Leung D,
Rajagopal N, Nery JR, Urich MA, Chen H et al. Human body epigenome
maps reveal noncanonical DNA methylation variation. Nature 2015,
523(7559):212–216.
51. Niemoller C, Wehrle J, Riba J, Claus R, Renz N, Rhein J, Bleul S, Stosch JM,
Duyster J, Plass C et al. Bisulfite-free epigenomics and genomics of single cells
through methylation-sensitive restriction. Commun Biol 2021, 4(1):153.
52. Liao J, Lu X, Shao X, Zhu L, Fan X. Uncovering an organ’s molecular archi-
tecture at single-cell resolution by spatially resolved transcriptomics. Trends
Biotechnol 2021, 39(1):43–58.
53. Stickels RR, Murray E, Kumar P, Li J, Marshall JL, Di Bella DJ, Arlotta P,
Macosko EZ, Chen F. Highly sensitive spatial transcriptomics at near-cellular
resolution with Slide-seqV2. Nat Biotechnol 2021, 39(3):313–319.
54. Dries R, Zhu Q, Dong R, Linus Eng C-H, Li H, Liu K, Fu Y, Zhao T, Sarkar A,
Bao F et al. Giotto: a toolbox for integrative analysis and visualization of spa-
tial expression data. Genome Biol 2021, 22(1):78.
55. Zhao E, Stone MR, Ren X, Guenthoer J, Smythe KS, Pulliam T, Williams SR,
Uytingco CR, Taylor SEB, Nghiem P et al. Spatial transcriptomics at subspot
resolution with BayesSpace. Nat Biotechnol 2021, 39(11):1375–1384.
56. Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai CX, Efron MJ, Iyer R, Schatz
MC, Sinha S, Robinson GE. Big Data: Astronomical or Genomical? PLoS Biol
2015, 13(7).
57. Schmidt B, Hildebrandt A. Next-generation sequencing: big data meets high
performance computing. Drug Discov Today 2017, 22(4):712–717.
58. Olson ND, Wagner J, McDaniel J, Stephens SH, Westreich ST, Prasanna AG,
Johanson E, Boja E, Maier EJ, Serang O et al. PrecisionFDA Truth Challenge
V2: Calling variants from short-and long-reads in difficult-to-map regions.
Cell Genom 2022, 2(5):100129.
59. Poplin R, Chang PC, Alexander D, Schwartz S, Colthurst T, Ku A, Newburger
D, Dijamco J, Nguyen N, Afshar PT et al. A universal SNP and small-indel
variant caller using deep neural networks. Nat Biotechnol 2018, 36(10):983–987.
60. Shafin K, Pesout T, Chang PC, Nattestad M, Kolesnikov A, Goel S, Baid G,
Kolmogorov M, Eizenga JM, Miga KH et al. Haplotype-aware variant calling
with PEPPER-Margin-DeepVariant enables high accuracy in nanopore long-
reads. Nat Methods 2021, 18(11):1322–1332.
61. Luo R, Wong C-L, Wong Y-S, Tang C-I, Liu C-M, Leung C-M, Lam T-W.
Exploring the limit of using a deep neural network on pileup data for germline
variant calling. Nat Mach Intell 2020, 2(4):220–227.
62. Zheng Z, Li S, Su J, Leung AW-S, Lam T-W, Luo R. Symphonizing pileup and
full-alignment for deep learning-based long-read variant calling. Nat Comput
Sci 2021, 2(12):797–803.
63. Ahsan MU, Liu Q, Fang L, Wang K. NanoCaller for accurate detection of
SNPs and indels in difficult-to-map regions from long-read sequencing by
haplotype-aware deep neural networks. Genome Biol 2021, 22(1):261.
64. Edge P, Bansal V. Longshot enables accurate variant calling in diploid genomes
from single-molecule long read sequencing. Nat Commun 2019, 10(1):4660.
65. Cai L, Wu Y, Gao J. DeepSV: accurate calling of genomic deletions from high-
throughput sequencing data using deep convolutional neural network. BMC
Bioinformatics 2019, 20(1):665.
66. Hill T, Unckless RL. A Deep Learning Approach for Detecting Copy Number
Variation in Next-Generation Sequencing Data. G3 2019, 9(11):3575–3582.
67. Wang L, Xi Y, Sung S, Qiao H. RNA-seq assistant: machine learning based
methods to identify more transcriptional regulated genes. BMC Genomics
2018, 19(1):546.
68. Bonet J, Chen M, Dabad M, Heath S, Gonzalez-Perez A, Lopez-Bigas N,
Lagergren J. DeepMP: a deep learning tool to detect DNA base modifications
on Nanopore sequencing data. Bioinformatics 2022, 38(5):1235–1243.
69. Ni P, Huang N, Nie F, Zhang J, Zhang Z, Wu B, Bai L, Liu W, Xiao CL, Luo F
et al. Genome-wide detection of cytosine methylations in plant from Nanopore
data using deep learning. Nat Commun 2021, 12(1):5976.
70. Tse OYO, Jiang P, Cheng SH, Peng W, Shang H, Wong J, Chan SL, Poon LCY,
Leung TY, Chan KCA et al. Genome-wide detection of cytosine methyla-
tion by single molecule real-time sequencing. Proc Natl Acad Sci U S A 2021,
118(5):e2019768118.
71. Hollister EB, Oezguen N, Chumpitazi BP, Luna RA, Weidler EM, Rubio-
Gonzales M, Dahdouli M, Cope JL, Mistretta TA, Raza S et al. Leveraging
human microbiome features to diagnose and stratify children with irritable
bowel syndrome. J Mol Diagn 2019, 21(3):449–461.
72. Abraham J, Heimberger AB, Marshall J, Heath E, Drabick J, Helmstetter A, Xiu
J, Magee D, Stafford P, Nabhan C et al. Machine learning analysis using 77,044
genomic and transcriptomic profiles to accurately predict tumor type. Transl
Oncol 2021, 14(3):101016.
Appendix I
Common File Types Used in NGS
Data Analysis
BAM: A file format for storing reads alignment data. It is the binary version
of the SAM format (see below). Compared to its equivalent SAM file, a
BAM file is considerably smaller in size and much faster to load. Unlike
SAM files, however, the BAM format is not human-readable. BAM files
have a file extension of .bam. Some tools require BAM files to be indexed.
Besides the .bam file, an indexed BAM file also has a companion index
file of the same name but with a different file extension (.bai).
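As a brief illustration of working with an indexed BAM file, the following Python sketch uses the pysam library (assumed to be installed) to fetch reads from a region; the file name and coordinates are hypothetical.

# Sketch: read an indexed BAM with pysam. Assumes example.bam has a companion
# example.bam.bai index; region queries via fetch() require that index.
import pysam

with pysam.AlignmentFile("example.bam", "rb") as bam:
    for read in bam.fetch("chr19", 45408000, 45409000):
        print(read.query_name, read.reference_start, read.mapping_quality)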
BCF: Binary VCF (see VCF). While it is equivalent to VCF, BCF is much
smaller in file size due to compression, and therefore achieves high effi-
ciency in file transfer and parsing.
BCL: Binary basecall files generated from Illumina’s proprietary basecalling
process.
BED: Browser Extensible Display format used to describe genes or other
genomic features in a genome browser. It is a tab-delimited text format
that defines how genes or genomic features are displayed as an anno-
tation track in a genome browser such as the UCSC Genome Browser.
Each entry line contains three mandatory fields (chrom, chromStart, and
chromEnd, specifying for each genomic feature the particular chromo-
some it is located on and the start and end coordinates) and nine optional
fields. Binary PED files (see below) are also referred to as BED files, but
this is a totally different file format.
bedGraph: Similar to the BED format, bedGraph provides descriptions of
genomic features for their display in a genome browser. Distinctively
the bedGraph format allows display of continuous values, such as prob-
ability scores and coverage depth, in a genome.
bigBed: A format similar to BED, but bigBed files are binary, compressed,
and indexed. Display of bigBed files in a genome browser is significantly
faster due to the compression and indexing, which allow transmittal of
only the part of the file that is needed for the current view instead of the
entire file.
bigWig: A format for visualization of dense, continuous data, such as GC content, in a genome browser. A newer format derived from the WIG format (see below), bigWig is a compressed and indexed binary file format and loads significantly faster.
HDF5 (or H5): Standing for Hierarchical Data Format version 5. HDF5 is an
open-source file format designed to store and organize large and com-
plex data. The hierarchical structure it uses is similar to a file system, in
that its two major objects, groups and datasets, are similar to directories
and files, respectively. The FAST5 file format (see FAST5) used by Oxford
Nanopore sequencers and the single-cell RNA-seq gene-cell matrix data
used by the 10× Genomics Cell Ranger software are based on the HDF5
format.
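As a generic illustration, the following Python sketch uses the h5py library (assumed to be installed) to walk the group/dataset hierarchy of an HDF5 file; the file name is hypothetical, and the actual internal layout depends on the tool that produced the file.

# Sketch: inspect the group/dataset hierarchy of an HDF5 file with h5py.
import h5py

def describe(name, obj):
    kind = "group" if isinstance(obj, h5py.Group) else "dataset"
    print(f"{kind}: /{name}")

with h5py.File("example.h5", "r") as f:
    f.visititems(describe)  # walk the hierarchy like a directory tree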
MEX: Market Exchange format. In 10× Genomics single-cell RNA-seq, the MEX file format is used by the Cell Ranger software to output the gene-cell data matrix (besides HDF5). It is a sparse matrix format because of the large number of zeros contained in the matrix. The output comprises three files, i.e., matrix.mtx, which contains the gene-cell barcode matrix; barcodes.tsv, which stores the cell barcodes; and genes.tsv, which stores the genes.
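As an illustration of loading MEX output, the following Python sketch reads the three files with SciPy and pandas (both assumed to be installed); the directory name is hypothetical.

# Sketch: load a MEX-format output directory (directory name is made up).
import pandas as pd
from scipy.io import mmread

matrix = mmread("filtered_matrix/matrix.mtx").tocsc()  # sparse genes x cells matrix
barcodes = pd.read_csv("filtered_matrix/barcodes.tsv", header=None, sep="\t")[0]
genes = pd.read_csv("filtered_matrix/genes.tsv", header=None, sep="\t")

print("matrix shape (genes x cells):", matrix.shape)
print("first barcode:", barcodes.iloc[0])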
PED: A file format used by PLINK (a toolset for genome-wide association
analysis) that contains pedigree/phenotype data.
SAM: Standing for Sequence Alignment/Map, SAM is a standard NGS read alignment file format, describing how reads are mapped to a reference genome. It is a tab-delimited, human-readable text format. SAM files can be converted into their compressed binary equivalent (BAM) for faster parsing and file size reduction. SAM files have a file extension of .sam. An indexed SAM file also has an accompanying index file that has a file extension of .sai.
SFF (Standard Flowgram Format): A type of binary sequencing file generated
by 454 sequencers. Can be converted to the FASTQ format using utilities
such as sff2fastq.
VCF: Stands for Variant Call Format. A commonly used file format for storing variant calls. It is a tab-delimited, human-readable text format that contains meta-information lines, a header line, and data lines that describe each variant.
WIG: Wiggle Track Format. It is used for displaying a continuous data track, such as GC content, in a genome viewer such as the UCSC Genome Browser. The WIG format is similar to the bedGraph format (see above), but a major difference between the two is that data exported from a WIG track is not as well preserved as that from a bedGraph track. The WIG format can be converted to bigWig (see above) for improved performance.
Appendix II
Glossary
Paired-End Reads: Reads obtained from the two ends of a DNA fragment.
Since the length of the DNA fragment, i.e., the distance between the
reads, is known, use of paired-end reads provides additional positional
information in mapping or assembly of the reads. In comparison to
Single-End Reads.
Pathway: A succession of molecular events that leads to a cellular response
or product. Each of such events is usually carried out by a gene
product. Many biological pathways are involved in metabolism, signal
transduction, and gene expression regulation.
PCA: Principal Component Analysis. A dimensionality reduction technique
to help summarize and visualize large and complex datasets. PCA is
widely used in next-gen sequencing applications such as bulk and single-
cell RNA-seq.
PCR Bottleneck Coefficient (PBC): An index of sequencing library complexity. It is calculated after the read mapping step as the ratio between the number of genome locations to which exactly one unique sequence read maps and the total number of genome locations to which one or more unique reads map. PBC thus measures how strongly the read-count distribution is concentrated at one read per location, with values close to 1 indicating a complex, minimally bottlenecked library.
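A minimal Python sketch of the PBC calculation, using made-up mapped-read locations:

# PBC = N1 / Nd, where N1 is the number of genomic locations covered by exactly
# one read and Nd the number of locations covered by at least one read.
from collections import Counter

# Hypothetical mapped-read start locations (chromosome, position).
locations = [("chr1", 100), ("chr1", 100), ("chr1", 250), ("chr2", 75), ("chr2", 900)]

counts = Counter(locations)
n1 = sum(1 for c in counts.values() if c == 1)  # locations with exactly one read
nd = len(counts)                                # locations with one or more reads
print("PBC =", n1 / nd)                         # values near 1 suggest low PCR duplication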
Phred Quality Score (Q Score): An integer value that is used to estimate the probability of making an error, i.e., calling a base incorrectly. It is calculated as Q = -10 × log10(P), where P is the probability of an incorrect basecall. For example, a Q score of 20 (Q20) means a 1/100 chance of making a wrong call, while Q30 represents a 1/1000 chance, which is considered a high-confidence score. Q scores are often represented as ASCII characters for brevity.
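A minimal Python sketch of the relationship between error probability, Q score, and the common Phred+33 ASCII encoding used in FASTQ files:

# Convert a basecall error probability to a Phred quality score and to its
# Phred+33 ASCII character (the offset used in most FASTQ files).
import math

def phred_q(p_error: float) -> int:
    return round(-10 * math.log10(p_error))

for p in (0.01, 0.001):
    q = phred_q(p)
    print(f"P(error)={p}  ->  Q{q}  ->  ASCII '{chr(q + 33)}'")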
Picard: A set of tools written in Java for handling NGS data and file formats.
Pileup: A file format created with SAMtools showing, for each genomic coordinate, the bases from all aligned reads that match or mismatch the reference sequence.
piRNA: Piwi-interacting RNA. See Small RNA.
Polymerase Chain Reaction (PCR): A molecular biology technique that
amplifies the amount of a DNA or RNA fragment, with the use of specific
oligonucleotide primers that flank the two ends of the target fragment.
Promoter: DNA sequence upstream of the open reading frame of a gene. The promoter region is recognized by RNA polymerase during initiation of transcription and contains highly conserved sequence motifs.
Proteome: The complete set of proteins in a cell, tissue, or organ at a certain
point of time. Proteomics analyzes a proteome via identifying individual
component proteins in the repertoire and their abundance.
Quality Score: See Basecall Quality Score.
Read: Sequence readout of a DNA (or RNA) fragment.
RNA-Seq: Stands for RNA sequencing. Also referred to as whole
transcriptome shotgun sequencing. RNA-seq is a major technology for
transcriptome analysis and a major application of NGS.
Appendix II 393
Index
A
Ab initio splice junction detection, 123
A-Bruijn variant, 279
ABySS, 271, 279, 282
ACDtool, 132
Adapter ligation, 61, 73, 74, 75
AI-based decision support tools, 256
Alignment methods, 92
Allele-specific expression (ASE) analysis, 186
ALLPATHS-LG, 275, 282, 285
  error correction module, 275
  reference-assisted assembly approach, 285
Alu element, 29
Alzheimer's disease (AD), 11, 31
Amazon, 107
Amazon Web Services (AWS) Management Console, 108
American Association for Cancer Research (AACR), 246
American College of Medical Genetics and Genomics (ACMG), 248
AMP rules, 255
  pathogenicity/benignity evidence, combination of, 255
American Society of Clinical Oncology (ASCO), 254
ANNOVAR tool, 245
Anti-oxidant response element (ARE), 26
AnVIL, 109
APP gene, 31
Application Programmer Interfaces (APIs), 112, 246
Application-specific integrated circuit (ASIC), 67
ApplyBQSR, 217
ARAGORN, 351
Argonaute processing, 206
Artificial intelligence, in variant reporting, 256–257
Artificial neural networks (ANNs), 111, 168, 217
Ascorbic acid, 29
AsmVar, 227
Assay for Transposase-Accessible Chromatin, 372
Assembling contigs, into scaffold, 281
Assembly Likelihood Evaluation (ALE), 348
Association for Molecular Pathology (AMP), 248
ATAC-seq data, 300
AT-overhang-based adapter ligation process, 75
ATP synthesis
  cytoplasmic membrane, 11
  proton gradient, 12
AU-rich element, 43
Automated Cell Type Identification using Neural Networks (ACTINN), 176
Average silhouette width (ASW), 164
Avidity chemistry, 367

B
Balrog, 351
Barcode, 73
BaseRecalibrator, 217
Batch effects, correction, 163
BatchQC, 129
Bayesian approach, 178, 352
Bayesian mixed, 315
Bayesian modeling, scDD employs, 180
BaySeq, 130
BBTools suite, 347
BCFtools, 218
Bcftools mpileup, 218
"BED-file" mode, 308
BedGraph
  file format, 332
  track files, 309
BEDTools, 309
Benjamini–Hochberg approach, 301, 306
Binning algorithms, 350
Biocontainers, 372
RNA-seq data, 129, 178, 284, 315
  batch effect removal, 129
  data analysis, 98, 132
  data distribution, 121
  DE analysis, 313
  differentially expressed genes, identification of, 129–133
  differential splicing analysis, 136–137
  discovery tool, 137–138
  experimental design, 118
  gene clustering, 134
  identified genes, functional analysis of, 134–136
  multiple testing correction, 133–134
  normalization, 127–128
  overdispersion problem, 131
  reads, subsequent mapping of, 122
  visualization of, 137
RnaSeqSampleSize, 121
RNA-seq sequencing libraries, 120
RNA-seq study, 106
rnaSPAdes, de novo transcriptome assemblers, 125
RNA splicing, non-canonical, 138
RNA transcripts, 41, 117
"RNA world" hypothesis, 45
Roche's NAVIFY Mutation Profiler, 256
R package, 109–111, 137, 307
rpoB (RNA polymerase beta subunit), 351
rRNA depletion, 120, 125
rRNA genes, 351
RSAT peak-motifs, 314
RSEM, 126
RTG Tools, 225
Rungs, 17
RUVSeq, 129

S
SAM/BAM alignment section, 93
SAM/BAM files, 95, 97
  file format, 93
  FLAG status, 94
  for storing NGS read alignment, 94
SAMtools, 97, 185
  packages, 96
  pileup file format, 97
Sanger sequence assemblers, 271
Sanger sequencing method, 57–59, 77, 259, 271
Savant, 223
Scaffolding algorithms, 281
SCALE, 373
Scanorama, 163
Scanpy, 173
SCAN-SNV, 373
scATAC-seq data analysis tools, 373
SCcaller, 373
SciDAP, 371
Scmap-cell, 176
Scmap-cluster mode, 176
scPred, 176
scRNA-seq analysis, goal of, 171
scRNA-seq data analysis, 152, 176
  algorithms, 370
  DE analysis, 179
  normalization approaches, 161
  single nucleotide variation (SNV), 185
  structure of, 154
  workflow, 155
scRNA-seq tools, 181
SeattleSeq, 229
Seed-and-extend methods, 86
Self-Organizing Map (SOM), 132
Sentieon, 243
SeqMonk, 97
Sequence Read Archive (SRA), 104
Sequencing-by-desynthesis process, 368
Sequencing error correction, 274, 276
Sequencing library preparation protocols, 344
Seurat, 172, 173
Seurat Integration, 163
Seven Bridges, 371
Seven Bridges GRAF pipeline, 375
SGA, 282
Shannon entropy, in QDMR, 332
Shared Nearest Neighbor (SNN) graph, 172
Short Oligonucleotide Alignment Program (SOAP), 87
Short Tandem Repeats (STR), 240
Signac, 373
Signal imputation, 164
Signal sparsity, challenges of, 186
Signal-to-noise ratio, 368