Bioinformatics/Computational Tools for NGS Data Analysis: An Overview
Dr. S. Balaji
Bauplechain Technologies Private Limited
www.bauplechain.com
Agenda
• Next Generation Sequencing (NGS)
• Second Generation Sequencing
Platforms
• Third Generation Sequencing
Platforms
• NGS Bioinformatics
• Tools for Primary Analysis
• Tools for Secondary Analysis
• Tools for Tertiary Analysis
• NGS Pitfalls
• Concluding Remarks
2
Genetics and Medical Practice
• Genetics is extremely important to medical practice.
• Provides a definitive diagnosis for many clinically heterogeneous diseases;
• Enables accurate disease prognosis;
• Provides guidance towards the selection of the best possible options of care for patients.
• Its current potential derives from the capacity to interrogate the human genome at different levels, from chromosomal down to single-base alterations.
• Pioneering work on DNA sequencing made several advances possible, resulting in Sanger sequencing and later the automated DNA sequencer, which facilitated the sequencing of the human genome in 2001.
• R&D in nanotechnology and informatics contributed to the new generation of sequencing methods.
3
DNA Sequencing Timeline
(Figure: timeline of DNA sequencing technologies. NG—next generation; PCR—polymerase chain reaction; SMS—single molecule sequencing; SeqLL—sequence the lower limit.)
4
Next Generation Sequencing (NGS)
• New approaches were developed to complement and eventually replace Sanger sequencing.
• This technology is collectively referred to as Next-Generation Sequencing (NGS) or Massively Parallel Sequencing (MPS).
• An umbrella term designating a wide diversity of approaches.
• Through this technology, it is possible to generate massive amounts of data per instrument run, in a faster and more cost-effective way.
• It is now possible to sequence, in parallel, many genes or even the entire genome.
5
• Global market is projected to reach 21.62 billion US dollars by
2025, growing at about 20% from 2017 to 2025.
• Several brands are presently on the NGS market; the top sequencing companies include:
• Illumina
• Ion Torrent (Thermo Fisher Scientific)
6
NGS Library Preparation
• A library is a collection of DNA/RNA fragments that represents either the entire genome/transcriptome or a target region.
• Each NGS platform has its specificities but, in simple terms, the preparation of an NGS library starts with the fragmentation of the nucleic acid.
7
NGS Library Preparation
• The first step to prepare libraries in most NGS workflows is the fragmentation of the nucleic acid.
• Fragmentation can be done either by physical or enzymatic methods.
• Physical methods include acoustic shearing, sonication and hydrodynamic shearing.
• Enzymatic methods include digestion by DNase or Fragmentase.
9
NGS Library Preparation (contd.)
• Upon nucleic acid fragmentation, the fragments are selected according to the desired library size.
• The size is limited both by the type of NGS instrument and by the specific sequencing application.
• Short-read sequencers, such as Illumina and Ion Torrent, give the best results when DNA libraries contain shorter fragments of similar sizes.
• Illumina fragments are longer than those in Ion Torrent.
• In Illumina, the fragments can go up to 1500 bases in length.
• In Ion Torrent, the fragments can go up to 400 bases in length.
• Long-read sequencers, like the PacBio RS II, produce ultra-long reads.
NGS Library Preparation (contd.)
• Two methods are commonly used for such targeted approaches:
• capture hybridization-based sequencing and
• amplicon-based sequencing.
• In the hybrid capture method, upon the fragmentation step, the fragmented molecules are hybridized specifically to DNA probes targeting the regions of interest.
12
Second-Generation Sequencing Platforms
• Illumina commercializes several integrated systems for the analysis of genetic variation and biological function that can be applied in multiple biological systems, from agriculture to medicine.
• The process of Illumina sequencing is based on the sequencing-by-synthesis (SBS) concept.
• Individual molecules are captured on a solid surface, followed by bridge PCR that allows their amplification into small clusters of identical molecules.
• This amplification results in the formation of dense clusters of double-stranded DNA in each channel of the flow cell.
• Sequencing can then occur by the addition to the template (on the flow cell) of a single labelled complementary deoxynucleotide triphosphate (dNTP), which serves as a "reversible terminator".
Third-Generation Sequencing Platforms (contd.)
• The first sequencers using the SMRT technology faced the drawbacks of limited throughput, higher costs and higher error rates.
• However, significant improvements have been made to overcome these limitations.
• More recently, PacBio launched the Sequel II System, which claims to reduce project costs and timelines with highly accurate individual long reads (HiFi reads, up to 175 kb).
• Using a PacBio system, researchers demonstrated a successful application of long-read genome sequencing to identify a pathogenic variant in a patient with Mendelian disease.
• This suggests that the technology has significant potential for the identification of disease-causing structural variation.
19
Third-Generation Sequencing Platforms (contd.)
• A second approach to single-molecule sequencing was commercialized by Oxford Nanopore Technologies, named MinION, and made commercially available in 2015.
• This sequencer does not rely on SBS but instead on the changes in electrical current as each nucleotide (A, C, T and G) passes through the nanopore.
• Nanopore sequencing uses electrophoresis to transport an unknown sample through a small opening; an ionic current passes through the nanopore and is changed as the bases pass through the pore in different combinations.
• This information allows each molecule to be identified and the sequencing to be performed.
• In May 2017, Oxford Nanopore Technologies launched the GridION Mk1, a flexible benchtop sequencing and analysis device that offers real-time, long-read, high-fidelity DNA and RNA sequencing.
• It was designed to allow up to five experiments to run simultaneously or individually, with simple library preparation, enabling the generation of up to 150 Gb of data during a run.
20
Third-Generation Sequencing Platforms (contd.)
• New advances were launched with the PromethION 48 system, which offers 48 flow cells; each flow cell allows up to 3000 nanopores to sequence simultaneously, delivering up to 7.6 Tb in 72 hours.
• Longer reads are of utmost importance to unveil repetitive elements and complex sequences, such as transposable elements, segmental duplications and telomeric/centromeric regions, that are difficult to address with short reads.
22
Third-Generation Sequencing Platforms (contd.)
• For structural variant analysis, the short-read libraries are computationally reconstructed to produce megabase-scale haplotype genome sequences using small amounts of input DNA.
• This technology:
• allows linked-read phasing of SVs that distinguishes true SVs from false predictions,
• has the potential to be applied to de novo genome assembly,
• can remap difficult regions of the genome.
27
Primary Analysis - Illumina
• The principle for signal detection relies on fluorescence.
• Therefore, base-calling is apparently much simpler, being made directly from the fluorescent signal intensity measurements resulting from the incorporated nucleotides during each cycle.
• Illumina's SBS technology delivers the highest percentage of error-free reads.
• The latest versions of Illumina's chemistry have been reoptimized to enable accurate base-calling in difficult genomic regions.
• The dNTPs have been chemically modified to contain a reversible blocking group that acts as a temporary terminator for DNA polymerization.
• After each dNTP incorporation, the image is processed to identify the corresponding base, and the blocking group is then enzymatically cleaved off to allow incorporation of the next one.
• A single flow cell often contains billions of DNA clusters tightly and randomly packed into a very small area.
• Such physical proximity can lead to crosstalk events between neighbouring DNA clusters.
28
Primary Analysis - Illumina (contd.)
• As fluorophores attached to each base produce light emissions, there can be some degree of interference between the nucleotide signals, which can overlap with the optimal emissions of the fluorophores of the surrounding clusters.
• Although the base-calling is simpler than in Ion Torrent, the image processing step is quite complex.
• The overall process requires aligning each image to the template of cluster positions on the flow cell, image extraction to assign an intensity value to each DNA cluster, followed by intensity correction.
• Besides this crosstalk correction, other problematic aspects occur during the sequencing process and influence the base-calling:
• phasing (failures in nucleotide incorporation),
• fading (or signal decay) and
• T accumulation (thymine fluorophores are not always efficiently removed after each iteration, causing a build-up of the signal along the sequencing run).
• Over many cycles, these errors accumulate and decrease the overall signal-to-noise ratio per cluster, causing a decrease in quality towards the ends of the reads.
• Some of the initial base-callers for the Illumina platform were Alta-Cyclic and Bustard.
• Currently, there are multiple other base-callers differing in the statistical and computational methodologies used to infer the correct base.
• Despite this variability, the most widely used base-caller is Bustard, and several base-calling algorithms were built using Bustard as the starting point.
29
Widely used base-caller software for the Illumina platform
(Table: base-caller tools for the Illumina platform and their input formats. INT – intermediate executable code for old platforms; CIF – cluster intensity files for recent platforms.)
30
Primary Analysis - Illumina (contd.)
• The Bustard algorithm converts fluorescence signals into actual sequence data.
• The intensities of the four channels for every cluster in each cycle are taken, which allows the determination of the concentration of each base.
• Bustard is based on a parametric model and applies a Markov model to determine a transition matrix modelling the probabilities of phasing (no new base synthesized), prephasing (two new bases synthesized) and normal incorporation (a toy sketch follows below).
• Bustard assumes a crosstalk matrix that is constant for a given sequencing run and that phasing affects all nucleotides in the same way.
• Aiming to improve performance and decrease error rates, several other base-callers have been developed.
• There is no evidence that a given base-caller is better than another.
• Comparison of the performance of different base-callers regarding alignment rates, error rate and running time shows that:
• AYB presents the lowest error rate, BlindCall is the fastest, while BayesCall has the best alignment rate.
• BayesCall, freeIbis, Ibis, naiveBayesCall, Softy FB and Softy SOVA did not show significant differences among each other, but all showed improvements in error rates compared to the standard Bustard.
• 3Dec is a recently developed base-caller for Illumina sequencing platforms, which claims to reduce base-calling errors by 44–69% compared to the previous ones.
31
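To make the phasing/prephasing idea concrete, the sketch below is a deliberately simplified toy model, not Bustard's actual implementation: the probabilities, the template sequence and the signal model are all invented for illustration. It builds a transition matrix over template positions and shows how, as cycles accumulate, the expected channel signal becomes a mixture of neighbouring bases.

import numpy as np

# Toy phasing/prephasing model (illustrative only, not Bustard's code).
p, q = 0.02, 0.01            # assumed phasing (lag) and prephasing (lead) probabilities
seq = "ACGTTGCAACGT"         # hypothetical template sequence
n = len(seq)

# Transition matrix over template positions: stay (p), advance one (1-p-q), advance two (q)
T = np.zeros((n, n))
for i in range(n):
    T[i, i] += p
    if i + 1 < n:
        T[i, i + 1] += 1 - p - q
    if i + 2 < n:
        T[i, i + 2] += q

dist = np.zeros(n)
dist[0] = 1.0                # all strands read position 0 at the first cycle
for cycle in range(n):
    # Expected channel signals are mixtures over the positions the strand population occupies
    signal = {b: float(sum(dist[j] for j in range(n) if seq[j] == b)) for b in "ACGT"}
    called = max(signal, key=signal.get)
    print(f"cycle {cycle + 1}: called {called}, true base {seq[cycle]}")
    dist = dist @ T          # the population dephases a little more each cycle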
Primary Analysis - Ion Torrent
• Performed in the Ion Torrent Suite Software.
• Starts with signal processing, in which the signal of nucleotide incorporation is detected by the sensor at the bottom of the chip well, converted to voltage and transferred from the sequencer to the server as raw voltage data, named a DAT file.
• For each nucleotide flow, one acquisition file is generated that contains the raw signal measurement in each chip well for the given nucleotide flow.
• During the analysis pipeline, these raw signal measurements are converted into incorporation measures, stored in a WELLS file.
• Base-calling is the final step of primary analysis and is performed by a base-caller module.
• Its objective is to determine the most likely base sequence, i.e., the best match for the incorporation signal stored in the WELLS file.
• The mathematical models behind this base-calling are complex and comprise three sub-steps:
• key-sequence-based normalization,
• iterative/adaptive normalization and
• phase correction.
32
Primary Analysis Workflow in Ion Torrent
• The signal emitted from nucleotide incorporation is inspected by the sensor, which converts the raw voltage data into a DAT file.
• The DAT file serves as input to the server, which converts it into a WELLS file.
• The WELLS file is used as input to the Ion Torrent base-caller module, which outputs a final BAM file, ready for the secondary analysis.
33
Primary Analysis - Ion Torrent (contd.)
• Such a procedure is required to address some of the errors occurring during the SBS process, namely phasing and signal droop (DR).
• Signal droop occurs when DNA templates attached to a bead become terminated and there is no further nucleotide incorporation.
• These errors occur quite frequently and thus, as an initial step, Ion Torrent performs phase-correction and signal-decay normalization algorithms.
• The three parameters involved in the signal generation are:
• carry-forward (CF, i.e., an incorrect nucleotide binding),
• incomplete extension (IE, e.g., the flown nucleotide did not attach to the correct position in the template) and
• droop (DR).
• CF and IE regulate the rate of out-of-phase polymerase build-up, while DR measures the DNA polymerase loss rate during a sequencing run.
• The chip area is divided into specific regions, and the wells are further divided into two groups (odd- and even-numbered), each of which receives its own, independent set of estimates.
• Then, some training wells are selected and used to find optimal CF, IE and DR values, using the Nelder–Mead optimization algorithm (sketched below).
• The Nelder–Mead algorithm uses a simplex, i.e., a generalized triangle in N dimensions, to search for an optimal solution in a multidimensional space.
34
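As an illustration of how such a simplex search can be applied, the following sketch fits CF/IE/DR-like parameters by minimizing a residual with SciPy's Nelder–Mead implementation. Everything except the optimizer call is hypothetical: the "observed" flow signals, the toy prediction model and the parameter treatment are invented and are not the Torrent Suite algorithm.

import numpy as np
from scipy.optimize import minimize

observed = np.array([1.02, 0.05, 0.97, 2.01, 0.08, 1.05])   # invented per-flow signals
ideal = np.array([1.0, 0.0, 1.0, 2.0, 0.0, 1.0])            # expected incorporations per flow

def predict(params):
    cf, ie, dr = params
    decay = (1.0 - dr) ** np.arange(len(ideal))    # droop: progressive signal loss
    blur = cf + ie                                  # crude stand-in for dephasing effects
    return ideal * decay * (1.0 - blur)

def residual(params):
    # Sum of squared differences between the toy model and the observed signals
    return float(np.sum((predict(params) - observed) ** 2))

result = minimize(residual, x0=[0.01, 0.01, 0.01], method="Nelder-Mead")
print("estimated CF, IE, DR:", result.x, "residual:", result.fun)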
Primary Analysis - Ion Torrent (contd.)
• The CF, IE and DR measurements, as well as the normalized signals, are used by the Solver, which follows a branch-and-bound algorithm.
• A branch-and-bound algorithm consists of a systematic listing of all partial sequences forming a rooted tree, with the set of candidate solutions placed as branches of this tree.
• The algorithm expands each branch by checking it against the optimal solution (the theoretical one) and continues until finding the solution closest to the optimal one.
• Before the first run of the Solver, key normalization occurs.
• Key normalization is based on the premise that the signal emitted during nucleotide incorporation is theoretically 1 and, for non-incorporation, the emitted signal is 0.
35
• Key normalization scales the measurements by a constant factor derived from the known key sequence, so that incorporation signals are brought close to the expected value of 1.
• The output of primary analysis is stored in FASTQ files: they contain the raw sequencing reads, their identifiers and the quality values, with higher numbers indicative of higher qualities.
• The quality of the raw sequence is critical for the overall
success of NGS analysis.
• Several bioinformatics tools were developed to evaluate the
quality of raw data.
• NGS QC toolkit
• QC-Chain
• FastQC
37
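The sketch below is a minimal, FastQC-style check written in plain Python: it computes the mean Phred quality per read position from a FASTQ file. The file name "reads.fastq.gz" is a placeholder and Phred+33 encoding is assumed; real QC tools report many more metrics.

import gzip

def per_position_quality(path):
    """Mean Phred quality per read position (Phred+33 assumed)."""
    totals, counts = [], []
    open_fn = gzip.open if path.endswith(".gz") else open
    with open_fn(path, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 != 3:                     # the quality string is the 4th line of each record
                continue
            for pos, ch in enumerate(line.rstrip("\n")):
                if pos == len(totals):
                    totals.append(0)
                    counts.append(0)
                totals[pos] += ord(ch) - 33    # Phred+33: quality = ASCII code - 33
                counts[pos] += 1
    return [t / c for t, c in zip(totals, counts)]

for pos, q in enumerate(per_position_quality("reads.fastq.gz"), start=1):
    print(f"position {pos}: mean Q = {q:.1f}")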
• FastQC is the most popular.
• As output, FastQC gives a report containing well-structured and graphically illustrated information about different aspects of the read quality.
• Based on this report, low-quality reads and bases can be removed, together with residual adapters' sequences.
• The trimming step, although it reduces the overall number and length of the reads, raises quality to acceptable levels.
• Several tools were developed to perform trimming with Illumina data.
• Btrim, leeHom, AdapterRemoval and Trimmomatic
• The choice of the tool is highly dependent on the dataset, the downstream analysis and the parameters used.
• In Ion Torrent, sequencing and data management are processed in the Torrent Suite Software, which has the Torrent Browser as a web interface.
• To further analyze the sequences, a demultiplexing process is required, which separates the sequencing reads into separate files according to the barcodes used for each sample.
• Most of the demultiplexing protocols are specific to NGS platform manufacturers.
38
Primary Analysis - Quality Control: Read Filtering and Trimming (contd.)
• Following demultiplexing is the adapter trimming step, whose function is to remove the remaining library adapter sequences from the ends of the reads.
• In most cases from the 3' end, but this can depend on the library preparation.
• This step is necessary, since residual adapter sequences in the reads may interfere with mapping and assembly.
• Trimming is important in RNA-Seq, SNP identification and genome assembly procedures to increase the quality and reliability of the analysis and to optimize the execution time and computational resources needed.
• Several tools are used to perform the trimming with Illumina data.
• There is no single best tool; instead, the choice depends on the downstream analysis and on user-decided parameter trade-offs.
• In Ion Torrent, this is also done in the Torrent Suite Software.
39
Secondary Analysis
• Secondary analysis in the NGS data analysis pipeline includes the alignment of reads against the reference human genome (typically hg19 or hg38) and variant calling.
• To map sequencing reads, two different alternatives can be followed:
• read alignment, that is the alignment of the sequenced fragments
against a reference genome, or
• de novo assembly that involves assembling a genome from scratch
without the help of external data.
• Choice between approaches relies on the existence or not of a
reference genome.
• For most NGS applications, in clinical genetics, mapping against a
reference sequence is the first choice.
• As for de novo assembly, it is still mostly confined to more specific projects, especially those targeting the correction of inaccuracies in the reference genome and the improved identification of SVs and other complex rearrangements.
40
Secondary Analysis (contd.)
• Sequence alignment is a classic problem addressed by bioinformatics.
• Sequencing reads from most NGS platforms are short, therefore to
sequence a genome, billions of DNA/RNA fragments are generated
that must be assembled, like a puzzle.
• This represents a great computational challenge, especially when
dealing with the existence of reads derived from repetitive elements.
• The algorithm must choose which repeat copy the read belongs to.
• In such a context, it is impossible to make high-confidence calls; the decision must be taken either by the user or by the software through a heuristic approach.
• Sequence alignment errors may emerge for multiple reasons.
• Errors in sequencing (caused by processes such as fading and signal droop).
• Discrepancies between the sequenced data and the reference genome also cause misalignment problems.
• A major difficulty is to establish a threshold between what is a real variation and what is a misalignment.
41
Secondary Analysis (contd.)
• The most widely accepted data input file format for an assembly is FASTQ, the typical output format produced by the various sequencing platforms.
• The outputs of read aligners are in binary alignment map (BAM) and sequence alignment map (SAM) formats.
• Both formats include basically the same information, namely, read sequence, base quality scores, location of alignments, differences relative to the reference sequence and mapping quality scores (MAPQ).
• The main distinction between them is that the SAM format is a text file, created to be easier to process with simple tools, while the BAM format provides a binary version of the same data.
• Alignments can be viewed using user-friendly and freely available software, such as the Integrative Genomics Viewer (IGV) or GenomeBrowse (http://goldenhelix.com/products/GenomeBrowse/index.html).
42
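A minimal sketch of programmatic access to these alignment fields, assuming the pysam library is installed and that a BAM file named sample.bam (a placeholder name) exists:

import pysam

# Count total, mapped and confidently mapped reads in a BAM file.
with pysam.AlignmentFile("sample.bam", "rb") as bam:
    total = mapped = high_mapq = 0
    for read in bam.fetch(until_eof=True):     # until_eof also works on unindexed files
        total += 1
        if not read.is_unmapped:
            mapped += 1
            if read.mapping_quality >= 30:     # MAPQ threshold
                high_mapq += 1
print(f"{total} reads, {mapped} mapped, {high_mapq} with MAPQ >= 30")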
Secondary Analysis - Sequence Alignment
• The preferential assembly method when the reference genome is known is the alignment against the reference genome.
• A mapping algorithm will try to find a location in the reference sequence that matches the read, tolerating a certain number of mismatches to allow subsequent variation detection.
• More than 60 tools for genome mapping have been developed and the number is increasing.
• As the NGS platforms are updated, more and more tools appear, in an ongoing evolutionary process.
• Commonly used methods to perform short-read alignments:
• Burrows–Wheeler Aligner (BWA) and Bowtie are mostly used for Illumina;
• For Ion Torrent, the Torrent Mapping Alignment Program (TMAP) is the most common alignment software, as it was specifically optimized for this platform.
43
Secondary Analysis - Sequence Alignment (contd.)
• BWA uses the Burrows–Wheeler transform algorithm.
• A data transformation algorithm that restructures data to make it more compressible, initially developed to prepare data for compression techniques such as bzip2 (a toy illustration follows below).
• BWA is a fast and efficient aligner, performing very well for both short and long reads.
• Bowtie (now Bowtie 2) has the advantage of being faster than BWA for some types of alignment, but it may compromise quality, with reduced sensitivity and accuracy.
• Bowtie may fail to align some reads with valid mappings when configured for maximum speed.
• It is usually applied to align reads derived from RNA sequencing experiments.
44
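As a toy illustration of the transform itself (not of how BWA indexes a genome), the sketch below computes the Burrows–Wheeler transform of a short string by sorting all of its cyclic rotations; repeated context tends to group identical characters together, which is what makes the result more compressible.

def bwt(text, terminator="$"):
    """Burrows-Wheeler transform via sorted cyclic rotations (naive, for illustration only)."""
    s = text + terminator                              # terminator marks the end of the string
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)

# Repetitive input yields runs of identical letters, which compress well
print(bwt("GATTACATTAGATTA"))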
Secondary Analysis - Sequence Alignment (contd.)
• The simulation and evaluation suite Seal can be run to compare the most widely used mapping tools, such as Bowtie, BWA, mrFAST/mrsFAST, Novoalign, SHRiMP and SOAPv2.
• It compares different parameters, including sequencing error, indels and coverage.
• There is no perfect tool, as every tool has advantages and disadvantages.
47
Secondary Analysis - De Novo Assembly
• The greedy algorithm starts by joining a read to another, overlapping one.
• This process is repeated until all assembly options for that fragment are exhausted, and it is repeated for all fragments.
• Each operation uses the next highest-scoring overlap to make the next join.
• For the scoring, the algorithm measures the number of matching bases in the overlap.
• This algorithm is suitable for a small number of reads and smaller genomes (a toy sketch follows below).
• In contrast, the OLC (overlap-layout-consensus) method is optimized for low-coverage long reads.
• The OLC begins by identifying the overlaps between pairs of reads and builds a graph of the relationships between those reads, which is then traversed to lay out the reads and derive a consensus sequence.
50
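A toy greedy merger, under the simplifying assumptions that reads are error-free and that overlaps are exact suffix–prefix matches (real assemblers also handle sequencing errors, reverse complements and repeats); the example reads are invented:

def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (at least min_len)."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0

def greedy_assemble(reads):
    contigs = list(reads)
    while len(contigs) > 1:
        best_score, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):            # pick the highest-scoring overlap
            for j, b in enumerate(contigs):
                if i != j and overlap(a, b) > best_score:
                    best_score, best_i, best_j = overlap(a, b), i, j
        if best_score == 0:
            break                                  # no joins left to make
        merged = contigs[best_i] + contigs[best_j][best_score:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)] + [merged]
    return contigs

print(greedy_assemble(["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG", "GCCGGAATAC"]))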
Secondary Analysis - Post-Alignment Processing (contd.)
• SAMtools, the Genome Analysis Toolkit (GATK) and Picard are some of the bioinformatic tools used to perform this post-alignment processing.
• Since variant calling algorithms assume that, in the case of fragmentation-based libraries, all reads are independent, removal of PCR duplicates and non-unique alignments (i.e., reads with more than one optimal alignment) is critical.
• This step can be performed using Picard tools (e.g., MarkDuplicates).
• If duplicates are not removed, a given fragment will be counted as different reads, increasing the number of incorrect variant calls and leading to an incorrect coverage and/or genotype assessment.
• Reads spanning INDELs require further processing.
• Given that each read is independently aligned to the reference genome, when an INDEL is part of the read there is a higher chance of alignment mismatches.
51
Secondary Analysis - Post-Alignment Processing (contd.)
• The realigner tool first determines suspicious intervals requiring realignment due to the presence of INDELs.
• Next, the realigner runs over those intervals, combining pieces of evidence to generate a consensus score supporting the presence of the INDEL.
• IndelRealigner from the GATK suite can be used to run this step.
• The confidence of the base call is given by the Phred-scaled quality score, which is generated by the sequencer and represents the raw quality score.
• This score may be influenced by multiple factors, such as the sequencing platform and the sequence composition, and may not reflect the true base-calling error rate.
• It is therefore necessary to recalibrate this score to improve variant calling accuracy.
• BaseRecalibrator from the GATK suite is one of the most used tools for this purpose.
52
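For reference, the Phred scale relates a quality score Q to a base-calling error probability P by Q = -10·log10(P); a minimal conversion sketch:

import math

def phred_to_error_prob(q):
    """Phred score -> probability that the base call is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Error probability -> Phred score."""
    return -10 * math.log10(p)

print(phred_to_error_prob(30))     # 0.001, i.e., one error in 1000 calls
print(error_prob_to_phred(0.01))   # 20.0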
Secondary Analysis - Variant
Calling
• The variant calling step has the main objective of
identifying variants using the post-processed BAM file.
• Several tools are available for variant calling; some identify variants based on the number of high-confidence base calls that disagree with the reference genome at the position of interest.
• Others use Bayesian, likelihood, machine learning or statistical methods that factor in parameters such as base and mapping quality scores to identify variants (a simple likelihood sketch follows below).
• Machine learning algorithms have evolved greatly in
recent years and will be critical to assist scientists and
clinicians to handle large amounts of data and to solve
complex biological challenges.
53
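As a deliberately naive illustration of a likelihood-based caller (it assumes a diploid site, independent reads and a single fixed error rate, which real callers such as GATK or FreeBayes refine considerably), the sketch below computes genotype log-likelihoods from reference/alternate read counts:

import math

def genotype_log_likelihoods(ref_count, alt_count, error_rate=0.01):
    """Log-likelihood of the read counts under the three diploid genotypes."""
    likelihoods = {}
    # P(observing an alt read) under each genotype: hom-ref ~ error, het ~ 0.5, hom-alt ~ 1 - error
    for genotype, p_alt in (("0/0", error_rate), ("0/1", 0.5), ("1/1", 1 - error_rate)):
        ll = alt_count * math.log(p_alt) + ref_count * math.log(1 - p_alt)
        likelihoods[genotype] = ll
    return likelihoods

lls = genotype_log_likelihoods(ref_count=12, alt_count=10)
best = max(lls, key=lls.get)
print(lls, "-> most likely genotype:", best)   # expected: heterozygous (0/1)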
Secondary Analysis - Variant
Calling (contd.)
• SAMtools, GATK and Freebayes are among the most
widely used toolkits for Illumina data.
• Ion Torrent has its own variant caller known as the
Torrent Variant Caller (TVC).
• Running as a plugin on the Ion Torrent server, TVC calls single-nucleotide variants (SNVs), multi-nucleotide variants (MNVs) and INDELs in a sample, across a reference or within a targeted subset of that reference.
• Several parameters can be customized, but often predefined configurations (germline vs. somatic, high vs. low stringency) can be used, depending on the type of experiment performed.
54
Secondary Analysis - Variant
Calling (contd.)
• Most of these tools use the SAM/BAM format as input and
generate a variant calling format (VCF) file as their output.
• VCF format is a standard format file, currently in version 4.2,
developed by the large sequencing projects such as the 1000
genomes project.
• VCF is basically a text file containing meta-information lines and a header line, followed by data lines, each containing information on the chromosomal position, the reference base and the identified alternative base or bases.
• The format also contains genotype information on the samples at each position.
• VCFtools provides the possibility to easily manipulate VCF files, e.g., merging multiple files or extracting SNPs from specific regions.
55
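A minimal sketch of how the fixed VCF columns can be read with the standard library alone ("variants.vcf" is a placeholder name; real pipelines typically use dedicated libraries such as pysam or VCFtools instead):

# Print chromosome, position, reference and alternate alleles from a VCF file.
with open("variants.vcf") as vcf:
    for line in vcf:
        if line.startswith("#"):
            continue                      # skip meta-information and header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alt = fields[:5]
        qual, filt, info = fields[5:8]
        print(chrom, pos, ref, "->", alt, "QUAL:", qual, "FILTER:", filt)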
Secondary Analysis –
Structural Variant Calling
• Genetic variations can occur in the human genome ranging from SNVs and INDELS to more
complex (submicroscopic) SVs.
• These SVs include both large insertions/duplications and deletions (also known as copy
number variants, CNVs) and large inversions and can have a great impact on health.
• Longer-read sequencers hold the promise to identify large structural variations and the
causative mutations in unsolved genetic diseases.
• Incorporating the calling of such SVs would increase the diagnostic yield of these NGS
approaches, overcoming some of the limitations present in other methods and with the
potential to eventually replace them.
• Reflecting this growing tendency, several bioinformatics tools have been developed to detect
CNVs from NGS data.
• Currently, five approaches are used to detect CNVs from NGS data, according to the type of algorithm/strategy used: paired-end mapping, split read, read depth, de novo genome assembly and combinatorial approaches (a read-depth sketch follows below).
56
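To illustrate only the read-depth strategy (the simplest of the five), the sketch below counts reads in fixed windows with pysam and reports the log2 ratio of sample versus control coverage. The file names, chromosome length, window size and threshold are placeholders, both BAMs are assumed to be coordinate-sorted and indexed, and real CNV callers add GC correction, segmentation and statistical testing.

import math
import pysam

WINDOW = 10_000   # illustrative window size

def window_counts(bam_path, chrom, chrom_len):
    """Read counts in consecutive fixed-size windows of one chromosome."""
    counts = []
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for start in range(0, chrom_len, WINDOW):
            counts.append(bam.count(chrom, start, min(start + WINDOW, chrom_len)))
    return counts

sample = window_counts("sample.bam", "chr1", 1_000_000)
control = window_counts("control.bam", "chr1", 1_000_000)

# Normalize by total counts, then report log2 ratios per window
s_total, c_total = sum(sample) + 1, sum(control) + 1
for i, (s, c) in enumerate(zip(sample, control)):
    log2r = math.log2(((s + 1) / s_total) / ((c + 1) / c_total))
    if abs(log2r) > 0.8:                     # crude threshold for a putative gain or loss
        print(f"window {i}: log2 ratio {log2r:.2f}")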
Main Methods for Calling Structural Variants (SVs) and Copy Number Variations (CNVs) from NGS Data
(Figure summarizing the approaches listed above.)
57
Secondary Analysis –
Structural Variant Calling (contd.)
• Detection of CNV mainly relies on whole-genome sequencing (WGS) data since
it includes non-coding regions which are known to encompass a significant
percentage of SVs.
• Whole-exome sequencing (WES) has emerged as a more cost-effective
alternative to WGS and the interest in detecting CNVs from WES data has
grown considerably.
• Since only a small fraction of the human genome is sequenced by WES, it is
not able to detect the complete spectrum of CNVs.
• The lower uniformity of WES as compared with WGS may reduce its sensitivity
to detect CNVs.
• Usually, WES generates higher depth for targeted regions as compared with
WGS.
• Most of the tools developed for CNV detection using WES data have depth-based calling algorithms implemented and require multiple samples or matched case-control samples as input.
• Ion Torrent also developed a proprietary algorithm, as part of the Ion Reporter software, for the detection of CNVs in NGS data derived from amplicon-based libraries.
61
Tertiary Analysis
• The third main step of the NGS analysis pipeline addresses the important issue of "making sense" of the data, i.e., data interpretation.
• In the human clinical genetics context, this means finding the fundamental link between variant data and the phenotype observed in a patient.
62
Tertiary Analysis - Variant Annotation
• Variant annotation is a key initial step for the analysis of sequencing variants.
• The output of variant calling is a VCF file.
• Each line in such a file contains high-level information about a variant, such as genomic position and reference and alternate bases, but no data about its biological consequences.
• Variant annotation offers such biological context for all the variants found.
63
Tertiary Analysis - Variant Annotation (contd.)
• One basic step in the annotation is to provide the variant's context.
• In which gene the variant is located, its position within the gene and the impact of the variation (missense, nonsense, synonymous, stop-loss, etc.).
• Tools such as Condel compute consequence scores for each variant based on various parameters:
• degree of conservation of amino acid residues,
• sequence homology.
• ANNOVAR is one of the most widely used annotation tools, for instance for the identification of rare variants causing Mendelian diseases.
Tertiary Analysis - Variant Annotation (contd.)
• Like ANNOVAR, VEP from Ensembl (EMBL-EBI) can provide genomic annotation for numerous species.
• VEP has a user-friendly interface through a dedicated web-based genome browser, although it can also be accessed programmatically via a standalone Perl script or a REST API.
• A wide range of input file formats is supported, and it can annotate SNPs, indels, CNVs or SVs.
• VEP searches the Ensembl Core database, determines where in the genomic structure the variant falls and, depending on that, gives a consequence prediction.
• snpEff is another widely used annotation tool, standalone or integrated with other tools commonly used in sequencing data analysis pipelines, such as Galaxy, GATK and GKNO.
• In contrast with VEP and ANNOVAR, snpEff does not annotate CNVs but has the capability to annotate non-coding regions.
• snpEff can perform annotation for multiple variants, being faster than VEP.
65
Tertiary Analysis - Variant Annotation (contd.)
• Variant annotation may seem like a simple and straightforward process.
• It can be very complex, considering the intricacy of the genetic organization.
• In theory, the exonic regions of the genome are transcribed into RNA, which in turn is translated into a protein.
• One gene would originate only one transcript and ultimately a single protein.
• Such a concept (the one gene-one enzyme hypothesis) is completely outdated, as the genetic organization and its machinery are much more complex.
• Due to a process known as alternative splicing, several transcripts, and thus different proteins, can be produced from the same gene.
• Alternative splicing is the major mechanism for the enrichment of transcriptome and proteome diversity.
66
Tertiary Analysis - Variant Annotation (contd.)
• Depending on the transcript choice, the biological information and implications concerning the variant can be very different.
• Blurriness concerning annotation tools is caused by the existence of a diversity of databases and reference genome datasets, which are not always concordant.
• This can make the classification of variants differ even though the same transcript was used.
• There are significant differences between VEP and ANNOVAR annotations of the same transcript.
Tertiary Analysis - Variant Filtering, Prioritization and Visualization
• To make clinical sense of so many variants and to identify the disease-causing variant(s), some filtering strategies are required.
• Although quality control was performed in previous steps, several false-positive variants will still be present.
• When starting the third level of NGS analysis, it is highly recommended to reduce the number of false-positive calls and variant call errors, based on quality parameters or previous knowledge of artifacts.
• Parameters such as the total number of independent reads, the percentage of reads showing the variant and the homopolymer length (particularly for Ion Torrent, with stretches longer than five bases being suspicious) are examples of filters that can be applied.
• The user should define the thresholds based on the observed data and the research question; regarding the first parameter, variants supported by fewer than 10 independent reads are usually rejected, since they are likely due to sequencing bias or low coverage.
69
Tertiary Analysis - Variant Filtering, Prioritization and Visualization (contd.)
• One commonly used NGS filter is the population frequency filter.
• Minor allele frequency (MAF), one of the metrics used to filter based on allele frequency, can sort variants into three groups (a filtering sketch follows below):
• rare variants (MAF < 0.5%, usually selected when studying Mendelian diseases),
• low-frequency variants (MAF between 0.5% and 5%) and
• common variants (MAF > 5%).
• Population databases that include data from thousands of individuals from several populations represent a powerful source of variant information about the global patterns of human genetic variation.
• They help not only to better identify disease alleles but are also important to understand population origins, migrations, relationships, admixtures and changes in population size, which can be useful to understand some disease patterns.
70
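A toy MAF filter over annotated variants; the data structure and the field name "maf" are invented stand-ins for whatever population-frequency annotation (e.g., from gnomAD) a real pipeline would carry:

# Classify and filter variants by minor allele frequency (expressed as fractions, not percentages).
variants = [
    {"id": "var1", "maf": 0.0001},   # invented example values
    {"id": "var2", "maf": 0.02},
    {"id": "var3", "maf": 0.31},
]

def maf_class(maf):
    if maf < 0.005:
        return "rare"            # < 0.5%
    if maf <= 0.05:
        return "low-frequency"   # 0.5% - 5%
    return "common"              # > 5%

# Keep only rare variants, as is usual when studying Mendelian diseases
rare = [v for v in variants if maf_class(v["maf"]) == "rare"]
for v in variants:
    print(v["id"], maf_class(v["maf"]))
print("retained for Mendelian analysis:", [v["id"] for v in rare])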
Tertiary Analysis - Variant Filtering, Prioritization and Visualization (contd.)
• As carriers of recessive disorders do not show any signs of the disease, the frequency of damaging alleles in population variant databases can be higher than the established threshold.
71
Tertiary Analysis - Variant Filtering, Prioritization and Visualization (contd.)
• For a recognizable inheritance pattern, it is advisable to perform family inheritance-based model filtering (a genotype-pattern sketch follows below).
• This is especially useful if more than one patient from such families is available for study, as it greatly reduces the number of variants to be thoroughly analyzed.
• For instance, for diseases with an autosomal dominant (AD) inheritance pattern, the ideal situation would be testing at least three patients, each from a different generation, and selecting only the heterozygous variants located in the autosomes.
• If a pedigree indicates a likely X-linked disease, variants located on the X chromosome are selected and those on other chromosomes are not primarily inspected.
• As for autosomal recessive (AR) diseases with more than one affected sibling, it is important to study as many patients as possible and to select homozygous variants in patients that were found in heterozygosity in both parents, or genes with two heterozygous variants with distinct parental origins.
72
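The sketch below shows two such genotype-pattern filters over a trio (proband, mother, father). The variant records, field names and genotype encoding (alternate-allele counts 0, 1 or 2) are invented for illustration; real pipelines derive these from the annotated VCF.

# Toy inheritance-model filters over a trio.
variants = [
    {"id": "varA", "proband": 2, "mother": 1, "father": 1},   # fits autosomal recessive
    {"id": "varB", "proband": 1, "mother": 0, "father": 0},   # fits de novo
    {"id": "varC", "proband": 1, "mother": 1, "father": 0},   # inherited heterozygous
]

def fits_recessive(v):
    """Homozygous alternate in the affected child, heterozygous in both parents."""
    return v["proband"] == 2 and v["mother"] == 1 and v["father"] == 1

def fits_de_novo(v):
    """Heterozygous in the child, absent from both parents."""
    return v["proband"] == 1 and v["mother"] == 0 and v["father"] == 0

for v in variants:
    labels = [name for name, test in (("AR", fits_recessive), ("de novo", fits_de_novo)) if test(v)]
    print(v["id"], labels or ["filtered out"])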
Tertiary Analysis - Variant Filtering, Prioritization and Visualization (contd.)
• For sporadic cases (and cases in which the disease pattern is not known), trio analysis can be extremely useful to reduce the analytical burden.
• In such a context, heterozygous variants found only in the patient and not present in either parent would indicate a de novo origin.
• Even in non-related cases with very homogeneous phenotypes, such as typically syndromic ones, it is possible to use an overlap-based strategy, assuming that the same gene or even the same variant is shared among all the patients.
• An additional filter, useful when many variants persist after applying the others, is based on the predicted impact of the variants (functional filter).
• In some pipelines, intronic or synonymous variants are not analyzed, based on the assumption that they are likely to be benign (non-disease associated).
• Care should be taken, since numerous intronic and apparently synonymous variants have been implicated in human diseases.
• A functional filter is applied in which the variants are prioritized based on their genomic location (exonic or splice sites).
• Additional information for filtering missense variants includes evolutionary conservation and the predicted effect on protein structure, function or interactions.
• To enable such filtering, the scores generated by algorithms that evaluate missense variants (for instance PolyPhen-2, SIFT and CADD) are annotated in the VCF.
• The same applies to variants that might have an effect on splicing, as prediction algorithms are being incorporated into VCF annotation, such as Human Splicing Finder in VarAFT.
• More examples are given in the table next.
73
Some Software Tools to Perform NGS Functional Filtering

Software | Description
SIFT (Sorting Intolerant from Tolerant) | Predicts, based on sequence homology, whether an AA substitution will affect protein function and potentially alter the phenotype. Scores less than 0.05 indicate a variant as deleterious.
PolyPhen-2 (Polymorphism Phenotyping v2) | Predicts the functional impact of an AA replacement from its individual features using a naive Bayes classifier. Includes two tools: HumDiv (designed to be applied in complex phenotypes) and HumVar (designed for the diagnostics of Mendelian diseases). Higher scores (>0.85) predict, more confidently, damaging variants.
CADD (Combined Annotation Dependent Depletion) | Integrates diverse genome annotations and scores all human SNVs and indels. It prioritizes functional, deleterious and disease-causal variants according to functional categories and effect sizes.
Human Splicing Finder | Predicts the effects of mutations on splicing signals or identifies splicing motifs in any human sequence.
nsSNPAnalyzer | Extracts structural and evolutionary information from a query nsSNP and uses a machine learning method (Random Forest) to predict its phenotypic effect. Classifies the variant as neutral or disease-associated.
TopoSNP (Topographic mapping of SNP) | Analyzes an SNP based on its geometric location and conservation information; produces an interactive visualization of disease-associated and non-disease-associated SNPs.
Condel (Consensus Deleteriousness) | Integrates the output of different methods to predict the impact of nsSNPs on protein function. The algorithm, based on the weighted average of the normalized scores, classifies the variants as neutral or deleterious.
ANNOVAR (Annotate Variation) | Annotates the variants based on several parameters, such as identification of whether SNPs or CNVs affect the protein (gene-based), identification of variants in specific genomic regions outside protein-coding regions (region-based) and identification of known variants documented in public and licensed databases (filter-based).
VEP (Variant Effect Predictor) | Determines the effect of multiple variants (SNPs, insertions, deletions, CNVs or structural variants) on genes, transcripts and protein sequences, as well as regulatory regions.
Tertiary Analysis - Variant Filtering, Prioritization and Visualization (contd.)
• Although functional annotation adds an important layer of information for filtering, the fundamental question to be answered, especially in the context of gene discovery, is whether a specific variant or mutated gene is indeed the disease-causing one.
• To address this complex question, a new generation of tools is being developed that, instead of merely excluding information, perform variant and gene prioritization.
• For instance, PHIVE explores the similarity between human disease phenotypes and those derived from knockout experiments in animal models.
• Other tools work in a different way, through the computation of a deleteriousness score (also known as a burden score) for each gene, based on how intolerant genes are to normal variation and using data from population variation databases.
• Human disease genes are much more intolerant to variants than non-disease associated genes.
• The Human Phenotype Ontology (HPO) enables hierarchical sorting by disease names and clinical features (symptoms) for describing medical conditions.
77
Tertiary Analysis - Variant Filtering, Prioritization and Visualization (contd.)
• HPO can also provide an association between symptoms and known disease genes.
• Several tools attempt to use these phenotype descriptions to generate a ranking of potential candidates in variant prioritization.
• As an example, some attempt to simplify analysis in a clinical context, such as the phenotypic interpretation of exomes that only reports genes previously associated with genetic diseases.
• Others can also be used to identify novel genes, such as Phevor, which uses data gathered in other related ontologies, for example the gene ontology (GO), to suggest novel gene-disease associations.
• The main goal of these tools is to end up with few variants for further validation with molecular techniques.
• VarElect is another example of such a prioritization tool.
• Besides those tools that aid in interpretation and variant analysis, clinicians currently have at their disposal several medical genetics companies, such as Invitae (https://www.invitae.com/en/) and CENTOGENE (https://www.centogene.com/), that provide a precise medical diagnosis.
78
NGS Pitfalls
• Seventeen years have passed since the introduction of the first commercially available NGS platform, the 454 GS FLX from 454 Life Sciences.
• Since then, the "genomics" field has greatly expanded our knowledge about structural and functional genomics and the underlying genetics of many diseases.
• Besides, it allowed the creation of the "omics" concepts (transcriptomics, genomics, metabolomics, etc.), which provide new insights into the knowledge of all living beings: how different organisms use genetics and molecular biology to survive and reproduce in health and disease, their population networks and their responses to changes in environmental conditions.
• This information is also very useful to understand human health.
• NGS brought a panoply of benefits and solutions for medicine and for other areas, such as agriculture, where it has helped to increase quality and productivity.
• However, it has also brought new challenges.
79
NGS Pitfalls (contd.)
• First challenge is regarding the sequencing costs.
• Although the overall costs of NGS are coming down, an NGS experiment is not cheap and is still not accessible to all laboratories.
• Imposes high initial costs with the acquisition of the sequencing
machine, plus consumables and reagents.
• Costs with experimental design, sample collection and sequencing
library preparation also must be considered.
• Often, the costs of developing sequencing pipelines and of the bioinformatics tools that improve those pipelines and perform the downstream sequence analysis, as well as the costs of data management, informatics equipment and downstream data analysis, are not considered in the overall NGS costs.
• A typical BAM file from a single WES experiment consumes up to 30 Gb of space; storing and analyzing data from several patients thus requires greater computational power and storage space, which clearly adds significant costs.
• Expert bioinformaticians are needed to deal with data analysis.
• These additional costs are evidently part of NGS workflow and
must be accounted for.
80
NGS Pitfalls (contd.)
• Concerns about data sharing and confidentiality arise with the
massive amount of data that is generated with NGS and analysis.
• It is debatable what degree of protection should be applied to genomic data.
• Should genomic data be or not be shared between multiple
parties (including laboratory staff, bioinformaticians,
researchers, clinicians, patients and their family members)?
• When analyzing NGS data, it is important to be aware of its
technical limitations, namely, PCR amplification bias (a significant
source of bias due to random errors that can be introduced), and
sequencing errors.
• High coverage is needed to understand which variants are
true and which are caused by sequencing or PCR errors.
• Limitations also exist in the downstream analysis of read alignment/mapping, especially for indels, which some alignment tools detect poorly or not at all.
• Though bioinformatics tools have helped and made data analysis more automatic, a manual inspection of variants in the BAM file is frequently needed.
• Thus, it is critical to understand the limitations of the NGS platform and workflow in order to overcome them and increase the quality of variant detection.
81
NGS Pitfalls (contd.)
• Another major challenge to clinicians and researchers is to
correlate the findings with the relevant medical
information.
• This may not be a simple task, especially when dealing with new variants or new genes not previously associated with disease.
• Requires additional efforts to validate the pathogenicity of
variants (which in a clinical setting may not be feasible).
• More importantly, both clinicians and patients must be
clearly conscious that a positive result, although providing
an answer that often terminates a long and expensive
diagnostic journey, does not necessarily mean that a better
treatment will be offered nor that it will be possible to find
a cure.
• In many cases, genetic information may not alter the prognosis or the outcome for an affected individual.
• This is an inconvenient and hard truth that clinicians should
clearly explain to patients.
• Nevertheless, huge efforts have been made to improve the choice of the best therapeutic options based on DNA sequencing results, for cancer and a growing number of rare diseases.
82
Concluding Remarks
• Despite all the accomplishments made so far, a long journey is ahead before
genetics can provide a definitive answer towards the diagnoses of all
genetic diseases.
• Further improvements in sequencing platforms and data handling strategies
are required to reduce error rates and to increase variant detection quality.
• To increase our understanding about the disease, especially the complex
and heterogeneous diseases, scientists and clinicians must combine
information from multiple -omics sources (such as genomics,
transcriptomics, proteomics and epigenomics).
• NGS is rapidly evolving beyond the classic genomic approach and is gaining broad acceptance.
• Major challenge continues to be dealing with and interpreting all the
distinct layers of information.
• Current computational methods may not be able to handle and extract the
full potential of large genomic and epigenomic datasets being generated.
• Bioinformaticians, scientists and clinicians will have to work together to
interpret the data and to develop novel tools for integrated systems level
analysis.
• Machine learning algorithms, as well as the emerging developments in artificial intelligence, will be decisive to improve NGS platforms and software.
• This will help scientists and clinicians to solve complex biological challenges, thus improving clinical diagnostics and opening new avenues for the development of novel therapies.
83
Thank You.
84