Next Generation Sequencing Data Analysis
Introduction
The era of next generation sequencing (NGS) began following the first reports of this innovative technique in 2005 (Margulies
et al., 2005; Shendure and Ji, 2008). High-throughput NGS technology, which relies on parallel amplification and sequencing,
yields shorter read lengths with average raw error rates of 1%–1.5% (Shendure and Ji, 2008), compared with conventional
Sanger sequencing protocols, which can generate reads of up to 1000 bp at 99.999% per-base accuracy. Although automation of
traditional dideoxy (Sanger) DNA sequencing improved its efficiency, NGS technology was regarded as superior in terms of cost and
time. An early method called massively parallel signature sequencing (MPSS), introduced by Lynx Therapeutics, set the stage for
high-throughput sequencing (Brenner et al., 2000). The first NGS machine, the GS20, made available to researchers in 2005 by
454 Life Sciences, is based on large-scale parallel pyrosequencing on microbeads in micro-droplets of a water-in-oil emulsion
(Henson et al., 2012).
The major NGS technology platforms on the market for whole-genome sequencing come primarily from Illumina, Roche 454,
SOLiD and Ion Torrent. Each platform has its advantages, disadvantages and cost implications in terms of reliability, time and
money (Table 1), and each offers features that may be attractive for specific purposes. For instance, the Ion Torrent is often
positioned as a general-purpose sequencer as well as for diagnostic protocols because of its quicker turnaround time (Tarabeux
et al., 2014). The longer reads offered by the Illumina and Roche technologies are desirable, but the cost of the Roche 454 FLX is
very steep, rendering it impractical for large-scale genome projects. Reads from the Pacific Biosciences machine, PacBio, are
generally not used for direct sequencing in large genome projects but can be useful for resolving repetitive and ambiguous
regions because of the platform's capability to generate very long read lengths (Table 1).
The processes common to most NGS platforms are library preparation, library amplification, and sequencing (Fig. 1). The
starting material for library preparation can be either RNA or DNA (genomic or PCR-amplified). RNA has to be reverse-transcribed
into cDNA because, at present, NGS machines sequence only DNA directly. Since the target library molecules sequenced on each
NGS platform are required to be of specific lengths, genomic DNA requires fragmentation and size selection, which is performed
by sonication, nebulization, or enzymatic techniques followed by gel electrophoresis and excision. For instance, the Illumina NGS
platform's standard fragment size is in the range of 300–550 bp including adapters. Generally, libraries are built by adding NGS
platform-specific DNA adapters to the DNA molecules. These adapters facilitate the binding of the library fragments to a surface
such as a microbead (454, Ion PGM, SOLiD) or a glass slide (Illumina, SOLiD). However, depending on the specific NGS platform,
the library construction step has to be customised to fit the sequencing protocol. Generally, DNA library construction depends
directly on the intended application and can be divided mainly into fragment libraries and mate-paired libraries. In fragment
libraries, target genomic sequences are fragmented to smaller sizes, typically up to five times the NGS platform's read length
capability. Subsequently, sequencing adapters are attached to these fragments, allowing the NGS platform to sequence from the
adaptor tags. Fragment libraries are typically either single-end or paired-end; with adaptor tags at both the forward and reverse
sites, NGS platforms are able to sequence from both ends in paired-end libraries. Fragment libraries are mainly applied to variant
calling, copy number detection and genome reconstruction, while mate-paired libraries, which link reads separated by larger
genomic distances, are mainly used for scaffolding and structural variant detection.
NGS analysis utilizes bioinformatics approaches to convert signals from the machine into meaningful information: raw signals
are converted to data, the data to annotations or catalogued information, and these in turn to actionable knowledge. NGS
bioinformatics analysis is primarily divided into three distinct phases: primary, secondary and tertiary analyses (Fig. 2). In the
primary analyses, raw data from the sequencers are converted into nucleotide bases and short-read data. Secondary analyses apply
detailed bioinformatics methodology specific to the NGS technique that was employed, which may involve read alignment or read
assembly. Secondary analyses usually have the most complex workflows and are typically run in sequence as a pipeline. Moreover,
depending on the type of NGS technique employed, the analysis pipeline may differ greatly. For example, RNA-seq data, which
characterize transcriptomes, differ in their secondary analysis approach from ChIP-seq data, which investigate genome-wide
epigenetic mechanisms. Lastly, in the tertiary analyses, the previously obtained results can be associated and understood in a
biological context. Tertiary bioinformatics analyses can be an iterative process that involves rigorous statistical and
computational biology methods.
Sequence Generation
The initial primary analysis is usually transient, taking place as the sequencing machine's detectors receive the signals from the
high-throughput reactions. The base-calling and recording process is therefore tightly integrated with the sequencing instrument,
and quality scores corresponding to the short-read nucleotide sequences are output in parallel. The primary analysis software is
installed by machine vendors on the workstation supporting the sequencing instrument, and it can also be run on high-
performance cluster systems for faster results. Besides converting raw signals to base calls, some software tools include demulti-
plexing of multiple samples that were pooled and indexed in a single run (Dodt et al., 2012).
Fig. 1 Schematic diagram of the process involved in common NGS platforms. Adapted from Knief, C., 2014. Analysis of plant microbe interactions
in the era of next generation sequencing technologies. Front. Plant Sci. 5, 216. Available at: https://doi.org/10.3389/fpls.2014.00216.
As described earlier, NGS platforms suffer from higher error rates than Sanger sequencing (Nakamura et al., 2011;
Shendure and Ji, 2008). As part of primary analysis, however, different approaches and algorithms have been developed to
detect and compensate for these errors (Margulies et al., 2005). Moreover, bases with an error probability of less than 0.1% can be
selected algorithmically. As a simple approach, error rates can be decreased by performing the DNA sequencing at high coverage,
of at least 20–60-fold, depending on the sequencing project's goal (Luo et al., 2012; Margulies et al., 2005; Voelkerding et al.,
2009). Notably, a variant observed in a sequencing read may represent a distinct genotype but could in fact be the result of a
sequencing error. It is therefore very important to use established methods to differentiate these two causes of variation, as failing
to do so may lead to inaccurate results.
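As a back-of-the-envelope illustration of the coverage figures quoted above, average fold-coverage can be estimated from the read count, read length and genome size. The sketch below is only illustrative; the function name and the example values (150 bp reads, a 3 Gb genome) are assumptions, not figures taken from the text.

```python
# Rough estimate of sequencing depth (fold-coverage): a minimal sketch.
# Coverage = (number of reads x read length) / genome size.
def estimated_coverage(n_reads, read_length, genome_size):
    """Return the average fold-coverage for a sequencing run."""
    return (n_reads * read_length) / genome_size

# Example: 600 million 150 bp reads over a 3 Gb genome give ~30-fold coverage.
print(estimated_coverage(600_000_000, 150, 3_000_000_000))  # 30.0
```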
Fig. 2 Workflow of NGS data analysis in three phases: primary, secondary and tertiary. The tertiary phase of Comparison and Discovery is
indicated as an iterative process.
To improve the quality of the data after base-calling, Phred-based filtering algorithms can be used to remove low-quality
sequencing reads (Margulies et al., 2005). These filters discard reads with low-quality, uncalled, or ambiguous bases, besides
clipping the lower-quality 3′-ends of reads. All such filters use the quality information contained in the FASTQ file, which is
computed by the NGS platform for each base during the base-calling procedure. Previous studies (Minoche et al., 2011) have
examined the effect of different filtering approaches on Illumina data and suggest that filtering can reduce error rates to less than
0.2% by eliminating around 15%–20% of the low-quality bases, mostly via trimming of the error-prone 3′-ends. Another study
supported these findings, observing a five-fold decrease in error rate after applying a filter (Phred score of Q30, corresponding to
a 0.1% likelihood of a false base call) that eliminated reads with low-quality bases (Nguyen et al., 2011). It may be useful to note
that low-quality bases are sometimes localised in specific regions of a genome, and that removal of whole reads may introduce
bias into quantitative studies (Minoche et al., 2011; Nakamura et al., 2011). A read-clipping strategy can therefore be used to
remove erroneous bases from the left or right edges of the reads, rather than filtering out whole reads, in order to address errors
that are usually present at the read edges alone.
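To make the Phred-based filtering and 3′-end clipping described above concrete, here is a minimal sketch assuming Illumina-style Phred+33 quality encoding; the thresholds (Q20 for trimming, a mean of Q30 and a minimum length of 30 bp for filtering) are illustrative choices rather than values prescribed by the studies cited.

```python
# Minimal sketch of Phred-based read filtering and 3'-end clipping.
# Assumes FASTQ qualities in Phred+33 encoding; thresholds are illustrative.

def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string into per-base Phred scores."""
    return [ord(ch) - offset for ch in quality_string]

def clip_3prime(seq, quals, min_q=20):
    """Clip low-quality bases from the 3' end of a read."""
    end = len(quals)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]

def passes_filter(quals, min_mean_q=30, min_length=30):
    """Keep reads whose length and mean quality exceed the thresholds."""
    return len(quals) >= min_length and sum(quals) / len(quals) >= min_mean_q

# Example record: 35 high-quality bases (Q40) followed by 5 poor bases (Q2)
seq = "ACGT" * 10
qual = "I" * 35 + "#" * 5
clipped_seq, clipped_quals = clip_3prime(seq, phred_scores(qual))
print(len(clipped_seq), passes_filter(clipped_quals))  # 35 True
```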
Apart from read clipping and filtering methods, several error correction tools (e.g., Coral, HiTEC, Musket, Quake, RACER,
Reptile, or SHREC) can be used as a complementary strategy to reduce sequencing error rates in reads (Knief, 2014). Generally,
these error correction methods make use of high sequencing coverage to identify and correct errors probabilistically. Moreover,
these algorithms often consider the quality scores of the examined bases as well as the quality values of neighbouring bases. For
instance, some of these tools are able to correct substitution errors in Illumina sequencing data (Ilie and Molnar, 2013; Liu et al.,
2013; Yang et al., 2010), while others (Coral, HSHREC, KEC, and ET) include indel correction algorithms and are available for the
analysis of Roche 454 and Ion Torrent data (Salmela, 2010; Salmela and Schröder, 2011; Skums et al., 2012). Error correction is
regarded as a very useful strategy in de novo genome sequencing, resequencing and amplicon sequencing projects, with benefits
ranging from finding more optimal assemblies in the de Bruijn graph (DBG) to reducing the overall memory footprint of the
assembly stage (Skums et al., 2012; Yang et al., 2010).
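As a toy illustration of the k-mer spectrum idea that underlies several of the correctors listed above (not a re-implementation of any particular tool), the sketch below counts every k-mer in a high-coverage read set and flags rare k-mers, which are likely to contain sequencing errors; the k value and trusted-count cutoff are arbitrary example settings.

```python
# Toy k-mer spectrum error detection: rare k-mers in high-coverage data
# are likely to contain sequencing errors.
from collections import Counter

def kmer_spectrum(reads, k=15):
    """Count every k-mer occurring in the read set."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def flag_error_kmers(counts, min_count=3):
    """Flag k-mers below the trusted-count cutoff as likely errors."""
    return {kmer for kmer, c in counts.items() if c < min_count}

# 30 identical reads plus one read carrying a single substitution
reads = ["ACGTACGTACGTACGTACG"] * 30 + ["ACGTACGTACGAACGTACG"]
suspect = flag_error_kmers(kmer_spectrum(reads, k=15))
print(len(suspect), "suspect k-mers")  # 5 suspect k-mers
```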
Genome Assembly
After pre-processing, the sequence reads can be assembled into contigs. Assembly of reads generated by NGS technology involves
reducing redundant data by placing overlapping reads contiguously, adjacent to each other, in an optimal way (Miller et al., 2010).
When contigs (assembled sets of reads) rather than reads undergo the same process, with the aid of long-range linking
information, it is known as scaffolding. In other words, assembly is a process of reconstructing the target by grouping reads into
contigs and contigs into scaffolds.
Generally, the size and accuracy of the contigs and scaffolds are important statistics in genome assemblies (Miller et al., 2010). The
quality of genome assemblies is usually described by the maximum length, average length, combined total length, and N50. The contig
N50 is the length of the smallest contig in the smallest set of the largest contigs whose combined length represents at least 50% of
the assembly (Miller et al., 2010). Generally, a larger N50 value implies a higher-quality, less fragmented genome assembly.
Typical high-coverage genome projects have N50 values in the megabase range; however, N50 depends on genome size and is
therefore not a good measure for comparing unrelated assemblies, only assemblies of the same genome. Assembly accuracy is
difficult to quantify. Nevertheless, where reference genomes exist, mapping the assembled contigs/scaffolds to them is a useful way
to examine assembly quality.
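The N50 definition above translates directly into code. The following is a short sketch: contig lengths are sorted from largest to smallest, and the length at which the running total first reaches half of the assembly size is reported; the contig lengths in the example are invented for illustration.

```python
# Minimal sketch of the N50 statistic described above.
def n50(contig_lengths):
    """Compute the N50 of a list of contig/scaffold lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0

# Example: total assembly size 225 kb; the running sum first passes
# 112.5 kb within the 50 kb contig, so N50 = 50,000.
print(n50([100_000, 50_000, 30_000, 20_000, 15_000, 10_000]))
```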
As outlined earlier, an assembly is an ordered data construction that maps the sequencing data onto a supposed reconstruction of
the target (He et al., 2013). Contigs are sequences reconstructed from the alignment of reads, giving rise to a consensus sequence.
A scaffold is a higher-order organisation of sequences that defines the contig order and orientation and the sizes of the gaps
between contigs. Scaffolds therefore represent more contiguous sequences, mimicking the physical composition of the genome.
Scaffold sequences may contain N's in the gaps between contigs, and the number of consecutive N's may indicate the gap length
estimated during the assembly process from bridging mate-pair reads (Miller et al., 2010).
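As a small illustration of how such gap estimates can be read back out of a scaffold sequence, the sketch below reports the position and length of every run of N's; the example scaffold string is fabricated.

```python
# Report gap estimates encoded as runs of N's in a scaffold sequence.
import re

def gap_lengths(scaffold_sequence):
    """Return (start, length) for every run of N's in a scaffold."""
    return [(m.start(), len(m.group()))
            for m in re.finditer(r"N+", scaffold_sequence.upper())]

scaffold = "ACGTACGT" + "N" * 120 + "GGCCGGCC" + "N" * 45 + "TTAA"
print(gap_lengths(scaffold))  # [(8, 120), (136, 45)]
```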
There are many well-established software packages for assembling sequencing reads into contigs/scaffolds. In general, these genome
assemblers can be grouped into three categories based on their approach (Miller et al., 2010): (1) the Overlap/Layout/Consensus
(OLC) approaches depend on an overlap graph; (2) the de Bruijn Graph (DBG) approaches use some form of k-mer graph; and
(3) the greedy graph algorithms can use either OLC or DBG (Table 2).
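To make the k-mer graph idea behind DBG assemblers concrete, here is a toy sketch that builds only the core graph structure: each read is decomposed into k-mers, and an edge links each k-mer's (k-1)-mer prefix to its (k-1)-mer suffix. Real assemblers add error handling, graph simplification and scaffolding on top of this; the reads and k value below are illustrative.

```python
# Toy de Bruijn graph (DBG) construction from short reads.
from collections import defaultdict

def de_bruijn_graph(reads, k=4):
    """Build a (k-1)-mer graph: node -> list of successor nodes."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

reads = ["ACGTAC", "CGTACG", "GTACGT"]
for node, successors in de_bruijn_graph(reads).items():
    print(node, "->", successors)
```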
Many factors need to be considered when choosing the most appropriate genome assembler, especially for large whole-genome
sequencing projects. Among these factors are the choice of algorithm, compatibility with the NGS platform, support for the
assembly of large genomes, and parallel-computing support for speeding up the assembly (Zerbino and Birney, 2008). Generally,
the choice of algorithm and software will directly determine the memory requirements and speed of assembly. In general, DBG
assemblers are faster but require larger amounts of memory than OLC assemblers.
Read Mapping
Whenever a reference genome is available, reads are mapped or aligned to the reference genome prior to subsequent analysis
steps, instead of following a de novo assembly strategy. The goal of mapping is to place the vast number of reads back at the
respective regions they most likely originated from. Mapping reads to the reference genome typically involves the alignment of
millions of short reads to the genome using fast algorithms. These algorithms are able to operate in parallel while taking into
account mutations such as polymorphisms, insertions and deletions in order to produce the alignment. In well-known aligners
such as BLAST, an individual query sequence is searched against a reference database using hash tables and seed-and-extend
approaches. With NGS data, similar methods are often adapted to scale to the alignment of millions of short query sequences
against a single large reference genome. Advances in mapping algorithms using various other techniques have improved alignment
speed while reducing memory and space requirements. Examples of well-known mapping software used for NGS data include
SOAP2 (Short Oligonucleotide Alignment Program), BWA (Burrows-Wheeler Aligner), NovoAlign and Bowtie 2.
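As an illustration of the hash-table, seed-and-extend strategy mentioned above (a deliberately simplified stand-in for what production aligners such as BWA or Bowtie 2 do with far more sophisticated index structures), the sketch below indexes reference k-mers, seeds with the first k-mer of a read, and verifies each candidate position by counting mismatches; the sequences and parameters are invented.

```python
# Simplified hash-table seed-and-extend read mapping.
from collections import defaultdict

def index_reference(reference, k=11):
    """Hash every k-mer of the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def map_read(read, reference, index, k=11, max_mismatches=2):
    """Seed with the first k-mer of the read, then count mismatches."""
    hits = []
    for pos in index.get(read[:k], []):
        candidate = reference[pos:pos + len(read)]
        if len(candidate) == len(read):
            mismatches = sum(a != b for a, b in zip(read, candidate))
            if mismatches <= max_mismatches:
                hits.append((pos, mismatches))
    return hits

reference = "TTACGGATACCGTTAGCAATCGGTACCATTGA"
index = index_reference(reference)
print(map_read("ACCGTTAGCAATCGGA", reference, index))  # [(8, 1)]
```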
The widely used format for storing read-to-genome mapping information is the SAM (Sequence Alignment/Map) format, or its
compressed binary form, BAM. While the BAM file is smaller and optimized for machine reading, the SAM file is human readable,
albeit slower for computer operations. There are 11 mandatory fields in the SAM format specification. Commonly, the SAMtools
software is used to read and manipulate both BAM and SAM formats.
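The 11 mandatory, tab-separated fields of a SAM alignment record can be pulled apart with a few lines of code; in the sketch below the field names follow the SAM specification, while the example record itself is fabricated.

```python
# The 11 mandatory, tab-separated fields of a SAM alignment record.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

def parse_sam_record(line):
    """Split a SAM alignment line into its 11 mandatory fields."""
    values = line.rstrip("\n").split("\t")
    return dict(zip(SAM_FIELDS, values[:11]))

# Fabricated example record for a properly paired 100 bp read
record = ("read_001\t99\tchr1\t10468\t60\t100M\t=\t10600\t232\t"
          + "A" * 100 + "\t" + "I" * 100)
print(parse_sam_record(record)["CIGAR"])  # 100M
```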
Typically, mapping of reads to the reference genome is followed by collection of mapping statistics. The summary statistic of main
interest is the percentage of aligned reads, or mapping rate. The mapping rate of reads to the reference genome is often only
60%–75%. Besides limitations due to the intrinsic properties of NGS data and of the technique used to generate them, the inability
to map reads to the reference genome can be ascribed to challenging regions of the genome, such as repeat-rich regions, that
aligners cannot place reads in unambiguously. Moreover, the short read lengths of most high-throughput NGS technologies limit
each alignment to a small region, restricting coverage to the more tractable areas of the genome. Limitations such as NGS
sequencing error, algorithmic robustness, mutational load and natural variation also contribute to low mapping rates.
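A mapping rate of the kind discussed above can be computed from a BAM file, for example with the pysam library (assumed here to be installed); the file name sample.bam is a placeholder, and secondary and supplementary alignments are skipped so that each read is counted once.

```python
# Minimal sketch: mapping rate from a BAM file using pysam.
import pysam

def mapping_rate(bam_path):
    """Fraction of primary reads that aligned to the reference."""
    total = mapped = 0
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(until_eof=True):
            if read.is_secondary or read.is_supplementary:
                continue  # count each read only once
            total += 1
            if not read.is_unmapped:
                mapped += 1
    return mapped / total if total else 0.0

print(f"Mapping rate: {mapping_rate('sample.bam'):.1%}")  # placeholder file name
```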
The mapping file generated can be further inspected in depth, region by region, using visualization tools such as genome viewers,
which plot pileups (the stacked alignments of the reads). Visualization of mapped reads can be important, for instance, in
diagnosing read alignment problems in certain regions, detecting duplicates, and visualizing variation. Commonly used genome
browsers that can read SAM/BAM files include the Integrative Genomics Viewer (IGV) and Tablet, while web-based browsers that
achieve similar visualizations include JBrowse, NGB and the UCSC genome browser.
Comparison and Discovery
The tertiary process of analysing NGS data can be quite diverse, depending on the scenario and context of a study. Generally, the
reads representing the underlying annotations characterize the functional aspects of the study, and the corresponding statistics are
usually descriptive. In other cases, where a comparison is made against a reference or a control, rigorous statistical tests are
employed that take into account the read counts in the target regions of each treatment group. Applying statistical models to
identify bias, accounting for covariates and testing for significant differences are common steps in comparative analysis. The
outcome of such an analysis is often a collection of target annotations or genes. Such a gene list can subsequently be analysed for
enrichment of gene ontology (GO) terms to infer the collective molecular function, biological process and cellular
compartmentalization. With such a gene list, the affected pathways can also be mapped to gain a better biological understanding
of the process. In other derivative NGS techniques such as ChIP-seq or Hi-C, tertiary analysis may additionally involve deriving
profiles that occur commonly in the interactions being observed; thus, new motifs, structural interactions and regulatory signals
generated under a particular condition can be characterized. NGS techniques applied to a population can reveal meaningful
interactions of evolutionary forces, besides characterizing differences in genetic composition, either in terms of polymorphisms
when studying a single species or in terms of taxonomy when studying metagenomics.
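As an example of the kind of statistical test used for GO-term enrichment of such a gene list, the sketch below applies a hypergeometric over-representation test using scipy (assumed to be available); all of the counts in the example are made-up numbers.

```python
# Hypergeometric over-representation test for a GO term in a gene list.
from scipy.stats import hypergeom

def go_enrichment_pvalue(n_genes_total, n_genes_in_term,
                         n_genes_selected, n_selected_in_term):
    """P(observing at least this many term members in the gene list)."""
    return hypergeom.sf(n_selected_in_term - 1, n_genes_total,
                        n_genes_in_term, n_genes_selected)

# Example: 20,000 annotated genes, 300 of them in the GO term of interest,
# 500 genes in our differentially expressed list, 25 of which carry the term.
print(go_enrichment_pvalue(20_000, 300, 500, 25))
```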
Conclusion
Approaches to NGS data analysis are diverse and dependent on the technology and methods being employed. Nevertheless,
among the multiple stages involved in NGS data analysis, the steps of primary and secondary analysis are generally a prerequisite
in all NGS projects. Primary and secondary analyses must therefore be performed carefully in order to prevent errors being carried
over into tertiary analysis. In tertiary analysis, insights can be generated by including annotation, network, and interaction
information from external databases to expand on the gene lists and profiles found in earlier steps. Because these steps are
required so frequently, independent bioinformatics labs create analysis pipelines for their in-house routine analyses. However, as
NGS technology grows and improves, the methods for analysis may require further evaluation and integration with the latest
technologies.
References
Brenner, S., Johnson, M., Bridgham, J., et al., 2000. Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nat. Biotechnol. 18,
630–634. Available at: https://doi.org/10.1038/76469.
Dodt, M., Roehr, J.T., Ahmed, R., Dieterich, C., 2012. FLEXBAR – Flexible barcode and adapter processing for next-generation sequencing platforms. Biology 1, 895–905.
Available at: https://doi.org/10.3390/biology1030895.
Henson, J., Tischler, G., Ning, Z., 2012. Next-generation sequencing and large genome assemblies. Pharmacogenomics 13, 901–915. Available at: https://doi.org/10.2217/
pgs.12.72.
He, Y., Zhang, Z., Peng, X., Wu, F., Wang, J., 2013. De novo assembly methods for next generation sequencing data. Tsinghua Sci. Technol. 18, 500–514. Available at:
https://doi.org/10.1109/TST.2013.6616523.
Ilie, L., Molnar, M., 2013. RACER: Rapid and accurate correction of errors in reads. Bioinformatics 29, 2490–2493. Available at: https://doi.org/10.1093/bioinformatics/btt407.
Knief, C., 2014. Analysis of plant microbe interactions in the era of next generation sequencing technologies. Front. Plant Sci. 5, 216. Available at: https://doi.org/10.3389/
fpls.2014.00216.
Liu, Y., Schröder, J., Schmidt, B., 2013. Musket: A multistage k-mer spectrum-based error corrector for Illumina sequence data. Bioinformatics 29. Available at: https://doi.org/10.1093/
bioinformatics/bts690.
Luo, C., Tsementzi, D., Kyrpides, N., Read, T., Konstantinidis, K.T., 2012. Direct comparisons of Illumina vs. Roche 454 sequencing technologies on the same microbial
community DNA sample. PLOS ONE 7, e30087. Available at: https://doi.org/10.1371/journal.pone.0030087.
Margulies, E.H., Maduro, V.V.B., Thomas, P.J., et al., 2005. Comparative sequencing provides insights about the structure and conservation of marsupial and monotreme
genomes. Proc. Natl. Acad. Sci. USA 102, 3354–3359. Available at: https://doi.org/10.1073/pnas.0408539102.
Merriman, B., Rothberg, J.M., Ion Torrent R&D Team, 2012. Progress in ion torrent semiconductor chip based sequencing. Electrophoresis 33, 3397–3417. Available at:
https://doi.org/10.1002/elps.201200424.
Miller, J.R., Koren, S., Sutton, G., 2010. Assembly algorithms for next-generation sequencing data. Genomics 95, 315–327. Available at: https://doi.org/10.1016/j.
ygeno.2010.03.001.
Minoche, A.E., Dohm, J.C., Himmelbauer, H., 2011. Evaluation of genomic high-throughput sequencing data generated on Illumina HiSeq and genome analyzer systems.
Genome Biol. 12, R112. Available at: https://doi.org/10.1186/gb-2011-12-11-r112.
Nakamura, K., Oshima, T., Morimoto, T., et al., 2011. Sequence-specific error profile of Illumina sequencers. Nucleic Acids Res. 39, e90. Available at: https://doi.org/10.1093/
nar/gkr344.
Nguyen, P., Ma, J., Pei, D., et al., 2011. Identification of errors introduced during high throughput sequencing of the T cell receptor repertoire. BMC Genom. 12, 106. Available
at: https://doi.org/10.1186/1471-2164-12-106.
Ronaghi, M., Uhlén, M., Nyrén, P., 1998. A sequencing method based on real-time pyrophosphate. Science 281, 363–365. Available at: https://doi.org/10.1126/
science.281.5375.363.
Salmela, L., 2010. Correction of sequencing errors in a mixed set of reads. Bioinformatics 26, 1284–1290.
Salmela, L., Schröder, J., 2011. Correcting errors in short reads by multiple alignments. Bioinformatics 27, 1455–1461.
Shendure, J., Ji, H., 2008. Next-generation DNA sequencing. Nat. Biotechnol. 26, 1135–1145. Available at: https://doi.org/10.1038/nbt1486.
Skums, P., Dimitrova, Z., Campo, D.S., et al., 2012. Efficient error correction for next-generation sequencing of viral amplicons. BMC Bioinformatics 13, S6.
Tarabeux, J., Zeitouni, B., Moncoutier, V., et al., 2014. Streamlined ion torrent PGM-based diagnostics: BRCA1 and BRCA2 genes as a model. Eur. J. Hum. Genet. 22,
535–541. Available at: https://doi.org/10.1038/ejhg.2013.181.
Voelkerding, K.V., Dames, S.A., Durtschi, J.D., 2009. Next-generation sequencing: From basic research to diagnostics. Clin. Chem. 55, 641–658. Available at: https://doi.org/
10.1373/clinchem.2008.112789.
Yang, X., Dorman, K.S., Aluru, S., 2010. Reptile: representative tiling for short read error correction. Bioinformatics 26, 2526–2533.
Zerbino, D.R., Birney, E., 2008. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18, 821–829. Available at: https://doi.org/10.1101/
gr.074492.107.