
A Review of Deep Learning Applications in Human Genomics Using Next-Generation Sequencing Data


Alharbi and Rashid Human Genomics (2022) 16:26

https://doi.org/10.1186/s40246-022-00396-x

REVIEW Open Access

A review of deep learning applications in human genomics using next-generation sequencing data
Wardah S. Alharbi and Mamoon Rashid*   

Abstract
Genomics is advancing towards data-driven science. With the advent of high-throughput data-generating
technologies in human genomics, we are overwhelmed by the volume of genomic data. Artificial intelligence,
especially deep learning methods, has been instrumental in extracting knowledge and patterns from these data. In
the current review, we address the development and application of deep learning methods/models in different
subareas of human genomics. We assess which areas of genomics are over- and under-charted by deep learning
techniques. The deep learning algorithms underlying the genomic tools are discussed briefly in the later part of
this review. Finally, we briefly discuss the latest applications of deep learning tools in genomics. In conclusion,
this review is timely for biotechnology and genomics scientists, guiding them on why, when and how to use deep
learning methods to analyse human genomic data.
Keywords: Human genomics, Deep learning applications, Disease variants, Gene expression, Epigenomics,
Pharmacogenomics, Variant calling, NGS

*Correspondence: rashidmamoon@gmail.com; rashidma@ngha.med.sa
Department of AI and Bioinformatics, King Abdullah International Medical Research Center (KAIMRC), King Saud
Bin Abdulaziz University for Health Sciences (KSAU-HS), King Abdulaziz Medical City, Ministry of National Guard
Health Affairs, P.O. Box 22490, Riyadh 11426, Saudi Arabia

© The Author(s) 2022. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which
permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the
original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or
other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line
to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory
regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this
licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a
credit line to the data.

Introduction
Understanding the genomes of diverse species, specifically, the examination of more than 3 billion base-pairs of
Homo sapiens DNA, is a crucial aim of genomic studies. Genomics takes a comprehensive view that implicates all
the genes within an organism, including protein-coding genes, RNA genes, cis- and trans-elements, etc. It is a
data-driven science involving the high-throughput technological development of next-generation sequencing
(NGS) that generates the entire DNA data of an organism. These techniques include whole genome sequencing
(WGS), whole exome sequencing (WES), transcriptomic and proteomic profiling [1–5]. With the recent rapid
accumulation of these omics data, increased attention has been paid to bioinformatics and machine learning
(ML) tools with established superior performance in several genomics implementations [6]. These
implementations involve finding a genotype–phenotype correlation, biomarker identification and gene function
prediction, as well as mapping the biomedically active genomic regions, for example, transcriptional enhancers
[7–10].
Machine learning (ML) has been regarded as a core technology in artificial intelligence (AI), which enables
algorithms to make critical predictions by learning from data rather than simply following instructions. It has
broad technology applications; however, standard ML methods are too narrow to deal with complex, natural,
highly dimensional raw data, such as those of genomics. Alternatively, the deep learning (DL) approach is a
promising and exciting field currently employed in genomics. It is an ML derivative that extracts features by
applying neural networks (NN)

automatically [11–14]. Deep learning has been effectively applied in fields such as image recognition, audio
classification, natural language processing, online web tools, chatbots and robotics. In this regard, DL as a
genomic methodology is well suited to analysing large amounts of data. While it is still in its infancy, DL in
genomics holds the promise of advancing areas such as clinical genetics and functional genomics [15].
Undoubtedly, DL algorithms have dominated computational modelling approaches and are now regularly extended
to address a variety of genomics questions, ranging from understanding the effects of mutations on protein–RNA
binding [16], prioritising variants and genes, diagnosing patients with rare genetic disorders [17] and
predicting gene expression levels from histone modification data [18] to identifying trait-associated
single-nucleotide polymorphisms (SNPs) [19].

Fig. 1 Timeline of implementing deep learning algorithms in genomics. This timeline plot demonstrates the delay in
implementing DL tools in genomics; for example, both the LSTM and BLSTM algorithms were invented in 1997, and the
first genomic application was implemented in 2015. Similar observations hold for the rest of the deep learning
algorithms (Table 6)

Although the first concept of DL theory, which originated in the 1980s, was based on the perceptron model and
the neuron concept [20], within the last decade, DL algorithms have become a state-of-the-art predictive
technology for big data [21–23]. The initial efficient implementation of DL prediction models in genomics was in
the 2000s (Fig. 1) [24]. The difficulty of training DL models on enormous training datasets and the need for
powerful computing resources limited their applications until the introduction of modern hardware, such as
high-efficiency graphics processing units (GPUs) with equivalent structures. Now, the architectures of DL models
(also known as

DNNs) are implemented in diverse areas, as mentioned earlier. Classical neural networks consist of only two to
three hidden layers; however, DL networks extend this up to 200 layers. Thus, the word “deep” reflects the number
of layers that the information passes through. However, DL requires superior hardware and substantial
parallelism to be applicable [25]. To overcome hardware limitations and demanding resource requirements, several
DL packages and resources were introduced to facilitate DL model implementation (discussed in the section “Deep
learning resources for genomics”).
The evolution of software, hardware (GPUs) and big data in genomics has facilitated the development of deep
learning-based prediction models for the prediction of functional elements in genomes. Using NGS data, these
models predict splice sites in genomic DNA, predict transcription factor binding sites (TFBSs) via
classification tasks, classify the pathogenicity of missense mutations and predict drug response and synergy
[26–31]. An example of a technological evolution that has enhanced DL implementation is cloud platforms, which
provide GPU resources as a DL solution. GPUs can considerably increase the training speed, as neural network
training can be more adaptable to certain model architectures, thus permitting fast mathematical operations
through the use of larger numbers of processing units and high memory capacities. Primary examples of cloud
computing platforms include Amazon Web Services, Google Compute Engine and Microsoft Azure. However, these
solutions still require users to implement model codes [32].
For all ML models, evaluation metrics are essential for understanding model performance. These metrics are
particularly important for genomic datasets, which naturally contain highly imbalanced classes that make them
challenging for ML and DL models. Several solutions are usually applied in this case, such as transfer learning
[33] and the Matthews correlation coefficient (MCC) [34]. In general, every ML task can be divided into a
regression task (e.g. predicting certain outcomes/effects of a disease) or a classification task (e.g.
predicting the presence/absence of a disease); additionally, multiple measurement metrics are obtained from
those tasks. Generally, some, but not all, performance metrics used in ML regression-based methods include: mean
absolute error (MAE), mean squared error (MSE), root-mean-squared error (RMSE) and coefficient of determination
(R²). In contrast, the performance metrics in ML classification-based methods include: accuracy, confusion
matrix, area under the curve (AUC) or/and area under receiver operating characteristics (AUROC) and F1-score.
Classification tasks are most commonly applied to problems in research areas in genomics and for comparing the
performance of different models. For example, AUC is the most widely used metric for evaluating model
performance and ranges over [0, 1]. It summarises the trade-off between the true-positive rate (TPR, or
sensitivity) and the false-positive rate (FPR), where FPR equals one minus the true-negative rate (TNR, or
specificity). Additionally, the F1-score is used to test model accuracy on highly imbalanced datasets and is the
harmonic mean of precision and recall (also ranging over [0, 1]). For both AUC and F1-score, a greater value
reflects better model performance. Also, the confusion matrix describes the complete model performance; model
accuracy is calculated from it by summing the true-positive and true-negative values and dividing the sum by the
total number of samples [35, 36]. For a greater understanding of the ML evaluation metrics (purpose,
calculation, etc.), recommended papers include Handelman et al. (2019) and England and Cheng (2019).
This article reviews deep learning tools/methods based on their current applications in human genomics. We
began by collecting recent (i.e. published in 2015–2020) DL tools in five main genomics areas: variant calling
and annotation, disease variants, gene expression and regulation, epigenomics and pharmacogenomics. Then, we
briefly discussed DL genomics-based algorithms and their application strategies and data structure. Finally, we
mentioned DL-based practical resources to facilitate DL adoption that would be extremely beneficial mostly to
biomedical researchers and scientists working in human genomics. For further information on the field of DL
applications in genomics, we recommend: [37–39].

Deep learning tools/software/pipelines in genomics
Multiple genomic disciplines (e.g. variant calling and annotation, disease variant prediction, gene expression
and regulation, epigenomics and pharmacogenomics) take advantage of generating high-throughput data and
utilising the power of deep learning algorithms for sophisticated predictions (Fig. 2). The modern evolution of
DNA/RNA sequencing technologies and machine learning algorithms, especially deep learning, opens a new chapter
of research capable of transforming big biological data into new knowledge or novel findings in all subareas of
genomics. The following sections will discuss the latest software/tools/pipelines developed using deep learning
algorithms in various genomics areas.

Variant calling and annotation
This first section discusses the applications of the latest DL algorithms in variant calling and annotation. We
provided a short list of tools/algorithms for variant calling and annotation with their source code links, if
available

(Table 1), to facilitate the selection of the most suitable DL tool for a particular data type.

Fig. 2 Deep learning applications in genomics. This figure represents the application of deep learning tools in five
major subareas of genomics. One example deep learning tool and its underlying network architecture are shown for
each genomic subarea, with its input data type and predictive output mentioned briefly. Each bar plot depicts the
frequency of the most used deep learning algorithms underlying deep learning tools in that subarea of genomics
(Tables 1, 2, 3, 4, 5)

Table 1 Genomic tools/algorithms based on deep learning architecture for variant calling and annotation
Tools | DL model | Application | Input/Output | Website/Code Source | References
Clairvoyante | CNN | To predict variant type, zygosity, alternative allele and Indel length | BAM/VCF | https://github.com/aquaskyline/Clairvoyante | [145]
DeepVariant | CNN | To call genetic variants from next-generation DNA sequencing data | BAM, CRAM/VCF | https://github.com/google/deepvariant | [30]
GARFIELD-NGS | DNN + MLP | To classify true and false variants from WES data | VCF/VCF | https://github.com/gedoardo83/GARFIELD-NGS | [146]
Intelli-NGS | ANN | To define good and bad variant calls from Ion Torrent sequencer data | VCF/xlsx | https://github.com/aditya-88/intelli-ngs | [147]
DAVI (Deep Alignment and Variant Identification) | CNN + RNN | To identify variants in NGS reads | FASTQ/VCF | N/A | [116]
DeepSV | CNN | To call genomic deletions by visualising sequence reads | BAM/VCF | https://github.com/CSuperlei/DeepSV | [52]

NGS, including whole genome or exome, sets the stage for early developments in personalised medicine, along with
its known implications in Mendelian disease research. With the advent of massively parallel, high-throughput
sequencing, sequencing thousands of human genomes to identify genetic variations has become a

routine practice in genomics, including cancer research. Sophisticated bioinformatics and statistical frameworks
are available for variant calling.
A weakness of high-throughput sequencing procedures is their significantly high technical and bioinformatics
error rates [40–42]. Numerous computational problems have originated due to the enormous amounts of medium or low
coverage genome sequences, short read fragments and genetic variations among individuals [43]. Such weaknesses
make NGS data dependent on bioinformatics tools for data interpretation. For instance, several variant calling
tools are broadly used in clinical genomic variant analyses, such as the genome analysis toolkit (GATK) [44],
SAMtools [45], Freebayes [46] and Torrent Variant Caller (TVC; [47]). However, despite the availability of whole
genome sequencing, some actual variants are yet to be discovered [48].
Contemporary deep learning tools have been proposed in the field of next-generation sequencing to overcome the
limitations of conventional interpretation pipelines. For example, Kumaran et al. demonstrated that combining
DeepVariant, a deep learning-based variant caller, with conventional variant callers (e.g. SAMtools and GATK)
improved the accuracy scores of single-nucleotide variant and Indel detection [49]. Implementing deep learning
algorithms in DNA sequencing data interpretation is in its infancy, as seen with the recent pioneering example,
DeepVariant, developed by Google. DeepVariant relies on the graphical dissimilarities in input images to perform
the classification task for genetic variant calling from NGS short reads. It treats the mapped sequencing
datasets as images and converts the variant calls into image classification tasks [30]. However, this model does
not provide details about the variant information, for example, the exact alternative allele and type of
variant. As such, it is classified as an incomplete variant caller model [50].
Later, several DL models for variant calling and annotation were introduced. For instance, Cai et al. (2019)
introduced DeepSV, a genetic variant caller that aims to predict long genomic deletions (> 50 bp) extracted from
sequencing read images but not other types of structural variants, such as long insertions or inversions. It
processes BAM or VCF files as inputs and outputs the results in VCF form. To evaluate DeepSV, it was compared
with another eight deletion calling tools and one machine learning-based tool called Concod [51]. The results
reveal that although Concod has shorter training times when fewer training samples are used, DeepSV shows a
higher accuracy score and lower training loss on the same dataset [52]. Another genomic variant filtering tool,
GARFIELD-NGS, can be applied directly to the variant caller outputs. It relies on an MLP algorithm to
distinguish true and false variants in exome sequencing datasets generated from the Ion Torrent and Illumina
platforms. It shows robust performance on low-coverage data (up to 30X), taking a standard VCF file as input and
producing another VCF file. Ravasio et al. (2018) observed that the GARFIELD-NGS model recorded a significant
reduction in false candidate variants after applying a canonical pipeline for the variant prioritisation of
disease-related data [53].
The Clairvoyante model was introduced to predict variant type (SNP or Indel), zygosity, alternative allele and
Indel length. Thus, it overcomes the DeepVariant model’s drawback of lacking the full variant details, including
the precise alternative allele and variant type. The Clairvoyante model was specifically designed to utilise
long-read sequencing data generated from SMS technologies (e.g. PacBio and ONT), although it is commonly
applicable to short read datasets as well [50]. Another variant caller and annotation model, Intelli-NGS, was
introduced by Singh and Bhatia (2019). Its variant calling is based on an artificial neural network (ANN) and
uses data generated from the Ion Torrent platform to identify true and false variants effectively. Intelli-NGS
takes any number of VCF files as batch inputs and processes them in order. The processed data result in an Excel
sheet for each VCF file containing the HGVS codes of all variants [54]. All in all, several studies confirmed the
capabilities of deep learning in genetic variant calling and annotation from sequencing data.

Disease variants
Deep learning-based models for the prediction of pathogenic variants, their application and input/output formats
with source codes (if available) are listed in Table 2.
Considering extra data from patient relatives or relevant cohorts, medical geneticists frequently prioritise and
filter the observed genetic variants after variant calling and annotation (Müller et al. [55]). Variant
prioritisation is a method of determining, within genetic screening, the most likely pathogenic variant that
damages gene function and underlies the disease phenotype [56]. Variant prioritisation involves variant
annotation to discover clinically insignificant variants, such as synonymous, deep-intronic variants and benign
polymorphisms. Subsequently, the remaining variants, such as known variants or variants of unknown clinical
significance (VUSs), become accessible for evaluation [57]. Furthermore, difficulties in interpreting rare
genetic variants in individuals and in understanding their impact on disorder risk affect the clinical
capability of diagnostic sequencing. For example, the numerous and infrequent VUSs in rare genetic diseases
represent a challenging obstacle in sequencing implementation for personalised

medicine and healthy population assessment (Sundaram et al., 2018). Although statistical methods, such as GWAS,
have had huge success in linking genetic variants to disorders, they still require heavy sampling to distinguish
rare genetic variants and cannot deliver information about de novo variants (Fu et al., 2014). Thus, current
annotation approaches, such as PolyPhen [58], SIFT [59] and GERP [60], represent beneficial methods for
prioritising the causative variants, despite facing some drawbacks. For such problems, DL-based models have been
implemented to provide a powerful method that exploits the deep neural network (DNN) architecture to prioritise
variants. One instance is the Basset model, a variant annotator that relies on a CNN algorithm and is designed to
predict the causative SNP using DNase I hypersensitivity sequencing data as input (Kelley, Snoek and Rinn, 2016).

Table 2 Genomic tools/algorithms based on deep learning architecture for disease variants
Tools | DL model | Application | Input/Output | Website/Code Source | References
DeepPVP (PhenomeNet Variant Predictor) | ANN | To identify the variants in both whole exome or whole genome sequence data | VCF/VCF | https://github.com/bio-ontology-research-group/phenomenet-vp | [61]
ExPecto | CNN | To accurately predict tissue-specific transcriptional effects of mutations/functional SNPs | VCF/CSV | https://github.com/FunctionLab/ExPecto | [138]
PEDIA (Prioritisation of exome data by image analysis) | CNN | To prioritise variants and genes for diagnosis of patients with rare genetic disorders | VCF/CSV | https://github.com/PEDIA-Charite/PEDIA-workflow | [148]
DeepMILO (Deep learning for Modeling Insulator Loops) | CNN + RNN | To predict the impact of non-coding sequence variants on 3D chromatin structure | FASTA/TSV | https://github.com/khuranalab/DeepMILO | [119]
DeepWAS | CNN | To identify disease or trait-associated SNPs | TSV/TSV | https://github.com/cellmapslab/DeepWAS | [19]
PrimateAI | CNN | To classify the pathogenicity of missense mutations | CSV/CSV + txt | https://github.com/Illumina/PrimateAI | [27]
DeepGestalt | CNN | To identify facial phenotypes of genetic disorders | Image/txt | Available through the Face2Gene application, http://face2gene.com | [149]
DeepMiRGene | RNN, LSTM | To predict miRNA precursor | FASTA/Cross-Validation (CV)-Splits file | https://github.com/eleventh83/deepMiRGene | [150]
Basset | CNN | To predict the causative SNP with sets of related variants | BED, FASTA/VCF | https://github.com/davek44/Basset | [151]

Clinical and molecular validation cannot be replaced by in silico prediction models; however, in a sense, these
models can help to decrease waiting times for results and can prioritise variants for further functional
analysis. These predictive models are mainly suitable when several poorly understood candidate variants convey
certain phenotypes [27]. Medical genetics has been significantly transformed following the introduction of NGS
technology, particularly with WGS because of its power to interpret genomic variations in both coding and
non-coding fragments within the entire human genome. Recently, several ML-based methods have been proposed to
prioritise non-coding variants; still, the recognition of disease-associated variants in complex traits, such as
cancers, is challenging. In addition, the majority of positive variants associated with a certain phenotype is
required to predict general and precise novel correlations (Schubach et al., 2017). Lately, several DL
approaches have been proposed to overcome these challenges. For example, the DeepWAS model relies on a CNN
algorithm that allows regulatory impact prediction of each variant on numerous cell-type-specific chromatin
features. The key result of the DeepWAS model is the direct determination of the disease-associated SNPs with a
common effect on a certain chromatin trait in the related tissue. The DeepWAS model demonstrated the ability to
detect the disease-relevant, transcriptionally active genomic position after combining the expression and
methylation quantitative-trait loci data (eQTL and meQTL, respectively) of various resources and tissues [19].
Furthermore, several deep learning algorithms have been described as discovering novel genes. For this reason,
deep learning approaches are particularly suited for variant investigation for genes not yet related to specific
disease phenotypes [61, 62].
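
To make the idea of sequence-based variant impact prediction more concrete, the following minimal Python sketch illustrates a pattern commonly used with CNN models of this kind: the reference and alternative sequences around a SNP are one-hot encoded, both are scored, and the difference is taken as the predicted effect. This is an illustration under stated assumptions, not the DeepWAS or Basset implementation. The function names, the single hand-made filter and the example sequence are all hypothetical, and `toy_model` merely stands in for a trained network.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (length, 4) array; unknown bases (e.g. N) stay all zeros."""
    encoded = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        j = BASES.find(base)
        if j >= 0:
            encoded[i, j] = 1.0
    return encoded

def apply_snp(seq, pos, alt):
    """Return a copy of seq with the base at 0-based position pos replaced by alt."""
    return seq[:pos] + alt + seq[pos + 1:]

def toy_model(x, filt):
    """Hypothetical stand-in for a trained CNN: one convolutional filter,
    max-pooled over all positions and squashed to (0, 1) with a sigmoid."""
    width = filt.shape[0]
    scores = [np.sum(x[i:i + width] * filt) for i in range(x.shape[0] - width + 1)]
    return 1.0 / (1.0 + np.exp(-max(scores)))

def variant_effect(seq, pos, alt, filt):
    """Model output for the alternative allele minus the output for the reference allele."""
    ref_score = toy_model(one_hot(seq), filt)
    alt_score = toy_model(one_hot(apply_snp(seq, pos, alt)), filt)
    return alt_score - ref_score

# Assumed example: a filter that rewards the motif GATA (an arbitrary choice).
motif_filter = one_hot("GATA") * 2.0 - 0.5
reference = "CCTTGATAGGCC"
effect = variant_effect(reference, 6, "C", motif_filter)  # changes GATA to GACA
print(f"predicted effect of the SNP: {effect:+.3f}")
```

A negative value means the toy model scores the alternative allele lower than the reference. In a real workflow, `toy_model` would be replaced by a trained network producing one score per chromatin feature, and the per-feature differences would then be used to rank candidate variants.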

Gene expression and regulation
In this section, we focused on the most efficient deep learning-based tools in the area of gene expression and
regulation in the genome. We listed several models applying various deep learning algorithms and summarised the
information and source codes, if available, mostly in splicing and gene expression applications (Table 3).
Gene expression ranges from the initial transcriptional regulation steps (e.g. pre-mRNA splicing, transcription
and polyadenylation) to functional protein production [63]. The high-throughput screening technologies that test
thousands of synthetic sequences have provided rich knowledge concerning the quantitative regulation of gene
expression, although with some limitations. The main limitation is that huge biological sequence regions cannot
be explored using experimental or computational techniques [64]. Although recent NGS technology has provided
great knowledge in the gene-regulation field, the majority of natural mRNA screening approaches still utilise
chromatin accessibility, ChIP-seq and DNase-seq information; they focus on studying promoter regions. Therefore,
a robust method is required to understand the relationship between the various regions of gene regulatory
structures and their expression networks [65]. Likewise, current technology in RNA sequencing has enabled the
direct sequencing of single cells, known as single-cell RNA sequencing (scRNA-seq), which permits querying
biological systems at unprecedented resolution. For example, scRNA-seq data provide valuable information on
cellular heterogeneity that could expand the interpretation of human diseases and biology [66, 67]. The major
applications of scRNA-seq data involve detecting the type and state of cells [68, 69]. However, the two main
computational questions are how to cluster the data and how to retrieve them [70].
Deep learning has enabled essential progress in constructing predictive methods linking regulatory sequence
elements to molecular phenotypes [71–74]. Just recently, Gundogdu and colleagues (2022) demonstrated an
excellent classification model based on deep neural networks (DNNs). It integrates numerous types of prior
biological information on functional networks between genes to learn a biologically meaningful representation of
the scRNA-seq data [70]. Moreover, Li et al. (2020) presented DESC, an unsupervised deep learning algorithm
implemented in Python, which iteratively learns cluster-specific gene expression representations and cluster
assignments for scRNA-seq analysis [75]. Further, a deep learning model has also been applied to single-cell
sequencing data: a deep neural network (DNN) model designed to measure immune infiltration in both colorectal
and breast cancer bulk scRNA-seq data. This approach permits quantifying particular types of immune cells, such
as CD8+ and CD4 Tmem cells, plus the general population of lymphocytes, together with stromal content and B cells
[76].
Recently, Jaganathan et al. (2019) constructed SpliceAI, a deep residual neural network that predicts splice
function using only pre-mRNA transcript sequences as inputs. Its architecture contains 32 dilated convolutional
layers employed to identify sequence determinants spanning enormous genomic distances, since splice donors and
splice acceptors can be separated by tens of thousands of nucleotides [71].
Many experimental datasets, such as those from ChIP-seq and DNase-seq assays, do not measure the effects on gene
expression directly; however, they are an ideal complement to deep neural network methods. For instance, Movva
et al. (2019) introduced the MPRA-DragoNN model, based on a CNN architecture, for predicting and analysing the
transcriptional regulatory activity of non-coding DNA sequences measured by massively parallel reporter assays
(MPRAs). The Sharpr-MPRA evaluation used approximately 16 K distinct regulatory regions in K562 and HepG2 cell
lines, consisting of 295 bp cis-regulatory elements cloned upstream of either a minimal promoter or a strong
promoter [77]. A recent DL model introduced by Agarwal and Shendure, named Xpresso, is a deep convolutional
neural network (CNN) that jointly models the promoter sequence and its related mRNA stability features to
predict mRNA expression levels. Interestingly, Xpresso models are simple to train on arbitrary cell types, even
those lacking experimental information such as ChIP and DNase data [73]. Zhang Z. et al. (2019) developed a deep
learning-based model called DARTS (deep learning augmented RNA-seq analysis of transcript splicing), which uses
wide-ranging RNA-seq resources on diverse alternative splicing. It consists of two main modules: a deep neural
network (DNN) and Bayesian hypothesis testing (BHT) [78]. Further DL-based models (specifically, four different
CNN architectures) were designed by Bretschneider et al. (2018) and named the competitive splice site model
(COSSMO), which adapts to various numbers of alternative splice sites and precisely estimates them via
genome-wide cross-validation. The frameworks consist of convolutional layers, communication layers, long
short-term memory (LSTM) and residual networks, respectively, to discover relevant motifs from DNA sequences. For
every putative splice site, the model inputs are DNA and RNA sequences with 80-nucleotide-wide windows around
the alternative splice sites and the opposite constitutive splice sites, together with the intron length. The
outputs of

Table 3 Genomic tools/algorithms based on deep learning architecture for gene expression regulation
Tools | DL model | Application | Input/Output | Website/Code Source | References
DanQ | CNN + BLSTM | To predict DNA function directly from sequence data | .mat/.mat | https://github.com/uci-cbcl/DanQ | [152]
SPEID | CNN + LSTM | For enhancer–promoter interaction (EPI) prediction | .mat/.mat | https://github.com/ma-compbio/SPEID | [153]
EP2vec | NLP + GBRT | To predict enhancer–promoter interactions (EPIs) | CSV/CSV | https://github.com/wanwenzeng/ep2vec | [154]
D-GEX (deep learning for gene expression) | FNN | To understand the expression of target genes from the expression of landmark genes | .cel, txt, BAM/txt | https://github.com/uci-cbcl/D-GEX | [155]
DeepExpression | CNN | To predict gene expression using promoter sequences and enhancer–promoter interactions | .txt/.txt | https://github.com/wanwenzeng/DeepExpression | [156]
DeepGSR | CNN + ANN | To recognise various types of genomic signals and regions (GSRs) in genomic DNA (e.g. splice sites and stop codon) | FASTA/.txt | https://zenodo.org/record/1117159#.Xp4B4y2B1p8 | [157]
SpliceAI | CNN | To identify splice function from pre-mRNA sequencing | VCF/VCF | https://github.com/Illumina/SpliceAI | [71]
SpliceRover | CNN | For splice site prediction | FASTA/.txt | N/A | [158]
Splice2Deep | CNN | For splice site prediction in genomic DNA | FASTA/.txt | https://github.com/SomayahAlbaradei/Splice_Deep | [29]
DeepBind | CNN | To characterise DNA- and RNA-binding protein specificity | FASTA/.txt | https://github.com/MedChaabane/DeepBind-with-PyTorch | [111]
Gene2vec | NLP | To produce a representation of genes distribution and predict gene–gene interaction | .txt/.txt | https://github.com/jingcheng-du/Gene2vec | [130]
MPRA-DragoNN | CNN | To predict and analyse the regulatory DNA sequences and non-coding genetic variants | N/A | https://github.com/kundajelab/MPRA-DragoNN | [77]
BiRen | CNN + GRU + RNN | For enhancer prediction | BED, BigWig/CSV | https://github.com/wenjiegroup/BiRen | [159]
APARENT (APA REgression NeT) | CNN | To predict and engineer the human 3' UTR Alternative Polyadenylation (APA) and annotate pathogenetic variants | FASTA/CSV | https://github.com/johli/aparent | [72]
LaBranchoR (LSTM Branchpoint Retriever) | BLSTM | To predict the location of RNA splicing branchpoint | FASTA/FASTA | https://github.com/jpaggi/labranchor | [160]
COSSMO | CNN, BLSTM + ResNet | To predict the splice site sequencing and splice factors | TSV, CSV/CSV | http://cossmo.genes.toronto.edu/ | [79]
Xpresso | CNN | To predict gene expression levels from genomic sequence | FASTA/.txt | https://github.com/vagarwal87/Xpresso | [73]
DeepLoc | CNN + BLSTM | To predict subcellular localisation of protein from sequencing data | FASTA/prediction score | https://github.com/JJAlmagro/subcellular_localization | [161]
SPOT-RNA | CNN | To predict RNA Secondary Structure | FASTA/.bpseq, .ct and .prob | https://github.com/jaswindersingh2/SPOT-RNA/ | [162]
DeepCLIP | CNN + BLSTM | For predicting the effect of mutations on protein–RNA binding | FASTA/.txt | https://github.com/deepclip/deepclip | [163]
DECRES (DEep learning for identifying Cis-Regulatory ElementS) | MLP + CNN | To predict active enhancers and promoters across the human genome | FASTA/.txt | https://github.com/yifeng-li/DECRES | [74]
DeepChrome | CNN | For prediction of gene expression levels from histone modification data | Bam/TSV | https://github.com/QData/DeepChrome | [164]
DARTS DNN + BHT Deep learning augmented .txt https://​github.​com/​Xingl​
RNA-seq analysis of tran- ab/​DARTS
script splicing

the model are predictions of percent selected index (PSI) distribution of every putative
splice-site. All COSSMO models outperformed MaxEntScan; however, there were large performance
variances among the four frameworks, in which the recurrent LSTM reached the best accuracy
over the communication networks, which did not consider the splice-site ordering [79].
Nevertheless, deep learning models offer unprecedented opportunities to automatically learn
the relationships among heterogeneous datasets in imperfect biological situations.

Epigenomics

This section discusses some epigenomics challenges and summarises up-to-date deep learning
models in epigenomics, their implementation, data types and source code (Table 4).
Modifications in phenotypes that are not based on genotype modifications are referred to as
epigenetics. It is defined as the study of heritable modifications in gene expression that do
not involve DNA sequence modifications [80]. Epigenomic mechanisms, including DNA methylation,
histone modifications and non-coding RNAs, are considered fundamental in understanding disease
development and finding new

Table 4 Genomic tools/algorithm based on deep learning architecture for epigenomics

Tools | DL model | Application | Input/Output | Website Code Source | References
DeepSEA | CNN | To predict multiple chromatin effects of DNA sequence alterations | N/A | https://github.com/Team-Neptune/DeepSea | [165]
FactorNet | CNN + RNN | To predict the cell-type specific transcriptional binding factors (TF) | BED / BED, gzipped bedgraph file | https://github.com/uci-cbcl/FactorNet | [120]
DeMo (Deep Motif Dashboard) | CNN + RNN | For transcription factor binding site (TFBS) prediction by classification task | FASTA / txt | https://github.com/const-ae/Neural_Network_DNA_Demo | [166]
DeepCpG | CNN + GRU | To predict the methylation states from single-cell data | TSV / TSV | https://github.com/cangermueller/deepcpg | [83]
DeepHistone | CNN | To accurately predict histone modification sites based on sequences and DNase-Seq (experimental) data | txt, CSV / CSV | https://github.com/ucrbioinfo/DeepHistone | [84]
DeepTACT | CNN | To predict 3D chromatin interactions | CSV / CSV | https://github.com/liwenran/DeepTACT | [167]
Basenji | CNN | To predict cell-type-specific epigenetic and transcriptional profiles in large mammalian genomes | FASTA / VCF | https://github.com/calico/basenji | [114]
Deopen | CNN | To predict the chromatin accessibility from DNA sequence; downstream analysis also included QTL analysis | BED, hkl / hkl | https://github.com/kimmo1019/Deopen | [31]
DeepFIGV (Deep Functional Interpretation of Genetic Variants) | CNN | To predict impact on chromatin accessibility and histone modification | FASTA / TSV | http://deepfigv.mssm.edu | [62]

treatment targets. In clinical implementations, however, epigenetics has yet to be fully
employed. Recently, advances in next-generation sequencing and microarray technology for
producing epigenetic data have complicated the development of data interpretation tools. The
insufficiency of suitable and efficient computational approaches has led current research to
focus on one specific epigenetic mark at a time, although several marks and genotypes interact
in vivo [81]. Several previous studies have disclosed the fundamental applications of deep
learning models in epigenomics. They achieved considerable success in predicting 3D chromatin
interactions, methylation status from single-cell datasets and histone modification sites
based on DNase-Seq data [62, 82–84].
Liu et al. (2018) introduced a hybrid deep CNN model, Deopen, which was applied to predict
chromatin accessibility within a whole genome from learned regulatory DNA sequence codes. In
order to analytically evaluate Deopen's function in capturing the accessibility codes of a
genome, a series of experiments were conducted from the perspective of binary classification
[31]. As an example of Deopen applications, in the androgen-sensitive human prostate
adenocarcinoma cell lines (LN-CaP), the EGR1 recovered by the Deopen model is assumed to play
a critical role as a treatment target in gene therapy for prostate cancer [31, 85]. Recently,
Yin et al. (2019) proposed the DeepHistone framework, a CNN-based algorithm to predict the
histone modifications to various site-specific markers. For precise predictions, this model
combines DNA sequence data with chromatin accessibility information. It has shown the
capability to discriminate functional SNPs from their adjacent genetic variants, and thus
could be utilised for investigating the functional impacts of putative disorder-related
variants [84]. Hence, efficient deep learning models are necessary for genome research to
elucidate the impact of epigenomic modifications on the downstream outputs.

Pharmacogenomics
We listed the most discussed deep learning pharmacogenomics models, their common purposes,
input/output formats and the source of code (Table 5). Although there has been great interest
in deep learning approaches in the last few years, until very recently, deep learning tools
have rarely been employed for pharmacogenomics problems, such as predicting drug response
[86]. Knowledge concerning the association between genetic variants in enormous gene clusters
up to whole genomes and the impacts of varying drugs is called pharmacogenomics [87]. A key
challenge in modern therapeutic methods is understanding the underlying mechanisms of
variability. Sometimes the distribution of medication response across a certain population is
evidently bimodal, suggesting a dominant function for one variable, which is usually genetic.
Nonetheless, an understanding of the underlying mechanisms of pharmacokinetics or
pharmacodynamics could be utilised to detect candidate genes, wherein the function of those
gene variants could explain various drug reactions [88]. Clinical experiments are prone to
various errors during the investigation of drug combination efficiency, which is time- and
cost-intensive. Besides, they could expose the patient to excessively risky therapy [89, 90].
In order to identify alternative drug synergy strategies without harming patients,
high-throughput screening (HTS) using several concentrations of a

Table 5 Genomic tools/algorithm based on deep learning architecture for pharmacogenomics

Tools | Function | DL model | Application | Input/Output | Website Code Source | References
DeepDR | Drug Repositioning | DNN | To translate pharmacogenomics features identified from in vitro drug screening to predict the response of tumours | txt / txt | https://github.com/ChengF-Lab/deepDR | [97]
DNN-DTI (Drug–target interaction prediction) | Database | DNN | To predict drug-target interaction | txt / txt | https://github.com/JohnnyY8/DNN-DTI | [168]
DeepBL | Antibiotic Resistance | CNN | To predict the beta-lactamase (BLs) using protein or genome sequence datasets | FASTA / CSV | http://deepbl.erc.monash.edu.au | [98]
DeepDrug3D | Binding Site for drugs | CNN | To characterise and classify the protein 3D binding pockets | pdb / txt | https://github.com/pulimeng/DeepDrug3D | [115]
DrugCell | Drug response and synergy for cancer cells | CNN | To predict drug response and synergy | txt / txt | https://github.com/idekerlab/DrugCell | [26]
DeepSynergy | Anticancer drug synergy | FNN | To predict anticancer drug synergy | CSV / CSV | https://github.com/KristinaPreuer/DeepSynergy | [95]

couple of drugs applied to a cancer cell line is utilised [91]. Utilising existing HTS synergy
datasets allowed the use of accurate computational models to investigate an enormous
synergistic space. Such reliable models would provide direction for both in vitro and in vivo
studies, and they are great steps towards personalised medicine, for instance, approaches for
predicting anticancer synergy based on systems biology [92], kinetic methods [93] and in
silico models of gene expression screening after single-drug and dose-response treatments
[94]. Nonetheless, these approaches are limited to particular targets, pathways or certain
cell lines and sometimes need a particular omics dataset of cell lines treated with specific
compounds [95].
To investigate these pharmacogenomics associations, statistical tests, such as the analysis of
variance (ANOVA), are utilised. These can identify, for example, oncogenic changes that occur
in patients, which are indicators of drug-sensitivity variances in cell lines. In order to
move beyond drug associations to actual predictions of drug response, numerous statistical and
machine learning methods can be employed, from linear regression models to nonlinear ones,
such as kernel methods, neural networks and SVM. A central weakness of these approaches is the
massive number of input features alongside the low number of samples: as in standard gene
expression analysis, the total number of input genes (or features) exceeds the number of
samples. An up-to-date strategy to overcome the low sample number issue is to use multitask
models [96].
Deep learning methods are reportedly well suited to treatment response prediction tasks based
on cell-line omics datasets [95, 97]. One example is DrugCell, an interpretable visible neural
network (VNN) model of the structure and function of human cancer cells in therapy response.
It couples the model's inner workings to the structure of human cell biology, permitting the
prediction of any drug response in any cancer and the rational design of effective treatment
combinations. DrugCell was developed to capture both elements of therapy response in an
explainable model with two branches, the VNN integrating cell genotype and the artificial
neural network (ANN) integrating drug structure. The inputs of the first branch, the VNN,
comprise text files of the hierarchical association between molecular subsystems in human
cells, which contain 2086 biological process terms from the Gene Ontology (GO) database. The
second branch is a conventional ANN whose inputs are text files of the Morgan fingerprint of
the drug, a canonical vector representation of its chemical structure. The outputs from these
two branches were combined into a single layer of neurons that produced the response of a
given genotype to a certain therapy. Assessing the prediction accuracy for each drug
separately revealed a sub-population of drugs that were predicted with significant accuracy.
This performance is competitive with the state-of-the-art regression methods applied in
previous models to predict the drug response. Additionally, DrugCell substantially
outperformed a parallel neural network model trained merely on drug structure and tissue
label. This means that DrugCell has learned information from somatic mutations beyond what the
tissue-only method captures [26]. Another recent model called DeepBL is based on a deep
learning architecture implemented with the Small VGGNet structure (a type of CNN) and the
TensorFlow library. This approach detects the beta-lactamases (BLs) and their varieties that
provide resistance to beta-lactam antibiotics, with protein sequences as inputs. It is based
on large, well-annotated RefSeq datasets covering > 39 K BLs extracted from the NCBI database.
Compared with conventional machine learning-based algorithms, including SVM, RF, NB and LR,
DeepBL outperformed them on an independent test set comprising more than 10 K sequences [98].
Until very recently, deep learning applications in pharmacogenomics remained underexplored.

Deep learning algorithms/techniques used in genomics
The success of the recent models discussed in the section on deep learning
tools/software/pipelines in genomics suggests that deep learning is a powerful technique in
genomic research. Here, we focus on deep learning algorithms recently applied in genomic
applications: convolutional neural networks (CNNs), feedforward neural networks (FNN), natural
language processing (NLP), recurrent neural networks (RNNs), long short-term memory networks
(LSTMs), bidirectional long short-term memory networks (BLSTMs) and gated recurrent unit (GRU;
Table 6; Fig. 1).
Deep learning is a contemporary and rapidly expanding subarea of machine learning. It
endeavours to model concepts from wide-ranging data by employing multi-layered DNNs, hence
making sense of data such as pictures, sounds and texts. Generally, deep learning has two
features: first, nonlinear processing units are arranged in multiple layers, and second,
feature extraction on each layer is either supervised or unsupervised [99]. In the 1980s, the
initial deep learning architecture was constructed on artificial neural networks (ANNs) [100],
but the real power of deep learning emerged in 2006 [101, 102]. Since then, deep learning has
been applied in various fields, including genomics, bioinformatics, drug discovery, automated
speech recognition, image recognition and natural language processing [6, 13, 103].
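The two-branch design described for DrugCell in this section, in which a genotype branch and a drug-structure branch are merged into a single layer that outputs a response, can be illustrated with a minimal Python sketch. This is not the DrugCell implementation: the layer sizes, the random untrained weights and the names init_layer, dense and predict_response are assumptions made for illustration, and the genotype branch here is a plain dense layer rather than a GO-structured VNN.

```python
import math
import random

random.seed(0)

def init_layer(n_in, n_out):
    """Return small random weights (n_in x n_out) and zero biases."""
    weights = [[random.gauss(0.0, 0.1) for _ in range(n_out)] for _ in range(n_in)]
    return weights, [0.0] * n_out

def dense(x, weights, biases, activation=math.tanh):
    """Fully connected layer: activation(x . W + b) for one input vector."""
    return [
        activation(sum(xi * row[j] for xi, row in zip(x, weights)) + biases[j])
        for j in range(len(biases))
    ]

# Hypothetical sizes: 30 gene-mutation flags and a 64-bit drug fingerprint.
N_GENES, N_BITS, HIDDEN = 30, 64, 8

geno_layer = init_layer(N_GENES, HIDDEN)   # genotype branch (assumed dense)
drug_layer = init_layer(N_BITS, HIDDEN)    # drug-structure branch
out_layer = init_layer(2 * HIDDEN, 1)      # merged output layer

def predict_response(genotype, fingerprint):
    """Embed both inputs, concatenate the embeddings, output one value."""
    h_geno = dense(genotype, *geno_layer)
    h_drug = dense(fingerprint, *drug_layer)
    merged = h_geno + h_drug               # list concatenation, length 2 * HIDDEN
    return dense(merged, *out_layer, activation=lambda z: z)[0]

# Toy binary inputs standing in for a mutation profile and a Morgan fingerprint.
genotype = [random.randint(0, 1) for _ in range(N_GENES)]
fingerprint = [random.randint(0, 1) for _ in range(N_BITS)]
response = predict_response(genotype, fingerprint)
print(f"predicted response (untrained): {response:.4f}")
```

Because the weights are untrained, the printed value carries no biological meaning; the sketch only shows how the two embeddings are concatenated before the output layer.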
Table 6 Deep learning algorithms in genomics and their original development and applications

ANN Algorithms | Natural Language Processing (NLP) | Feedforward neural network | Convolutional neural network (CNN) | Recurrent neural networks (RNNs) | Bidirectional long short-term memory networks (BLSTMs) | Long short-term memory networks (LSTMs) | Gated recurrent unit (GRU)
Algorithm Inventor | Applied in dictionary look-up system developed at Birkbeck College, London | Frank Rosenblatt | It was named as "neocognitron" by Fukushima | Rumelhart, Hinton and Williams | Schuster and Paliwal | Hochreiter and Schmidhuber | Cho et al
Year of Development | 1948 | 1958 | 1980 | 1986 | 1997 | 1997 | 2014
Year of Initial Genomics' Function | 1996 | 1993 | 2015 | 2005 | 2015 | 2015 | 2017
First User in Genomics | Schuler et al | S Eskiizmililer | Alipanahi et al | Maraziotis, Dragomir and Bezerianos | Quang and Xie | Quang and Xie | Angermueller et al
First Genomic Application | Entrez databases | Karyotyping architecture based on Artificial Neural Networks | DeepBind | Predicting the complicated causative associations between genes from microarray datasets based on recurrent neuro-fuzzy technique | DanQ model | DanQ model | DeepCpG
Genomic Function Exemplar(s) | Genetic counsellors AI-based chatbots and EPIs prediction | Karyotyping, Prenatal diagnostic for early detection of aneuploidy syndrome | Prediction of variant impacts on expression and disease risk, predicting drug response of tumours from genomic profiles, and pharmacogenomics | Predicting transcription factor binding sites, for Alignment and SNV identification | DNA function predictions and prediction of protein localisation, predict miRNA precursor | Enhancer–promoter interaction (EPI) prediction | Enhancers and methylation states predictions
Landmark References | [128, 169, 170] | [171–173] | [97, 111, 174–176] | [24, 116, 118, 177, 178] | [122, 123, 179, 180] | [16, 121, 123] | [126, 181]

Artificial neural networks (ANNs) were motivated by the human brain's neurons and their
networks [104]. They consist of clusters of fully connected nodes, or neurons, mimicking the
propagation of stimuli across synapses in the brain. This architecture of deep learning
networks is utilised for feature extraction, classification, dimensionality reduction or as
sub-elements of a deeper framework such as CNNs [105].
Multi-omics studies generate huge volumes of data, as mentioned earlier, largely because of
the progress in genomics and improvements in biotechnology. Typical examples involve
high-throughput technologies that measure thousands of gene expression levels or non-coding
transcripts, such as miRNAs. Moreover, genotyping platforms, NGS techniques that generate
quantitative gene expression reports, such as RNA-Seq, and the associated GWAS discover
numerous genetic variants, together with further genomic modifications, in various populations
[11]. However, some DL models rely purely on DNA sequence data and therefore seemingly lack
the power to make cell-line-specific predictions, because the DNA sequence is identical across
cell lines. In order to overcome this deficiency, several hybrid deep learning models have
been proposed and showed clear improvement in certain studies by combining DNA sequence data
with information from biological experiments [84].
Feedforward Neural Networks (FNNs) are a type of artificial neural network in which
information flows in one forward direction, starting from the input layer, crossing the hidden
layers and reaching the output layer, without forming loops as in RNNs [106]. They are used in
genomics to infer the expression of target genes from the expression of landmark genes using
the D-GEX model [12]. Moreover, active enhancers and promoters have been predicted across the
human genome utilising the DECRES model [107], and anticancer drug synergy predictions have
been made via the DeepSynergy model [95].
Convolutional Neural Networks (CNNs) Also called ConvNet, CNN is a deep learning algorithm
that has a deep feedforward architecture consisting of various building blocks, such as
convolution layers, pooling layers and fully connected layers [97, 108]. In its fully
connected layers, each node is connected to every node of the next layer, whereas the
convolution units receive their input from units of the preceding layer; together, these
layers generate a prediction. The key principle of such a deep architecture is that its
extensive processing and connectivity allow it to infer nonlinear associations between inputs
and outputs [109, 110]. CNNs have most commonly been applied to the analysis of images and
were initially conceived as fully automated image interpreters for classifying handwritten
characters [105].
For genomic functions, CNNs are considered the dominant algorithm applied to genomic
information (Fig. 2). The first CNN implementation, DeepBind, was proposed by [111] for
predicting protein binding and showed greater prediction power than conventional models
(Table 6). Further examples in which CNN is used as a single algorithm in gene expression and
regulation include the DeepExpression model, which has been effectively used to predict gene
expression using promoter sequences and enhancer–promoter interactions [112]. The SpliceAI
model was introduced to identify splice function from pre-mRNA sequencing [71]. Further, the
SPOT-RNA model was developed for predicting RNA secondary structure [16]. CNN was also used in
DNA sequencing to call genetic variants, as in the Clairvoyante, Intelli-NGS and DeepSV models
[52, 54, 113]. In epigenomics, the DeepTACT model was used for predicting the 3D chromatin
interactions [82], and the Basenji model was employed for predicting cell-type-specific
epigenetic and transcriptional profiles in large mammalian genomes [114]. In disease variants,
the ExPecto model was used to predict tissue-specific transcriptional effects of
mutations/functions [32], and the DeepWAS model was used to identify disease or
trait-associated SNPs [19]. Finally, in pharmacogenomics applications, CNN was utilised to
create the DrugCell model for drug response and synergy predictions [26]. Additionally, the
DeepDrug3D model was developed for characterising and classifying the 3D protein binding
pockets [115].
CNN algorithms have also been combined with other algorithms to build efficient approaches: in
epigenomics, CNN was combined with GRU to predict the methylation states from single-cell data
[83], while in gene expression and regulation, [74] linked CNN algorithms with MLP in the
DECRES model to predict active enhancers and promoters across the human genome. Besides, [116]
used CNN with RNN algorithms in a DNA sequencing application to create the DAVI model and
identify NGS read variants.
Recurrent neural networks (RNNs) are ANNs with recurrent layers whose feedback connections
enable state updates based on past and current inputs. They are distinguished by the internal
cyclic connections between recurrent layer units and are designed for sequential datasets
[117, 118]. Recurrent neural networks have regularly been employed for tasks that involve
learning from sequence data, such as language translation and speech recognition. However,
they have not been utilised widely on DNA sequencing data, which is the data style

where the order of the bases is crucial for its assessment [119]. Maraziotis et al. [24]
pioneered the use of RNNs in genomics, applying a recurrent neuro-fuzzy protocol to microarray
experimental data to infer the complicated causal relationships between genes by predicting
the time series of gene expression (Table 6).
Most RNNs are applied in genomics combined with other algorithms, such as CNNs. For example,
to identify NGS read variants, the DAVI model introduced the combination of CNN and RNN
algorithms [116]. The FactorNet model was designed based on both CNN and RNN algorithms to
predict cell-type-specific transcription factor binding sites (TFBSs) [120]. CNN algorithms
are well suited to capturing local DNA sequence patterns; in contrast, RNN derivatives, such
as LSTM, are better suited to capturing long-distance dependencies within sequence data [119].
Long short-term memory networks (LSTMs) are recurrent networks whose cells use "gates" to
handle long-term dependency tasks [118]. They are designed to avoid the long-term dependency
problem through their ability to learn long-term dependencies. The core LSTM unit has a memory
cell, an input gate, an output gate and a forget gate. The cell retains values over arbitrary
time intervals, whereas the gates control the flow of information into and out of the cell
[121]. An early implementation of LSTM algorithms in genomics was the SPEID model, which
combined LSTM and CNN for EPI predictions (Table 6; [18]). Park et al. [122] developed
DeepMiRGene, a fusion of the RNN and LSTM models, to predict miRNA precursors.
Bidirectional Long Short-Term Memory Networks (BLSTMs) In BLSTM, two RNNs with two hidden
layers (forward and backward layers) are trained in both time directions in parallel to
exploit context from both directions, which cannot be accomplished via standard RNNs [118].
Quang et al. [123] introduced the DanQ model, the first use of BLSTM in genomics, which
predicts DNA function directly from sequence data and is built from CNN and BLSTM layers
(Table 6). Later, [124] presented DeepCLIP, also utilising CNN and BLSTM, to predict the
effect of mutations on protein–RNA binding.
Gated Recurrent Unit (GRU) is categorised as a variant of the LSTM algorithm whose cell has
only "two gates": the update gate and the reset gate [118]. The update gate controls how much
of the previous hidden state is carried forward, while the reset gate controls how much of it
is used to compute the new candidate state [125]. It was initially applied in gene expression
and regulation by [126], who presented the BiRen model, an architecture consisting of RNNs,
CNNs and GRUs, to predict enhancers (Table 6). Later, the DeepCpG model appeared, combining CNN
and GRU frameworks to predict the methylation states from single-cell data [83].
Natural Language Processing (NLP) examines the use of computers to understand human languages
for the purpose of performing useful tasks [127]. In the field of NLP, the "distributed
representations" technique is utilised in several state-of-the-art DL models [128]. For
example, the word2vec model is a successful NLP model that uses a distributed representation
process known as "neural embedding", so called because the embedding is typically learned by a
neural network with numerous parameters. The aim of word embedding is to provide a linear
mapping that represents each single word as a distinct vector in continuous space, which makes
it amenable to backpropagation-based methods in neural networks [129]. In gene expression and
regulation, Du et al. (2019) developed the Gene2vec model, a distributed representation of
genes. It uses the natural contexts of genes, namely their expression and co-expression
patterns from GEO data. A multilayer neural network that uses the gene embedding as a key
layer predicts gene–gene interactions with an AUC score of 0.72. This is an interesting
outcome because the only model input is the names of two genes. Thus, the distributed
representation of genes carries rich information about gene function [130]. Another NLP
implementation in the same field was shown by Zeng et al. (2018), who combined NLP with GBRT
and introduced the EP2vec model to predict EPIs.
Graph Neural Network (GNN) Owing to the emerging biological network data sets in genomics, the
graph neural network has evolved into an important deep learning method to tackle these data
sets [131]. GNN was proposed by Gori et al. (2005) as a novel neural network model to tackle
graph-structured data [132]. Among the many applications of GNN in analysing multi-omics data,
a few salient ones are disease gene prediction, drug discovery, drug interaction networks,
protein–protein interaction networks and biomedical imaging. GNN is capable of modelling both
molecular structure data [133] and biological network data [134].

Deep learning resources for genomics
We collected the most efficient user-friendly genomic resources developed based on deep
learning architectures (Table 7). The adoption of various deep learning solutions and models
is still limited, despite the enormous success of these tools in genomics and bioinformatics.
One reason for this is the lack of published deep learning-based protocols that can adapt to
new, heterogeneous datasets requiring significant data engineering [135]. In genomics,

Table 7 Deep learning packages and resources

Resource Name | Category | Application | Date created | Link | Free/paid
Libraries
Janggu (a) | Python package | Facilitates deep learning in the context of genomics | 2020 | https://github.com/BIMSBbioinfo/janggu | Free
ExPecto (a) | Python-based repository | Contains code for predicting expression effects of human genome variants ab initio from sequence | 2018 | https://github.com/FunctionLab/ExPecto | Free
Selene (a) | PyTorch-based Library | A library for biological sequence data training and model architecture development | 2019 | https://selene.flatironinstitute.org/ | Free
Pysster (a) | TensorFlow-based Library | Used for learning sequence and structure motifs in biological sequences using convolutional neural networks | 2018 | https://github.com/budach/pysster | Free
Kipoi (a) | Python package | Kipoi is an API and a repository of ready-to-use trained models for genomics | 2019 | https://github.com/kipoi/kipoi, http://kipoi.org/ | Free
Compute platform
Google Colaboratory (Colab) | PnP GPUs | Colab allows anybody to write and execute arbitrary python code through the browser, and is especially well suited to machine learning, data analysis and education | 2017 | https://colab.research.google.com/ | Free
IBM Cloud | Cloud service | Cloud computing platform; design complex neural networks, then experiment at scale to deploy optimised learning models within IBM Watson Studio | 2011 | https://www.ibm.com/cloud | Free tier, Cost tier
Google CloudML | PnP GPUs | For extreme scalability in the long run | 2008 | https://cloud.google.com/ai-platform | Paid
Vertex AI | AI platform | Google Cloud's new unified ML platform | 2021 | https://cloud.google.com/vertex-ai |
Amazon EC2 | Cloud service | A website facility which delivers secure, scalable compute power in the cloud | 2006 | https://aws.amazon.com/ec2/ | Free, Paid
(a) These deep learning libraries/packages are specific to genomic applications

high-throughput data (e.g. WGS, WES, RNA-seq, ChIP-seq, etc.) are utilised to train neural
networks and have become typical for disease predictions or understanding regulatory genomics.
Similarly, developing new DL models and testing current models on new datasets face great
challenges due to the lack of comprehensive, generalisable, practical deep learning libraries
for biology [136]. In this respect, software frameworks and genomic packages are necessary to
allow rapid progress in addressing a novel research question or hypothesis, combining original
data or investigating different neural network structures [135]. In order to facilitate the DL
model implementation in genomics, the following software packages or libraries could become
critical for genomic scientists and biomedical researchers.
Janggu is a deep learning python library based on deep CNN for genomic applications. It aims
to facilitate data acquisition and model evaluation by supporting flexible prototyping of
neural network models. The Janggu library provides three use cases: predicting transcription
factor binding, using and improving published deep learning architectures, and predicting the
normalised CAGE-tag counts of promoters. This library offers easy access and pre-processing to
convert data from standard file formats (e.g. FASTA, BAM, BigWig, BED and narrowPeak) to
BigWig files [135].
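Sequence-based models such as those listed in Table 3 commonly take one-hot encoded DNA as input, and genomics libraries of this kind automate that pre-processing step. The following minimal sketch, assuming a toy FASTA string and hypothetical helper names (parse_fasta, one_hot), shows the idea in plain Python; it does not use or reproduce the Janggu API.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {record_id: sequence} dictionary."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record identifier is the first word after '>'.
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            # Sequence lines of one record are concatenated and upper-cased.
            records[name] += line.upper()
    return records

ALPHABET = "ACGT"

def one_hot(sequence):
    """Encode a DNA string as rows of four 0/1 values in A, C, G, T order.

    Any other symbol, such as N, becomes an all-zero row (an assumed convention).
    """
    return [[1 if base == letter else 0 for letter in ALPHABET] for base in sequence]

# Toy input; the record names and sequences are made up for illustration.
fasta_text = """>seq1 toy example
ACGT
NNAC
>seq2
ggta
"""

records = parse_fasta(fasta_text)
encoded = {name: one_hot(seq) for name, seq in records.items()}
print(encoded["seq2"])  # [[0, 0, 1, 0], [0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0]]
```

Each sequence becomes a length-by-4 matrix, which is the shape a convolution layer would scan for sequence motifs.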

Selene is a deep learning library based on PyTorch for biological sequence data training and model architecture development. Selene supports the prediction of genetic variant effects and visualises the variant scores as a Manhattan plot. It also automatically generates training, testing and validation splits from the given input dataset. Further, Selene automatically trains the model on the data and can evaluate it on a test set, producing a figure that displays the model’s performance [137].

ExPecto is a variant prioritisation model for predicting gene expression levels from a broad regulatory region (~40 kb) of promoter-proximal sequence. It relies on a CNN to convert the input sequences into epigenomic features. ExPecto facilitates prediction for rare or previously unobserved variants because its design does not utilise any variant information during the training process. ExPecto processes VCF files and outputs CSV files [138].

Pysster is a Python library based on CNN for training and classification of biological sequence data. Pysster provides automatic hyperparameter optimisation and motif visualisation options, along with motif position and class enrichment information [139].

Kipoi (Greek for “gardens”; pronounced “kípi”) is a genomic repository for sharing and reusing trained genome-related models. Kipoi provides more than 2,000 distinct trained models from 22 different studies covering significant predictive genomic tasks. These tasks include predicting chromatin accessibility, transcription factor binding and alternative splicing from DNA sequences [136].

Implementation of these deep learning, genome-based libraries/packages requires access to computing power and familiarity with web-based resources (Table 7). Several major cloud-computing platforms offer on-demand GPU access in a user-friendly manner, including Google CloudML, IBM Cloud, Vertex AI and Amazon EC2 [140–142]. These cloud-based machines require user configuration and installation of the appropriate environments for general GPU coding. For users who need to avoid semi-manual setup, plug-and-play (PnP) GPU platforms are offered, such as Google Colaboratory (Colab). Google Colab is considered the simplest alternative: a Python-based notebook that provides free K80 GPU utilisation for 12 continuous hours [143, 144]. Links to the resources (packages/libraries and web platforms) for the application of deep learning in genomics are provided in Table 7.

Conclusion
This manuscript catalogues different deep learning tools/software developed in different subareas of genomics to fulfil the predictive tasks of various genomic analyses. We discussed, in detail, the data types in different genomics assays so that readers could have primary knowledge of the basic requirements to develop deep learning-based prediction models using human genomics datasets. In the later part of the manuscript, different deep learning architectures were briefly introduced to genomic scientists in order to help them decide the deep learning network architecture for their specific data types and/or problems. We also briefly discussed the late application of the deep learning technique in genomics and its underlying causes and solutions. Towards the end of the manuscript, various computational resources, software packages or libraries and web-based computational platforms are provided to act as pointers for researchers to create their very first deep learning model utilising genomic datasets. In conclusion, this timely review holds the potential to assist genomic scientists in adopting state-of-the-art deep learning techniques for the exploration of genomic NGS datasets and analyses. This will certainly be beneficial for biomedicine and human genomics researchers.

Abbreviations
NGS: Next-generation sequencing; WGS: Whole genome sequencing; WES: Whole exome sequencing; SMS: Single-molecule sequencing; RNA-seq: RNA sequencing; ChIP-seq: Chromatin immunoprecipitation sequencing; PacBio: Pacific biosciences; ONT: Oxford nanopore technology; MPRAs: Massively parallel reporter assays; miRNA: MicroRNAs; GWAS: Genome-wide association study; PSI: Percent selected index; HGVS: Human genome variation society; IMSGC: International multiple sclerosis genetics consortium; VUS: Variant of uncertain significance; CADD: Combined annotation dependent depletion; GATK: Genome Analysis Toolkit; BAM: Binary alignment map; VCF: Variant call format; FASTA: Text-based format for either nucleotide sequences or amino acids; BED: Browser extensible data; CSV: Comma-separated values; CAGE: Cap analysis of gene expression; GEO: Gene expression omnibus; EPI: Enhancer–promoter interaction; TFBS: Transcription factor binding sites; DL: Deep learning; ML: Machine learning; DNN: Deep neural network; MLP: Multilayer perceptron; CNN: Convolutional neural networks; RNN: Recurrent neural network; LSTM: Long short-term memory network; BLSTM: Bidirectional long short-term memory network; ANN: Artificial neural network; FNN: Feedforward neural networks; NLP: Natural language processing; GRU: Gated recurrent unit; VGGNet: Visual geometry group networks; GBRT: Gradient boosted regression trees; LR: Linear regression; RF: Random forest; NB: Naive Bayes; DBN: Deep belief networks; SVR: Support vector regression; AUC: Area under the curve; auPR: Area under the precision–recall curve; auROC: Area under the receiver operating characteristic.

Acknowledgements
We duly acknowledge Dr. Mohamed Aly Hussain for his motivation and useful discussion regarding the inception of this review article. We also appreciate Dr. Lamya Alomair for her support during the development of this manuscript.

Author contributions
WA and MR conceptualised this study. WA collected the data and performed investigation. MR supervised this study. WA and MR wrote original draft. All authors read and approved the final manuscript.

Funding
This study is not funded by any funding source.

Availability of data and materials
Not applicable.

Declarations

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Competing interests
The authors declare that they have no competing interests.

Received: 24 November 2021 Accepted: 12 July 2022

References
1. Auffray C, Imbeaud S, Roux-Rouquié M, Hood L. From functional genomics to systems biology: concepts and practices. C R Biol. 2003;326(10–11):879–92.
2. Goldfeder RL, Priest JR, Zook JM, Grove ME, Waggott D, Wheeler MT, et al. Medical implications of technical accuracy in genome sequencing. Genome Med. 2016;8(1):24.
3. Goodwin S, McPherson JD, McCombie WR. Coming of age: Ten years of next-generation sequencing technologies. Nat Rev Genet. 2016;17(6):333–51.
4. Yue T, Wang H. Deep Learning for Genomics: A Concise Overview. 2018.
5. Honoré B, Østergaard M, Vorum H. Functional genomics studied by proteomics. BioEssays. 2004;26(8):901–15.
6. Talukder A, Barham C, Li X, Hu H. Interpretation of deep learning in genomics and epigenomics. Brief Bioinform. 2020;2:447.
7. Fulco CP, Munschauer M, Anyoha R, Munson G, Grossman SR, Perez EM, et al. Systematic mapping of functional enhancer–promoter connections with CRISPR interference. Science. 2016;354(6313):769–73.
8. Kulasingam V, Pavlou MP, Diamandis EP. Integrating high-throughput technologies in the quest for effective biomarkers for ovarian cancer. Nat Rev Cancer. 2010;10(5):371–8.
9. Nariai N, Kolaczyk ED, Kasif S. Probabilistic protein function prediction from heterogeneous genome-wide data. PLoS One. 2007;2(3):e337.
10. Ritchie MD, Holzinger ER, Li R, Pendergrass SA, Kim D. Methods of integrating data to uncover genotype–phenotype interactions. Nat Rev Genet. 2015;16(2):85–97.
11. Koumakis L. Deep learning models in genomics; are we there yet? Comput Struct Biotechnol J. 2020;18:1466–73.
12. Cao C, Liu F, Tan H, Song D, Shu W, Li W, et al. Deep learning and its applications in biomedicine. Genom Proteom Bioinform. 2018;16(1):17–32.
13. Telenti A, Lippert C, Chang PC, DePristo M. Deep learning of genomic variation and regulatory network data. Hum Mol Genet. 2018;27(R1):R63-71.
14. Kopp W, Monti R, Tamburrini A, Ohler U, Akalin A. Deep learning for genomics using Janggu. Nat Commun. 2020;11(1):3488.
15. Deep learning for genomics. Nat Genet. 2019;51(1):1–1.
16. Singh J, Hanson J, Paliwal K, Zhou Y. RNA secondary structure prediction using an ensemble of two-dimensional deep neural networks and transfer learning. Nat Commun. 2019;10(1):5407.
17. Hsieh T-C, Mensah MA, Pantel JT, Aguilar D, Bar O, Bayat A, et al. PEDIA: prioritization of exome data by image analysis. Genet Med. 2019;21(12):2807–14.
18. Singh R, Lanchantin J, Robins G, Qi Y. DeepChrome: deep-learning for predicting gene expression from histone modifications. Bioinformatics. 2016;32(17):i639–48.
19. Arloth J, Eraslan G, Andlauer TFM, Martins J, Iurato S, Kühnel B, et al. DeepWAS: multivariate genotype-phenotype associations by directly integrating regulatory information using deep learning. PLOS Comput Biol. 2020;16(2):e1007616.
20. Rosenblatt F. The perceptron: a probabilistic model for information storage and organization in the brain. Psychol Rev. 1958;65(6):386–408.
21. Sarker IH. Machine learning: algorithms, real-world applications and research directions. SN Comput Sci. 2021;2(3):160.
22. Wang C, Tan XP, Tor SB, Lim CS. Machine learning in additive manufacturing: state-of-the-art and perspectives. Addit Manuf. 2020;36:101538.
23. Muzio G, O’Bray L, Borgwardt K. Biological network analysis with deep learning. Brief Bioinform. 2021;22(2):1515–30.
24. Maraziotis I, Dragomir A, Bezerianos A. Gene networks inference from expression data using a recurrent neuro-fuzzy approach. In: 2005 IEEE Engineering in Medicine and Biology 27th Annual Conference. IEEE; 2005. p. 4834–7.
25. LeCun Y. 1.1 Deep learning hardware: past, present, and future. In: 2019 IEEE International Solid-State Circuits Conference-(ISSCC). IEEE; 2019. p. 12–9.
26. Kuenzi BM, Park J, Fong SH, Sanchez KS, Lee J, Kreisberg JF, et al. Predicting drug response and synergy using a deep learning model of human cancer cells. Cancer Cell. 2020;38(5):672-684.e6.
27. Sundaram L, Gao H, Padigepati SR, McRae JF, Li Y, Kosmicki JA, et al. Predicting the clinical impact of human mutation with deep neural networks. Nat Genet. 2018;50(8):1161–70.
28. Lanchantin J, Singh R, Wang B, Qi Y. Deep motif dashboard: visualizing and understanding genomic sequences using deep neural networks. World Sci. 2017;3:254–65.
29. Albaradei S, Magana-Mora A, Thafar M, Uludag M, Bajic VB, Gojobori T, et al. Splice2Deep: an ensemble of deep convolutional neural networks for improved splice site prediction in genomic DNA. Gene X. 2020;5:100035.
30. Poplin R, Chang PC, Alexander D, Schwartz S, Colthurst T, Ku A, et al. A universal snp and small-indel variant caller using deep neural networks. Nat Biotechnol. 2018;36(10):983.
31. Liu Q, Xia F, Yin Q, Jiang R. Chromatin accessibility prediction via a hybrid deep convolutional neural network. Bioinformatics. 2018;2:1147.
32. Zou J, Huss M, Abid A, Mohammadi P, Torkamani A, Telenti A. A primer on deep learning in genomics. Nat Genet. 2019;51(1):12–8.
33. Al-Stouhi S, Reddy CK. Transfer learning for class imbalance problems with inadequate data. Knowl Inf Syst. 2016;48(1):201–28.
34. Chicco D, Jurman G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genom. 2020;21(1):6.
35. Handelman GS, Kok HK, Chandra RV, Razavi AH, Huang S, Brooks M, et al. Peering into the black box of artificial intelligence: evaluation metrics of machine learning methods. Am J Roentgenol. 2019;212(1):38–43.
36. England JR, Cheng PM. Artificial intelligence for medical image analysis: a guide for authors and reviewers. Am J Roentgenol. 2019;212(3):513–9.
37. Eraslan G, Avsec Ž, Gagneur J, Theis FJ. Deep learning: new computational modelling techniques for genomics. Nat Rev Genet. 2019;20(7):389–403.
38. Zou J, Huss M, Abid A, Mohammadi P, Torkamani A, Telenti A. A primer on deep learning in genomics. Nat Genet. 2019;51(1):12–8.
39. Pérez-Enciso M, Zingaretti LM. A guide for using deep learning for complex trait genomic prediction. Genes (Basel). 2019;10(7):12258.
40. Abnizova I, Boekhorst RT, Orlov YL. Computational errors and biases in short read next generation sequencing. J Proteom Bioinform. 2017;10(1):400089.
41. Ma X, Shao Y, Tian L, Flasch DA, Mulder HL, Edmonson MN, et al. Analysis of error profiles in deep next-generation sequencing data. Genome Biol. 2019;20(1):50.
42. Pfeiffer F, Gröber C, Blank M, Händler K, Beyer M, Schultze JL, et al. Systematic evaluation of error rates and causes in short samples in next-generation sequencing. Sci Rep. 2018;8(1):10950.
43. Horner DS, Pavesi G, Castrignano T, De Meo PD, Liuni S, Sammeth M, et al. Bioinformatics approaches for genomics and post genomics applications of next-generation sequencing. Brief Bioinform. 2010;11(2):181–97.
44. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, et al. The genome analysis toolkit: a mapreduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20(9):1297–303.

45. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25(16):2078–9.
46. Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. Science. 2012;7:4458.
47. Hwang S, Kim E, Lee I, Marcotte EM. Systematic comparison of variant calling pipelines using gold standard personal exome variants. Sci Rep. 2015;5(1):17875.
48. Kotlarz K, Mielczarek M, Suchocki T, Czech B, Guldbrandtsen B, Szyda J. The application of deep learning for the classification of correct and incorrect SNP genotypes from whole-genome DNA sequencing pipelines. J Appl Genet. 2020;61(4):607–16.
49. Kumaran M, Subramanian U, Devarajan B. Performance assessment of variant calling pipelines using human whole exome sequencing and simulated data. BMC Bioinform. 2019;20(1):342.
50. Luo R, Sedlazeck FJ, Lam T, Schatz MC, Kong H, Genome H. Clairvoyante: a multi-task convolutional deep neural network for variant calling in single molecule sequencing. Science. 2018;3:7745.
51. Cai L, Chu C, Zhang X, Wu Y, Gao J. Concod: an effective integration framework of consensus-based calling deletions from next-generation sequencing data. Int J Data Min Bioinform. 2017;17(2):153.
52. Cai L, Wu Y, Gao J. DeepSV: accurate calling of genomic deletions from high-throughput sequencing data using deep convolutional neural network. BMC Bioinform. 2019;20(1):665.
53. Ravasio V, Ritelli M, Legati A, Giacopuzzi E. GARFIELD-NGS: genomic vARiants FIltering by dEep learning moDels in NGS. Bioinformatics. 2018;34(17):3038–40.
54. Singh A, Bhatia P. Intelli-NGS: intelligent NGS, a deep neural network-based artificial intelligence to delineate good and bad variant calls from IonTorrent sequencer data. bioRxiv. 2019;12:879403.
55. Müller H, Jimenez-Heredia R, Krolo A, Hirschmugl T, Dmytrus J, Boztug K, et al. VCF.Filter: interactive prioritization of disease-linked genetic variants from sequencing data. Nucleic Acids Res. 2017;45(W1):W567-72.
56. Eilbeck K, Quinlan A, Yandell M. Settling the score: variant prioritization and Mendelian disease. Nat Rev Genet. 2017;18(10):599–612.
57. Roy S, Coldren C, Karunamurthy A, Kip NS, Klee EW, Lincoln SE, et al. Standards and guidelines for validating next-generation sequencing bioinformatics pipelines. J Mol Diagn. 2018;20(1):4–27.
58. Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, et al. A method and server for predicting damaging missense mutations. Nat Methods. 2010;7(4):248–9.
59. Ng PC. SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Res. 2003;31(13):3812–4.
60. Cooper GM. Distribution and intensity of constraint in mammalian genomic sequence. Genome Res. 2005;15(7):901–13.
61. Boudellioua I, Kulmanov M, Schofield PN, Gkoutos GV, Hoehndorf R. DeepPVP: phenotype-based prioritization of causative variants using deep learning. BMC Bioinform. 2019;20(1):65.
62. Hoffman GE, Bendl J, Girdhar K, Schadt EE, Roussos P. Functional interpretation of genetic variants using deep learning predicts impact on chromatin accessibility and histone modification. Nucleic Acids Res. 2019;3:5589.
63. Tupler R, Perini G, Green MR. Expressing the human genome. Nature. 2001;409(6822):832–3.
64. Zrimec J, Börlin CS, Buric F, Muhammad AS, Chen R, Siewers V, et al. Deep learning suggests that gene expression is encoded in all parts of a co-evolving interacting gene regulatory structure. Nat Commun. 2020;11(1):6141.
65. Zrimec J, Börlin CS, Buric F, Muhammad AS, Chen R, Siewers V, et al. Deep learning suggests that gene expression is encoded in all parts of a co-evolving interacting gene regulatory structure. Nat Commun. 2020;11(1):6141.
66. Angerer P, Simon L, Tritschler S, Wolf FA, Fischer D, Theis FJ. Single cells make big data: new challenges and opportunities in transcriptomics. Curr Opin Syst Biol. 2017;4:85–91.
67. Falco MM, Peña-Chilet M, Loucera C, Hidalgo MR, Dopazo J. Mechanistic models of signaling pathways deconvolute the glioblastoma single-cell functional landscape. NAR Cancer. 2020;2(2):5589.
68. Poulin J-F, Tasic B, Hjerling-Leffler J, Trimarchi JM, Awatramani R. Disentangling neural cell diversity using single-cell transcriptomics. Nat Neurosci. 2016;19(9):1131–41.
69. Darmanis S, Sloan SA, Zhang Y, Enge M, Caneda C, Shuer LM, et al. A survey of human brain transcriptome diversity at the single cell level. Proc Natl Acad Sci. 2015;112(23):7285–90.
70. Gundogdu P, Loucera C, Alamo-Alvarez I, Dopazo J, Nepomuceno I. Integrating pathway knowledge with deep neural networks to reduce the dimensionality in single-cell RNA-seq data. BioData Min. 2022;15(1):1.
71. Jaganathan K, Kyriazopoulou Panagiotopoulou S, McRae JF, Darbandi SF, Knowles D, Li YI, et al. Predicting splicing from primary sequence with deep learning. Cell. 2019;176(3):535-548.e24.
72. Bogard N, Linder J, Rosenberg AB, Seelig G. A deep neural network for predicting and engineering alternative polyadenylation. Cell. 2019;71:9886.
73. Agarwal V, Shendure J. Predicting mRNA abundance directly from genomic sequence using deep convolutional neural networks. Cell Rep. 2020;31(7):107663.
74. Li Y, Shi W, Wasserman WW. Genome-wide prediction of cis-regulatory regions using supervised deep learning methods. BMC Bioinform. 2018;19(1):202.
75. Li X, Wang K, Lyu Y, Pan H, Zhang J, Stambolian D, et al. Deep learning enables accurate clustering with batch effect removal in single-cell RNA-seq analysis. Nat Commun. 2020;11(1):2338.
76. Torroja C, Sanchez-Cabo F. Digitaldlsorter: deep-learning on scRNA-seq to deconvolute gene expression data. Front Genet. 2019;10:77458.
77. Movva R, Greenside P, Marinov GK, Nair S, Shrikumar A, Kundaje A. Deciphering regulatory DNA sequences and noncoding genetic variants using neural network models of massively parallel reporter assays. PLoS One. 2019;71:466689.
78. Zhang Z, Pan Z, Ying Y, Xie Z, Adhikari S, Phillips J, et al. Deep-learning augmented RNA-seq analysis of transcript splicing. Nat Methods. 2019;16(4):307–10.
79. Bretschneider H, Gandhi S, Deshwar AG, Zuberi K, Frey BJ. COSSMO: predicting competitive alternative splice site selection using deep learning. In: Bioinformatics. 2018.
80. Lo Bosco G, Rizzo R, Fiannaca A, La Rosa M, Urso A. A deep learning model for epigenomic studies. In: 2016 12th International Conference on Signal-Image Technology & Internet-Based Systems (SITIS). IEEE; 2016. p. 688–92.
81. Cazaly E, Saad J, Wang W, Heckman C, Ollikainen M, Tang J. Making sense of the epigenome using data integration approaches. Front Pharmacol. 2019;19:10.
82. Li W, Wong WH, Jiang R. DeepTACT: predicting 3D chromatin contacts via bootstrapping deep learning. Nucleic Acids Res. 2019;47(10):e60–e60.
83. Angermueller C, Lee HJ, Reik W, Stegle O. DeepCpG: accurate prediction of single-cell DNA methylation states using deep learning. Genome Biol. 2017;18(1):67.
84. Yin Q, Wu M, Liu Q, Lv H, Jiang R. DeepHistone: a deep learning approach to predicting histone modifications. BMC Genomics. 2019;20(2):193.
85. Baron V, Adamson ED, Calogero A, Ragona G, Mercola D. The transcription factor Egr1 is a direct regulator of multiple tumor suppressors including TGFβ1, PTEN, p53, and fibronectin. Cancer Gene Ther. 2006;13(2):115–24.
86. Baptista D, Ferreira PG, Rocha M. Deep learning for drug response prediction in cancer. Brief Bioinform. 2021;22(1):360–79.
87. Lesko LJ, Woodcock J. Translation of pharmacogenomics and pharmacogenetics: a regulatory perspective. Nat Rev Drug Discov. 2004;3(9):763–9.
88. Roden DM. Pharmacogenomics: challenges and opportunities. Ann Intern Med. 2006;145(10):749.
89. Pang K, Wan Y-W, Choi WT, Donehower LA, Sun J, Pant D, et al. Combinatorial therapy discovery using mixed integer linear programming. Bioinformatics. 2014;30(10):1456–63.
90. Day D, Siu LL. Approaches to modernize the combination drug development paradigm. Genome Med. 2016;8(1):115.
91. White RE. High-throughput screening in drug metabolism and pharmacokinetic support of drug discovery. Annu Rev Pharmacol Toxicol. 2000;40(1):133–57.

92. Feala JD, Cortes J, Duxbury PM, Piermarocchi C, McCulloch AD, Paternostro G. Systems approaches and algorithms for discovery of combinatorial therapies. Wiley Interdiscip Rev Syst Biol Med. 2010;2(2):181–93.
93. Sun X, Bao J, You Z, Chen X, Cui J. Modeling of signaling crosstalk-mediated drug resistance and its implications on drug combination. Oncotarget. 2016;7(39):63995–4006.
94. Goswami CP, Cheng L, Alexander P, Singal A, Li L. A new drug combinatory effect prediction algorithm on the cancer cell based on gene expression and dose-response curve. CPT Pharmacometrics Syst Pharmacol. 2015;4(2):80–90.
95. Preuer K, Lewis RPI, Hochreiter S, Bender A, Bulusu KC, Klambauer G. DeepSynergy: predicting anti-cancer drug synergy with deep learning. Bioinformatics. 2018;34(9):1538–46.
96. Kalamara A, Tobalina L, Saez-Rodriguez J. How to find the right drug for each patient? advances and challenges in pharmacogenomics. Curr Opin Syst Biol. 2018;10:53–62.
97. Chiu Y-C, Chen H-IH, Zhang T, Zhang S, Gorthi A, Wang L-J, et al. Predicting drug response of tumors from integrated genomic profiles by deep neural networks. BMC Med Genom. 2019;12(51):18.
98. Wang Y, Li F, Bharathwaj M, Rosas NC, Leier A, Akutsu T, et al. DeepBL: a deep learning-based approach for in silico discovery of beta-lactamases. Brief Bioinform. 2020;7:8859.
99. Yu D, Deng L. Deep learning and its applications to signal and information processing exploratory DSP. IEEE Signal Process Mag. 2011;28(1):145–54.
100. Fukushima K, Miyake S. Neocognitron: A Self-Organizing Neural Network Model for a Mechanism of Visual Pattern Recognition. In 1982. p. 267–85.
101. Hinton GE. Reducing the dimensionality of data with neural networks. Science. 2006;313(5786):504–7.
102. Hinton GE, Osindero S, Teh Y-W. A fast learning algorithm for deep belief nets. Neural Comput. 2006;18(7):1527–54.
103. Shi L, Wang Z. Computational strategies for scalable genomics analysis. Genes (Basel). 2019;10(12):1–8.
104. Nelson D, Wang J. Introduction to artificial neural systems. Neurocomputing. 1992;4(6):328–30.
105. Lecun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proc IEEE. 1998;86(11):2278–324.
106. Zell A. Simulation Neuronaler Netze. London: Addison-Wesley; 1994. p. 73.
107. Zeng W, Wu M, Jiang R. Prediction of enhancer-promoter interactions via natural language processing. BMC Genom. 2018;19(S2):84.
108. Indolia S, Goswami AK, Mishra SP, Asopa P. Conceptual understanding of convolutional neural network-a deep learning approach. Procedia Comput Sci. 2018;132:679–88.
109. Gu J, Wang Z, Kuen J, Ma L, Shahroudy A, Shuai B, et al. Recent advances in convolutional. Neural Netw. 2015;5:71143.
110. Rawat W, Wang Z. Deep convolutional neural networks for image classification: a comprehensive review. Neural Comput. 2017;29(9):2352–449.
111. Alipanahi B, Delong A, Weirauch MT, Frey BJ. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol. 2015;33(8):831–8.
112. Zeng W, Wang Y, Jiang R. Integrating distal and proximal information to predict gene expression via a densely connected convolutional neural network. Bioinformatics. 2019;6:7110.
113. Lysenkov V. Introducing deep learning-based methods into the variant calling analysis pipeline. Science. 2019;6:7789.
114. Kelley DR, Reshef YA, Bileschi M, Belanger D, Mclean CY, Snoek J. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Science. 2018;71:739–50.
115. Pu L, Govindaraj RG, Lemoine JM, Wu H, Brylinski M. DeepDrug3D: classification of ligand-binding pockets in proteins with a convolutional neural network. PLOS Comput Biol. 2019;15(2):e1006718.
116. Gupta G, Saini S. DAVI: deep learning based tool for alignment and single nucleotide variant identification. Science. 2019;2:1–27.
117. Marhon SA, Cameron CJF, Kremer SC. Recurrent Neural Networks. In 2013. p. 29–65.
118. Yu Y, Si X, Hu C, Zhang J. A review of recurrent neural networks: LSTM cells and network architectures. Neural Comput. 2019;31(7):1235–70.
119. Trieu T, Martinez-Fundichely A, Khurana E. DeepMILO: a deep learning approach to predict the impact of non-coding sequence variants on 3D chromatin structure. Genome Biol. 2020;21(1):79.
120. Quang D, Xie X. FactorNet: A deep learning framework for predicting cell type specific transcription factor binding from nucleotide-resolution sequential data. Methods. 2019;166:40–7.
121. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9(8):1735–80.
122. Park S, Min S, Choi H-S, Yoon S. Deep Recurrent Neural Network-Based Identification of Precursor microRNAs. In: Guyon I, Luxburg U V, Bengio S, Wallach H, Fergus R, Vishwanathan S, et al., editors. Advances in Neural Information Processing Systems. Curran Associates, Inc.; 2017.
123. Quang D, Xie X. DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Res. 2016;44(11):e107–e107.
124. Grønning AGB, Doktor TK, Larsen SJ, Petersen USS, Holm LL, Bruun GH, et al. DeepCLIP: predicting the effect of mutations on protein–RNA binding with deep learning. Nucleic Acids Res. 2020;22:7449.
125. Radford A, Metz L, Chintala S. Unsupervised representation learning with deep convolutional generative adversarial networks. Science. 2015;6:7789.
126. Yang B, Liu F, Ren C, Ouyang Z, Xie Z, Bo X, et al. BiRen: predicting enhancers with a deep-learning-based model using the DNA sequence alone. Bioinformatics. 2017;33(13):1930–6.
127. Deng L, Liu Y. Deep Learning in Natural Language Processing. Singapore: Springer; 2018.
128. Schuler GD, Epstein JA, Ohkawa H, Kans JA. [10] Entrez: Molecular biology database and retrieval system. In 1996. p. 141–62.
129. Mikolov T, Chen K, Corrado G, Dean J. Efficient Estimation of Word Representations in Vector Space. 2013.
130. Du J, Jia P, Dai Y, Tao C, Zhao Z, Zhi D. Gene2vec: distributed representation of genes based on co-expression. BMC Genom. 2019;20(1):82.
131. Zhang X-M, Liang L, Liu L, Tang M-J. Graph neural networks and their current applications in bioinformatics. Front Genet. 2021;12:4799.
132. Gori M, Monfardini G, Scarselli F. A new model for learning in graph domains. In: Proceedings 2005 IEEE International Joint Conference on Neural Networks, 2005. IEEE; p. 729–34.
133. Kwon Y, Yoo J, Choi Y-S, Son W-J, Lee D, Kang S. Efficient learning of non-autoregressive graph variational autoencoders for molecular graph generation. J Cheminform. 2019;11(1):70.
134. Barabási A-L, Gulbahce N, Loscalzo J. Network medicine: a network-based approach to human disease. Nat Rev Genet. 2011;12(1):56–68.
135. Kopp W, Monti R, Tamburrini A, Ohler U, Akalin A. Deep learning for genomics using Janggu. Nat Commun. 2020;11(1):3488.
136. Avsec Ž, Kreuzhuber R, Israeli J, Xu N, Cheng J, Shrikumar A, et al. The Kipoi repository accelerates community exchange and reuse of predictive models for genomics. Nat Biotechnol. 2019;37(6):592–600.
137. Chen KM, Cofer EM, Zhou J, Troyanskaya OG. Selene: a PyTorch-based deep learning library for sequence data. Nat Methods. 2019;16(4):315–8.
138. Zhou J, Theesfeld CL, Yao K, Chen KM, Wong AK, Troyanskaya OG. Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk. Nat Genet. 2018;50(8):1171–9.
139. Budach S, Marsico A. pysster: classification of biological sequences by learning sequence and structure motifs with convolutional neural networks. Bioinformatics. 2018;34(17):3035–7.
140. Neloy AA, Alam S, Bindu RA, Moni NJ. Machine Learning based Health Prediction System using IBM Cloud as PaaS. In: 2019 3rd International Conference on Trends in Electronics and Informatics (ICOEI). IEEE; 2019. p. 444–50.
141. Ciaburro G, Ayyadevara VK, Perrier A. Hands-On Machine Learning on Google Cloud Platform: Implementing smart and efficient analytics using Cloud ML Engine. Packt Publishing; 2018. 500 p.
142. Peng L, Peng M, Liao B, Huang G, Li W, Xie D. The advances and challenges of deep learning application in biological big data processing. Curr Bioinform. 2018;13(4):352–9.
143. Carneiro T, Da Medeiros NRV, Nepomuceno T, Bian G-B, De Albuquerque VHC, Filho PPR. Performance analysis of google colaboratory as a tool for accelerating deep learning applications. IEEE Access. 2018;6:61677–85.

144. Bisong E. Google Colaboratory. In: Building Machine Learning and Deep Learning Models on Google Cloud Platform. Berkeley: Apress; 2019. p. 59–64.
145. Luo R, Sedlazeck FJ, Lam TW, Schatz MC. A multi-task convolutional deep neural network for variant calling in single molecule sequencing. Nat Commun. 2019;10(1):1–11.
146. Ravasio V, Ritelli M, Legati A, Giacopuzzi E. GARFIELD-NGS: genomic vARiants fIltering by dEep learning moDels in NGS. Bioinformatics. 2018;34(17):3038–40.
147. Singh A, Bhatia P. Intelli-NGS: Intelligent NGS, a deep neural network-based artificial intelligence to delineate good and bad variant calls from IonTorrent sequencer data. bioRxiv. 2019;2019:879403.
148. Hsieh T-C, Mensah MA, Pantel JT, Aguilar D, Bar O, Bayat A, et al. PEDIA: prioritization of exome data by image analysis. Genet Med. 2019;21(12):2807–14.
149. Gurovich Y, Hanani Y, Bar O, Nadav G, Fleischer N, Gelbman D, et al. Identifying facial phenotypes of genetic disorders using deep learning. Nat Med. 2019;25(1):60–4.
150. Park S, Min S, Choi H, Yoon S. deepMiRGene: deep neural network based precursor microRNA prediction. Science. 2016;71:89968.
151. Kelley DR, Snoek J, Rinn JL. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 2016;26(7):990–9.
152. Quang D, Xie X. DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Res. 2016;44(11):e107–e107.
153. Singh S, Yang Y, Póczos B, Ma J. Predicting enhancer-promoter interaction from genomic sequence with deep neural networks. Quant Biol. 2019;7(2):122–37.
154. Zeng W, Wu M, Jiang R. Prediction of enhancer-promoter interactions via natural language processing. BMC Genom. 2018;19(S2):84.
155. Chen Y, Li Y, Narayan R, Subramanian A, Xie X. Gene expression inference with deep learning. Bioinformatics. 2016;32(12):1832–9.
156. Zeng W, Wang Y, Jiang R. Integrating distal and proximal information to predict gene expression via a densely connected convolutional neural network. Bioinformatics. 2019;2:7889.
157. Kalkatawi M, Magana-Mora A, Jankovic B, Bajic VB. DeepGSR: an optimized deep-learning structure for the recognition of genomic signals and regions. Bioinformatics. 2019;35(7):1125–32.
158. Zuallaert J, Godin F, Kim M, Soete A, Saeys Y, De Neve W. SpliceRover: interpretable convolutional neural networks for improved splice site prediction. Bioinformatics. 2018;34(24):4180–8.
159. Yang B, Liu F, Ren C, Ouyang Z, Xie Z, Bo X, et al. BiRen: predicting enhancers with a deep-learning-based model using the DNA sequence alone. Bioinformatics. 2017;33(13):1930–6.
160. Paggi JM, Bejerano G. A sequence-based, deep learning model accurately predicts RNA splicing branchpoints. RNA. 2018;24(12):1647–58.
161. Almagro AJJ, Sønderby CK, Sønderby SK, Nielsen H, Winther O. DeepLoc: prediction of protein subcellular localization using deep learning. Bioinformatics. 2017;33(21):3387–95.
162. Singh J, Hanson J, Paliwal K, Zhou Y. RNA secondary structure prediction using an ensemble of two-dimensional deep neural networks and transfer learning. Nat Commun. 2019;10(1):5407.
163. Grønning AGB, Doktor TK, Larsen SJ, Petersen USS, Holm LL, Bruun GH, et al. DeepCLIP: predicting the effect of mutations on protein–RNA binding with deep learning. Nucleic Acids Res. 2020;5:9956.
164. Singh R, Lanchantin J, Robins G, Qi Y. DeepChrome: deep-learning for predicting gene expression from histone modifications. Bioinformatics. 2016;32(17):i639–48.
165. Zhou J, Troyanskaya OG. Predicting effects of noncoding variants with deep learning–based sequence model. Nat Methods. 2015;12(10):931–4.
166. Lanchantin J, Singh R, Lin Z, Qi Y. Deep Motif: visualizing genomic sequence classifications. Science. 2016;78:1–5.
167. Li W, Wong WH, Jiang R. DeepTACT: predicting 3D chromatin contacts via bootstrapping deep learning. Nucleic Acids Res. 2019;47(10):e60–e60.
168. Xie L, He S, Song X, Bo X, Zhang Z. Deep learning-based transcriptome data classification for drug-target interaction prediction. BMC Genom. 2018;19(S7):667.
169. Kohut K, Limb S, Crawford G. The changing role of the genetic counsellor in the genomics Era. Curr Genet Med Rep. 2019;7(2):75–84.
170. Zeng W, Wu M, Jiang R. Prediction of enhancer-promoter interactions via natural language processing. BMC Genom. 2018;19(S2):84.
171. Guenther FH. Neural Networks: Biological Models and Applications. In: Smelser NJ, Baltes PB, editors. Oxford: International Encyclopedia of the Social & Behavioral Sciences; 2001. p. 10534–7.
172. Eskiizmililer S. An intelligent Karyotyping architecture based on Artificial Neural Networks and features obtained by automated image analysis. 1993.
173. Catic A, Gurbeta L, Kurtovic-Kozaric A, Mehmedbasic S, Badnjevic A. Application of neural networks for classification of patau, edwards, down, turner and klinefelter syndrome based on first trimester maternal serum screening data, ultrasonographic findings and patient demographics. BMC Med Genom. 2018;11(1):19.
174. Fukushima K. Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol Cybern. 1980;36(4):193–202.
175. Sakellaropoulos T, Vougas K, Narang S, Koinis F, Kotsinas A, Polyzos A, et al. A deep learning framework for predicting response to therapy in cancer. Cell Rep. 2019;29(11):3367-3373.e4.
176. Kalinin AA, Higgins GA, Reamaroon N, Soroushmehr S, Allyn-Feuer A, Dinov ID, et al. Deep learning in pharmacogenomics: from gene regulation to patient stratification. Pharmacogenomics. 2018;19(7):629–50.
177. Rumelhart DE, Hinton GE, Williams RJ. Learning representations by back-propagating errors. Nature. 1986;323(6088):533–6.
178. Shen Z, Bao W, Huang D-S. Recurrent neural network for predicting transcription factor binding sites. Sci Rep. 2018;8(1):15270.
179. Schuster M, Paliwal KK. Bidirectional recurrent neural networks. IEEE Trans Signal Process. 1997;45(11):2673–81.
180. Almagro Armenteros JJ, Sønderby CK, Sønderby SK, Nielsen H, Winther O. DeepLoc: prediction of protein subcellular localization using deep learning. Bioinformatics. 2017;33(21):3387–95.
181. Chung J, Gulcehre C, Cho K, Bengio Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. Science. 2014;7:44598.

Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.