The R Language An Engine For Bioinformatics and Data Science
Review
The R Language: An Engine for Bioinformatics and
Data Science
Federico M. Giorgi 1, * , Carmine Ceraolo 1,2 and Daniele Mercatelli 1
Abstract: The R programming language is approaching its 30th birthday, and in the last three decades
it has achieved a prominent role in statistics, bioinformatics, and data science in general. It currently
ranks among the top 10 most popular languages worldwide, and its community has produced tens of
thousands of extensions and packages, with scopes ranging from machine learning to transcriptome
data analysis. In this review, we provide an historical chronicle of how R became what it is today,
describing all its current features and capabilities. We also illustrate the major tools of R, such as
the current R editors and integrated development environments (IDEs), the R Shiny web server,
the R methods for machine learning, and its relationship with other programming languages. We
also discuss the role of R in science in general as a driver for reproducibility. Overall, we hope to
provide both a complete snapshot of R today and a practical compendium of the major features and
applications of this programming language.
required a more interactive environment, and the possibility to explore the data being
analyzed [2]. Thus, between 1975 and 1976 (Figure 1), the Bell Labs statistical research teams
(including John Chambers, Richard Becker and Allan Wilks) developed the S language
(S being an informal name indicating “Statistics”), with the expressed aim, stated verbatim
by Chambers, “to turn ideas into software, quickly and faithfully” [3]. The S language
increased in popularity, scope, and efficiency and, by 1988, it was considered a stable and
fully fledged programming language and interactive environment for statistical analysis,
with the publication of a reference book by Chambers, “The New S Language” [4]. Many
of the core functions of the current R were already present in the 1988 S version, and the R
documentation still cites “The New S Language” as the reference for base functions such
as mean (to calculate the arithmetic mean of a vector of numbers) and rnorm (to generate
random numbers drawn from a normal distribution).
In June 1997, the R community gave itself an official leadership in the form of the
R core team, constituted by Ross Ihaka, Robert Gentleman, John Chambers (the original
author of S), Martin Maechler and other “wise people” bestowed with write access to the R
source code [12]. As a final step to building solid foundations for R, the core team launched
the official R website (https://r-project.org, accessed on 21 April 2022) in 1999 [13].
in 2002. The Bioconductor project now collects thousands of packages and is the de facto
main aggregator of computational tools to analyze biological data, especially quantitative
omics data (more on Bioconductor in the following paragraphs) [23].
Figure 2. (A) Box plots drawn using the default R boxplot() function in original R (left), R since
4.0.0 (middle) and ggplot2 (right). (B) Density distribution plots for three distributions, combined
with the results of the Shapiro–Wilk test. (C) Default R boxplot comparing two distributions and
providing the output p-value of the Student’s t-test. (D) Scatter plot indicating the co-expression
of two genes, and the Pearson’s correlation coefficient of the joint distribution. (E) Example of
overlapping different plot types in R: box plot, beeswarm plot and violin plot (BBV Plot) for three
numeric distributions (called Gene 1, Gene 2 and Gene 3).
ggplot2 and other packages also authored by Wickham, such as dplyr for data manipulation and
tibble for data storage, were united to form the tidyverse on 15 September
2016, a set of packages that collectively re-imagine the data flow operations in R, introducing
the UNIX concept of piping in the form of the “%>%” operator [26].
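As a minimal sketch of the piping idea, the example below contrasts a nested call with its piped equivalent. It deliberately uses the native |> operator, available in base R since version 4.1.0, as a stand-in so that it runs without installing any package; with the tidyverse loaded, %>% can be written in the same position. The vector x is an arbitrary illustration.

```r
# Arbitrary numeric vector used only for illustration
x <- c(1, 4, 9, 16)

# Nested form: must be read from the innermost call outwards
nested <- round(mean(sqrt(x)), 1)

# Piped form: each step passes its result to the next one, left to right
# (native pipe |>, base R >= 4.1.0; a stand-in for the tidyverse %>%)
piped <- x |> sqrt() |> mean() |> round(1)

# Both forms compute the same value
identical(nested, piped)
```

The piped form lists the operations in the order in which they are applied, which is the main readability argument for this style.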
3. The R Repositories
The R interpreter and console are provided as source code or executables for major
operating systems from the CRAN website at https://cran.r-project.org/ (accessed on
21 April 2022). The basic R contains functions to perform all major statistical tests, plotting,
and matrix operations, and, as of version 4.1.3, it is provided as a combination of 14 different
core packages: base, compiler, datasets, grDevices, graphics, grid, methods, parallel, splines, stats,
stats4, tcltk, tools and utils [8].
The 14 core packages, of course, do not cover the large world of R, which now spans
from bioinformatics to web development. Thus, users have the possibility to install and
load additional packages, or libraries, both from custom and official sources. Currently,
there are three major repositories for additional R packages: CRAN, Bioconductor and
R-Forge. These three require minimum standards of quality and active maintenance, and
perform rigorous testing for all packages worthy of inclusion. However, many useful
packages exist outside the three main resources, often in general code repositories such
as SourceForge and GitHub. Once installed, a package can then be loaded with the
library() command, which will also load all its dependencies.
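The install-then-load pattern can be sketched as follows. The example uses splines, one of the core packages listed above, so that nothing needs to be downloaded; the variable names pkg and is_attached are illustrative only.

```r
# "splines" ships with R itself, so the install branch is not expected to run here
pkg <- "splines"

# requireNamespace() reports whether a package is installed without attaching it
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)  # only reached for packages that are not yet installed
}

# Attach the package; character.only = TRUE because the name is stored in a variable
library(pkg, character.only = TRUE)

# The attached package now appears on the search path
is_attached <- paste0("package:", pkg) %in% search()
is_attached
```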
3.1. CRAN
At the date of writing (21 April 2022), the CRAN repository hosts 19,001 packages
(source: https://cran.r-project.org/web/packages/), covering a large scope of applications.
These packages extend the statistical capabilities of R, in addition to implementing novel
graphical and technical methods, providing R with extended capabilities, e.g., for high-
performance and parallel computing [33], in addition to the aforementioned shiny, tidyverse
and Rmarkdown extensions, amongst many others. The process of having a package
submitted, screened and ultimately accepted by CRAN may require several months for
a beginner and requires the user to be fully aware of the CRAN repository policy (available
at https://cran.r-project.org/web/packages/policies.html). Ultimately, packages in CRAN
have the encouraging tendency to be well written and documented, providing cutting-edge
and efficient methods for contemporary statistical analysis. Prior to submission to CRAN,
an author should locally test their package with the following command, which automatically
detects code inconsistencies, both fundamental and in terms of style:
library(devtools) # assumption: check() is the function provided by the devtools package
check(args = c("--as-cran"))
Because CRAN is intrinsically tied to R, packages from this repository are the
easiest to install, via the install.packages() function. For example, to
install the GeneNet package to infer gene coexpression via partial correlation [34], the user
should simply type:
install.packages("GeneNet")
This function will also install dependencies, in binary format for operating systems
such as Windows or MacOSX, or by compiling them in Linux systems. CRAN is also
excellently integrated with RStudio, which can check for missing libraries and have the
user install them with a simple click.
Although the install.packages() function is, by default, set up to search and work
on CRAN only, it is possible to set it to install packages from the other two repositories. For
example, the function setRepositories() will allow the user to set further locations for
the installation process to look for. Currently (R version 4.1.3), packages from all three major
repositories (CRAN, Bioconductor and R-Forge) can be installed this way.
One of the distinctive features of CRAN is its package checking system, originally
developed by Kurt Hornik [35], which implements simple text-based metadata and
a clear hierarchical system, and which has allowed, for decades, the maintenance of a healthy
and consistent repository. CRAN packages are subject to continuous testing on multiple
platforms and operating systems.
3.2. Bioconductor
Bioconductor is the second largest R package repository, hosting at this date (21 April 2022)
3422 packages (2083 software packages and 1339 data packages) (source:
https://www.bioconductor.org/packages/release/BiocViews.html). Unlike CRAN, Bioconductor has
a specific package focus, which revolves around bioinformatics and, in general, methods,
tools and data associated with biological studies. The process of having a package accepted
by Bioconductor can be even longer than that of CRAN, following even stricter rules
(including a maximum line width of 80 characters). The following function, from the
BiocCheck package, performs automatic checks on a package for Bioconductor rule compliance:
BiocCheck()
Bioconductor contains extremely useful tools for dealing with biological data, from
differential gene expression analysis (e.g., DESeq2 [36] and limma [37]) to genome analysis
(e.g., GenomicRanges [38]). The Bioconductor code requires new packages to take advantage
of the existing object classes, to ensure that authors do not have to constantly reinvent the
wheel to represent, e.g., a transcriptomics dataset or a genomic region.
Bioconductor hosts some of the most downloaded bioinformatics tools in the world
(source: https://bioconductor.org/packages/stats/, accessed on 21 April 2022), but not
all biocomputational R packages are hosted here; examples are corto for gene network
reverse-engineering [39] and Seurat for single cell RNA-Seq analysis [40], which are hosted
in the more general CRAN.
Bioconductor packages are not released in a continuous way, like those of CRAN, but
in periodic versions of Bioconductor itself, which follow the release calendar of the main R
full releases.
3.3. R-Forge
R-Forge, which hosted at the time of writing (21 April 2022) 2146 packages [41], is
a collaborative R repository focused on developing packages, and providing additional
tools for bug tracking, versioning and branching. It contains several unpublished prototype
R packages, in addition to pre-release versions of CRAN libraries. The role of R-Forge
is, as the name implies, to create and test novelties in R, with the help of a vibrant user
community and before the strict requirements of the other two official repositories.
3.4. GitHub
Recently, part of the immense work performed in developing R code has moved
onto the general environment of GitHub, arguably the most popular versioning and
development platform on the Internet. At the date of writing (21 April 2022), GitHub
contains 34,268 active R projects [42], in several states of maturity. Although this may be
frowned upon by the core R community, many scientific tools in R have been published
without being on CRAN or Bioconductor [43,44]. Originally, packages available on GitHub could
only be downloaded and freely explored, but more recently a CRAN package, devtools, has provided
a convenient function to install them directly from the GitHub repository. The code to
install the svpluscnv package for analyzing somatic copy number alteration events in cancer [43]
is as follows:
library(devtools)
install_github("gonzolgarcia/svpluscnv")
GitHub (and other unofficial online repositories for R code, such as SourceForge)
allows authors to quickly share their code without the stricter coding rules of CRAN and
Bioconductor, but does not guarantee the validity, usability, execution and long-term
maintenance of any of the code it provides.
4. Practical R
The practicality of R makes it an ideal first programming language to learn, since it is
possible for beginners to obtain an immediate visualization of their own data following
simple steps. Conceptually, the functionality of R can be divided into three classes, which
taken together can summarize its role in bioinformatics, statistics, and data science in
general: data interaction, analysis, and results visualization.
The following paragraphs provide some basic functions to visualize the core potential
of R. The code can be directly executed on a standard R console (version 4.1.3 at the time of
writing), available at the official R website https://cran.r-project.org/.
The following subparagraphs rely on small artificial datasets that are available as
Supplementary File S1.
load("ab.rda")
The RDS file format uses the same archiving algorithm (reducing the effective size of
the data on disk) as that of RDA, but it is commonly used to store an individual object, and
then, upon loading, assign it to a specific object name, e.g.,
saveRDS(example,file="example.rds")
example<-readRDS("example.rds")
R provides convenient functions to quickly visualize the content of any object; for
example, the dim() function provides the size of the object:
dim(example)
In our case, this is a matrix with 1000 rows and 3 columns.
Another convenient function is head(), which visualizes the beginning of any object.
In our case, R visualizes that the columns correspond to three variables, g1, g2, g3,
representing artificially generated gene expression values across 1000 samples (rows).
head(example)
g1 g2 g3
Sample_1 5.425938 6.827846 0.7478255
Sample_2 6.152623 7.804399 2.1705011
Sample_3 4.990498 6.589876 2.7549418
Sample_4 5.801309 8.791750 6.2522507
Sample_5 4.658917 5.013827 9.6978252
Sample_6 4.833635 5.777115 5.5270391
Both RDA and RDS formats are effective for fast transfer of files between collaborating
R users and as conveniently accessible storages for R data. As an example, we show how
a fairly large numeric matrix (with 200 rows and 50,000 columns) can be saved in RDA or
RDS formats, reducing by an order of magnitude both the disk size and time required for
input/output operations (Supplementary Figure S2).
The choice of RDS or RDA format is ultimately a matter of user taste and context
appropriateness. Technically, RDA can be advantageous to save multiple objects, and, in
fact, saving two objects within a single RDA object saves a few bytes over saving them as
two separate RDS files.
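The practical difference between the two formats can be sketched as follows: save() and load() store and restore several objects under their original names, whereas saveRDS() and readRDS() handle one object that can be assigned to any name. The objects a and b, the temporary file names and the environment restored are hypothetical and used only for illustration.

```r
# Two small illustrative objects (hypothetical names and contents)
a <- matrix(rnorm(20), nrow = 4)
b <- letters[1:5]

rda_file <- tempfile(fileext = ".rda")
rds_file <- tempfile(fileext = ".rds")

# RDA: several objects in one file; load() restores them under their original names
save(a, b, file = rda_file)
restored <- new.env()
loaded_names <- load(rda_file, envir = restored)

# RDS: a single object; readRDS() returns it, so any name can be assigned
saveRDS(a, file = rds_file)
a_copy <- readRDS(rds_file)

loaded_names                    # names of the objects stored in the RDA file
identical(a, a_copy)            # the round trip preserves the object
file.size(rda_file, rds_file)   # sizes on disk, in bytes
```

Loading into a separate environment, as done here, avoids silently overwriting objects of the same name in the current session, which is a known pitfall of load().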
In this particular case, individual columns (corresponding to artificial genes g1, g2 and
g3) can be extracted and saved in three different objects for further analysis:
g1<-example[,"g1"]
g2<-example[,"g2"]
g3<-example[,"g3"]
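The supplementary file is not reproduced here. For readers who wish to run these commands anyway, a stand-in matrix can be simulated. The distribution parameters and the seed below are assumptions chosen only to resemble the printed output (two bell-shaped genes and one flat one); they are not the authors' actual data.

```r
set.seed(42)  # arbitrary seed, for reproducibility only
n <- 1000

# Simulated stand-in for the "example" matrix (all parameters are assumptions)
example <- cbind(
  g1 = rnorm(n, mean = 5, sd = 1),       # bell-shaped
  g2 = rnorm(n, mean = 6.5, sd = 1.5),   # bell-shaped
  g3 = runif(n, min = 0, max = 10)       # flat, clearly non-normal
)
rownames(example) <- paste0("Sample_", seq_len(n))

dim(example)       # number of rows and columns
head(example, 3)   # first three samples

# Columns are extracted by name, as in the text
g1 <- example[, "g1"]
```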
4.2. Analysis
Even at a basic level, R contains dozens of functions to perform statistical analysis to
explore and extract information from data. The function summary(), for example, can
provide a general overview of a numeric distribution, including minimum and maximum
values, quartiles, median and mean:
summary(g1)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.221 4.349 4.996 5.016 5.682 8.756
In R, it is also possible to perform statistical tests with core functions; for example, the
Shapiro–Wilk test for normality [46]. In this case, the test does not reject normality for the
first two distributions (g1 and g2), whereas it rejects it for the third one (g3).
shapiro.test(g1)
p-value = 0.1994
shapiro.test(g2)
p-value = 0.6653
shapiro.test(g3)
p-value < 2.2e-16
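shapiro.test() returns a list-like object, so the p-value can be extracted programmatically rather than read from the console. The sketch below applies the test to each column of a simulated matrix; the data, the seed and the object names sim and pvals are made up for illustration.

```r
set.seed(1)  # arbitrary seed
sim <- cbind(
  normal_like = rnorm(500),   # drawn from a normal distribution
  flat        = runif(500)    # drawn from a uniform distribution
)

# Apply the Shapiro-Wilk test to every column and keep only the p-value
pvals <- apply(sim, 2, function(v) shapiro.test(v)$p.value)
signif(pvals, 3)

# A small p-value is evidence against normality; a large one only means
# that normality could not be rejected
pvals < 0.05
```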
Life 2022, 12, 648 10 of 25
4.3. Visualization
Although powerful, text-based results are harder to interpret than a visualization of the data
itself, or of the test being performed; for example, although the Shapiro–Wilk test provides
a mathematical assessment of the normality of a distribution, it is always better to also visualize
the distribution itself. The following code allows the user to show the density distribution
of g1, paired with the result of the Shapiro–Wilk test:
plot(density(g1),lwd=2)
p<-signif(shapiro.test(g1)$p.value,3)
mtext(paste0("Shapiro test p-value = ",p))
The output of this code concept is shown in Figure 2B. By combining statistical
methods and graphical visualizations, R can quickly show, in this example, that the first
two distributions are normal-like, whereas the third distribution is not. The intrinsic power
and success element of R is to provide this combination at all levels, from these simple
tests to highly complex calculations and datasets, such as UMAP performed in single
cell analysis [50].
In a similar fashion, the difference between distributions g1 and g2 can be assessed via
combining graphical representations of the two with boxplots, and t-test results (Figure 2C):
boxplot(g1,g2,names=c("Gene 1","Gene 2"))
p<-signif(t.test(g1,g2)$p.value,3)
mtext(paste0("T-test p-value = ",p))
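One detail is worth knowing when reporting such p-values: by default, t.test() performs Welch's variant of the test, which does not assume equal variances, whereas the classical Student's test is obtained by setting var.equal = TRUE. A sketch on simulated vectors follows (a, b and the seed are illustrative, not the g1 and g2 of the text).

```r
set.seed(7)  # arbitrary seed
a <- rnorm(100, mean = 5, sd = 1)
b <- rnorm(100, mean = 6, sd = 2)

welch   <- t.test(a, b)                    # default: Welch's t-test
student <- t.test(a, b, var.equal = TRUE)  # classical Student's t-test

# The method actually used is recorded in the returned object
trimws(welch$method)
trimws(student$method)

# The two p-values generally differ when the variances are unequal
signif(c(welch = welch$p.value, student = student$p.value), 3)
```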
Another example of combining analytical and graphical statistics with R is provided
by the correlation tests. The following lines of code generate a scatter plot with the function
plot() and overlay the results of the correlation test (Figure 2D). The plot() function itself
can be enriched with several parameters, such as pch to control the point shape, or col to
control the color of the points:
plot(g1,g2,pch=20,col="cornflowerblue")
r<-signif(cor(g1,g2),3) # named r to avoid masking the cor() function
mtext(paste0("Pearson’s Correlation Coefficient = ",r))
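cor() returns only the coefficient. When a significance test is also wanted, base R offers cor.test(), which returns the estimate together with a p-value and a confidence interval. A sketch on simulated vectors follows (x, y and the seed are illustrative assumptions).

```r
set.seed(3)  # arbitrary seed
x <- rnorm(200)
y <- x + rnorm(200, sd = 0.5)   # y is constructed to correlate with x

# Test of the null hypothesis of zero correlation
ct <- cor.test(x, y, method = "pearson")

r <- signif(unname(ct$estimate), 3)   # Pearson's correlation coefficient
p <- signif(ct$p.value, 3)            # p-value of the test

# Scatter plot annotated with both values
plot(x, y, pch = 20, col = "cornflowerblue")
mtext(paste0("Pearson's r = ", r, " (p = ", p, ")"))
```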
Finally, one of the exceptional features of R is that it can overlay not only text (in
the previous examples, the outputs of statistical tests) on graphics, but also several
graphical layers. In the following example, the three distributions are shown using three
tracks: standard box plots [51], beeswarm plots [52] and violin plots [53] in order to fully
assess the features of the distributions. The beeswarm() and vioplot() functions require
the homonymous CRAN packages. This particular instance of overlapping box plots,
beeswarm plots and violin plots is colloquially referred to as “BBV plot” (Figure 2E).
library(vioplot)
library(beeswarm)
boxplot(g1,g2,g3)
beeswarm(list(g1,g2,g3),add=TRUE)
vioplot(list(g1,g2,g3),add=TRUE)
Figure 3. (A) Concept diagram of how Rmarkdown can merge text and code blocks to create
documents. (B) Example of an R IDE, RStudio, showing multiple elements to assist the R
programmer. (C) Worldwide popularity of search terms “R”, “Python” and “Perl” in the years
2004–2021 (source: Google Trends). The topic of the search term is limited to “Programming
Language”.
---
title: "Rmarkdown example"
author: "Federico M. Giorgi"
date: "12/6/2021"
output: html_document
---
Following the header, text blocks and code blocks can be intermixed. The following is
a simple text block including a level 2 header (##) followed by normal text with optional
formatting (single asterisks render the enclosed text in italics).
## Example of Text Block
Descriptive text with *optional* formatting
Complete guides on this language and syntax are available on the official Rmarkdown
website, hosted by RStudio (https://rmarkdown.rstudio.com/, accessed on 21 April 2022).
Code chunks can be included using triple back quotes (```), specifying the language
(by default r) and including the code itself. The code will run during the generation
of the document, and, if present, graphics will be included in the output. The following
example shows a comment line and the command to graphically print 1000 values randomly
generated from a normal distribution:
```{r}
# Example of code block
plot(rnorm(1000))
```
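To make the pieces above concrete, the following sketch writes a complete minimal Rmarkdown file from R and, only if the rmarkdown package and a pandoc installation are available, renders it to HTML. The file location and the object names (fence, rmd_lines, rmd_file, html_file) are illustrative assumptions; rmarkdown::render() and rmarkdown::pandoc_available() are functions of the rmarkdown package.

```r
fence <- strrep("`", 3)   # the triple back quotes that delimit a code chunk

# Header, one text block and one code chunk, as described in the text
rmd_lines <- c(
  "---",
  'title: "Rmarkdown example"',
  "output: html_document",
  "---",
  "",
  "## Example of Text Block",
  "Descriptive text with *optional* formatting",
  "",
  paste0(fence, "{r}"),
  "# Example of code block",
  "plot(rnorm(1000))",
  fence
)

# Write the document to a temporary file (illustrative location)
rmd_file <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_file)

# Rendering is optional: it needs the rmarkdown package and a pandoc installation
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  html_file <- rmarkdown::render(rmd_file, quiet = TRUE)
  html_file   # path of the generated HTML report
}
```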
Rmarkdown is widely used to produce reports based on standardized pipelines,
which would be repetitive if rewritten manually each time, both in the code parts and in
the visualization of the results. A common example is the analysis of RNA sequencing
experiments that share similar quality checks, data visualizations, differential expression
analysis, pathway enrichment analysis, et cetera. Rmarkdown allows the creation of ready-
made presentations for each new dataset, with minimal effort often limited to changing the
data input parts and the supervision of the final results.
This efficient and widely adopted system coincides with the ever-growing call for data sharing
and reproducibility in data science and computational biology [56]. Rmarkdown is a system
that allows authors to share not only raw data, but also the fully reproducible pipelines that
transformed the uninterpretable complexity of large datasets into processed tables, reports
and figures. Sharing the code and the data is an excellent academic practice, necessary for
scientific transparency and for speeding up the sharing of knowledge. The existence of
Rmarkdown allows scientists to share their work beyond the interpreted results and in the
most usable form.
6.1. IDEs/GUIs
Since graphics is one of the main pillars of R [57], IDEs are naturally well received
and appreciated amongst R programmers, because they allow, in the same window, the
combination of both the code and the generated graphical output. Below, we list a selection
of the most used R IDEs at the time of writing (April 2022).
6.1.1. RStudio
RStudio is one of the most popular R development environments, especially amongst
younger programmers [58]. Launched on 2011-02-28, RStudio started as an R-specific open-
source IDE, written predominantly in C++ and Java (https://github.com/rstudio/rstudio,
accessed on 21 April 2022). Subsequently, it has evolved into a bilingual environment,
focusing on R and Python. The application is available as desktop (RStudio Desktop) or
Linux server-based (RStudio Server) versions; for both, free and commercial editions exist.
It runs on all major platforms, with the same interface. It combines a console, a
syntax-highlighting editor with tab completion, graphics, history and help, into a single
workbench. It supports reproducible research and literate programming by mixing code
and text documentation. Accordingly, RStudio natively supports an interface with Rmarkdown
(see previous paragraph), so that results can be rendered and communicated as
HTML, PDF or Word reports. Moreover, RStudio supports project sharing by interfacing
with GitHub and other versioning systems.
6.1.3. RKWard
Launched in 2002, RKWard was one of the first R-specific editors [59], written in C++
for the KDE (https://kde.org/, accessed on 21 April 2022) Linux environment.
Currently, it can run on Windows, macOS, and Linux operating systems. The project was
conceived by Thomas Friedrichsmeier to fulfill the needs of both proficient and neophyte R
users, who can leverage the RKWard task-oriented GUI dialogues, including a spreadsheet-like
data editor, workspace browser, code editor, help pages and plot preview. R package
management is available to easily handle and manipulate R packages. Additional plugins
can enhance RKWard’s features. This software may be a great choice for users only working
in R.
6.2.1. Vi/Vim
Vi is the visual version of the ‘ex’ command-line editor. It was originally written in 1976 by
Bill Joy, then a graduate student at Berkeley, and officially published with the first BSD
Unix release, in March 1978. About 24% of respondents to Stack Overflow’s 2021 developer
survey stated that they used vi regularly [65]. This editor operates in “Visual” or command
mode, which renders the text being edited in a terminal, and in “Insert” mode, entered by
typing “i”, in which typed text is added to the document. A “Vi Improved” version, Vim, was
released in 1991, and included more features, such as syntax highlighting, mouse support, and new
editing commands. Since 2000, Vim has been included with almost every Linux distribution,
and GUI versions exist for other operating systems, such as gVim for Linux, and MacVim
for macOS. Many plugins improve the Vim user experience for R programmers, such as
Nvim-R-Vim [66] for integrating Vim with the R console.
6.2.4. Notepad++
Another simple and lightweight code-agnostic text editor is Notepad++, a completely
free alternative to Sublime Text for editing R scripts. It was released by Don
Ho in 2003 and is only available in Windows environments (analogous editors in MacOSX
would be TextWrangler or BBEdit). The Notepad++ editor features prepackaged extensions
supporting about 80 programming languages; amongst these, the NppToR plugin
(https://github.com/halpo/NppToR, accessed on 21 April 2022) provides R language
syntax highlighting, folding and auto-completion, on top of the possibility to communicate
directly with the R Windows console.
Table 2. Most popular programming languages according to the 2021 PYPL index.
Among the top languages shown in Table 2, the three most commonly used for
bioinformatics and data science tasks are undoubtedly Python, R and MATLAB.
MATLAB, although computationally and algorithmically efficient, is not freely available
and, as such, is more difficult to access by the wider public. On the other hand, Python
and R are both freely accessible languages, with Python being the most popular language
in the world due to its versatility beyond data analysis (e.g., in database management
and website development). Historically, Python and R have grown at the expense of the
previous “king” of bioinformatics languages, Perl, which has gradually disappeared
since 2005 from Google searches and, in general, from the bioinformatics and data science
world (Figure 3C).
R maintains, and probably will maintain, a niche but dominating presence in the field
of bioinformatics, providing ready-made packages and functions, alongside a vibrant and
helpful community, for most common pipelines in computational biology, from pathway
analysis [76] to the conversion of gene identifiers [77].
Performance-wise, R has recently achieved excellent benchmarks in the speed of
common matrix operations, when compared to Python [78]. This is shown by the functionality
provided by the data.table and dplyr packages, and in general by the more recent
tidyverse collection.
ten by Max Kuhn, provides a common interface for many of these methods, acting as
an umbrella system to run most machine learning tasks in R with common functions
(https://topepo.github.io/caret/, accessed on 21 April 2022). More specifically, caret provides
functions to perform data splitting and preprocessing, feature selection, model tuning
and training, variable importance estimation and model testing. Virtually all machine learning
algorithms written for R can still be used as standalone packages or via caret, which
imports them as needed. caret divides methods into regression methods (to predict continuous
numerical variables) and classification methods (to predict categorical variables), and
continuously adds new methods as they are created (for a full list of the models supported
by caret, see the caret documentation at http://topepo.github.io/caret/available-models.html,
accessed on 21 April 2022). The following code snippet shows how the model training is
performed via the R caret package. The object “input” contains an input training data
frame with variables as columns and observations as rows. The vectors “predictors” and
“outcome” contain the names of the variables to be used as predictors and to be predicted,
respectively. The object “trainmethod” specifies how to perform the training (e.g., by leave-one-out
or by cross-validation). The “metric” parameter specifies the performance measure used to
select the model (in this classification case, the area under the Receiver Operating
Characteristic curve, ROC). Finally, the variable “mymethod” can be changed according to the
user’s choice, and in this case is set to “gbm”, a gradient boosting model method very popular
in regression and classification analyses [82,83].
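library(caret)
mymethod<-"gbm"
# Example setting (an assumption, not from the text): 5-fold cross-validation;
# class probabilities and twoClassSummary are needed for metric="ROC"
trainmethod<-trainControl(method="cv",number=5,
classProbs=TRUE,summaryFunction=twoClassSummary)
model<-train(data.matrix(input[,predictors]),input[,outcome],
method=mymethod,
trControl=trainmethod,
metric="ROC"
)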
More recently, Max Kuhn and RStudio developed a caret version running under the tidyverse,
called tidymodels, which covers most of caret functionality with classic tidyverse syntax [84].
Beyond well implemented and readily usable methods, a language for machine learning
needs an efficient system for dealing with large datasets. Since at least version 2.0.0,
R has been designed to allow the treatment of big data, which are the fuel of artificial
intelligence besides the algorithms themselves. This was undertaken via the implementation
of lazy loading (see before) to allow datasets to be accessed without being fully loaded
in memory, in addition to other methods to operate on sparse data in a highly optimized
framework, such as the data.table package [26]. R can also recover from insufficient memory
errors by notifying the user with the infamous “Cannot allocate a vector of size x” error; such
errors do not shut down the R console, allowing the user to recover the current session.
Modern machine learning systems and libraries for applications such as computer
vision and language understanding have gained significant popularity among data scientists,
and have virtually all been ported to R. For example, Keras, a high-level neural
network library, is currently available for R users through at least two different packages:
the kerasR package, authored and created by Taylor Arnold, and RStudio’s keras package.
Both packages provide an R interface to the native Python deep learning code, providing
a flexible interface for specifying machine learning models. Another popular Python-based
machine learning library, Scikit-learn, can be used on R via the reticulate package, which
provides a comprehensive set of tools for interoperability between Python and R, including
the possibility to convert R objects into Python objects and vice versa.
Beyond caret, other method-aggregating packages are currently available to run machine
learning code in R. The h2o package, for example, written by Erin LeDell, provides
an interface for the very popular H2O machine learning platform, including a plethora of
methods from generalized linear models to deep neural networks, and a very convenient
automated algorithm called H2O AutoML [85]. Another popular package to perform
various machine learning tasks is mlr3, written by Michel Lang; mlr3 has been recently
rewritten in a fully object-oriented fashion, taking advantage of both data.table and the new
lightweight R6 classes [86].
A few other R packages available to perform dedicated machine learning tasks are also
worth mentioning. The library MASS, for example, originally developed for the S language,
is still widely used to perform statistical learning tasks such as linear discriminant analysis
and quadratic discriminant analysis [87]. Another package is prophet, written by Sean
Taylor, whose name aptly describes its main task: the prediction of future trends based on
time series data [88]; the prophet algorithm is elegantly written to account for missing data,
outliers and typical phenomena of time series, such as seasonal trends and holiday effects,
and is broadly used in various data science studies [89,90].
(UI) object (usually saved as ui.R), and a server function (server.R). The UI contains the
instructions defining the appearance of the app, enabling the user to interact with the app
by clicking on interactive buttons, text boxes or drop-down menus. Default choices are
coded into the UI. The server function defines how the app should work, by housing all
the instructions that drive the functionality of the application and accessing all built-in
functionalities available to R users. The simplest way to create a Shiny app is to create
a new directory hosting both the ui.R and server.R files. When the app is run, these files
tell Shiny how the app should look and how it should behave, respectively. Shiny takes
advantage of reactive programming, a style of programming that creates software that
responds to events rather than solicits inputs from users [107]. The reactive programming
paradigm relies on streams of time-ordered sequences of related event messages allowing
objects to be updated in response to changes introduced by the users. R code is normally
written in an imperative style, where each command must be re-issued to change the
output. In Shiny web apps, by contrast, the output is updated almost instantly thanks to
reactive programming: an output value is recomputed whenever any value or object
connected to it changes. To create a reactive context, the app must declare how inputs and
outputs are connected (the reactive graph), for example by defining reactive expressions
with the reactive() function. Two objects are required: a source (usually,
a user input that occurs through the web interface), and an endpoint (an output object, for
example in the form of a table or a plot). It is also possible to place reactive components,
called reactive conductors, between the sources and endpoints to manage complex operations.
A detailed description of reactivity in Shiny can be found at https://shiny.rstudio.com/
articles/reactivity-overview.html.
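The source, conductor and endpoint roles described above can be sketched in a few lines. This is a minimal, hypothetical example rather than a recommended app layout: the input id "n", the output id "hist" and the object name values are invented for illustration, and the UI and server are kept in a single script instead of separate ui.R and server.R files for brevity.

```r
library(shiny)

# UI: a slider (the reactive source) and a plot placeholder (the endpoint).
ui <- fluidPage(
  titlePanel("Reactive histogram (illustrative)"),
  sliderInput("n", "Number of random values:", min = 10, max = 500, value = 100),
  plotOutput("hist")
)

# Server: a reactive conductor sits between the input and the output.
server <- function(input, output, session) {
  # Reactive expression: re-evaluated only when input$n changes.
  values <- reactive({
    rnorm(input$n)
  })

  # Endpoint: redrawn whenever values() is invalidated.
  output$hist <- renderPlot({
    hist(values(), main = paste(input$n, "draws from N(0, 1)"))
  })
}

app <- shinyApp(ui = ui, server = server)

# Launch only in an interactive session, so that sourcing the script does not block.
if (interactive()) runApp(app)
```

Moving the slider invalidates values(), which in turn invalidates the plot; no command has to be re-issued by the user, which is the behaviour the reactive graph is meant to provide.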
Hosting of Shiny apps can be performed through a Shiny server (https://www.rstudio.
com/products/shiny/shiny-server/, accessed on 21 April 2022) on a private server or on
cloud-based hosting services such as Amazon Web Services or Microsoft Azure. Limitations
exist, however, for the free version of Shiny server, which, for example, limits the number
of usable threads (and therefore of concurrent users) for the deployed Shiny app. The
relative simplicity of Shiny, at least for simple web interfaces, and the vast availability of
Shiny tutorials and solutions online make it an ideal, and currently very successful [102],
instrument for bioinformaticians to release their methods to all biologists, taking advantage of
R code and implementations that would otherwise be inaccessible to non-programmers.
10.1. Books
“The R book” by Michael J. Crawley (ISBN-13: 978-0470973929). An excellent start
written by a true R enthusiast, showing with practical examples the basics of R structures
and functions, and venturing into early machine learning.
“R for dummies” by Andrie de Vries and Joris Meys (ISBN-13: 978-1119055808). From
the popular Wiley series, this book explains R in detail, even for a complete beginner of
computer science, bringing him or her up to speed with the most recent R functionalities.
“R for Data Science” by Hadley Wickham (ISBN-13: 978-1491910399). This book,
written by the creator of ggplot2 and tidyverse, follows the classic O’Reilly teaching tradition,
providing not only a structured lesson, but also a handy cookbook of solutions for both
simple and more involved data science operations with R.
“The R inferno” by Patrick Burns (ISBN-13: 978-1471046520). This is one of the least
statistics-oriented books, focusing rather on subtleties of the R language and teaching
optimal ways to write efficient (in terms of computational time and memory usage) R code.
“Use R!” by Robert Gentleman, Kurt Hornik and Giovanni Parmigiani (ISSN: 2197-
5736) is a book series of almost 100 volumes dedicated to practical and focused R usages,
written by some of the original founders of the R community itself.
“The R Series” (https://www.routledge.com/Chapman--HallCRC-The-R-Series/book-
series/CRCTHERSER, accessed on 21 April 2022) is another collection of R books edited
by John Chambers, Torsten Hothorn, Duncan Temple Lang and Hadley Wickham. This
series consists of 62 titles, and covers a wide range of topics, from more hardcore
programming concepts to applications of R in biology, finance and criminology.
Supplementary Materials: The following supporting information can be downloaded at: https:
//www.mdpi.com/article/10.3390/life12050648/s1, Supplementary File S1: Artificial dataset used
as examples in the “Practical R” paragraph. Supplementary Figure S1: Archive size of the r-help
official mailing list over time (every month). Sizes above 1 MB are rounded to the nearest MB value.
Data extracted from https://hypatia.math.ethz.ch/pipermail/r-help/. Supplementary Figure S2:
File writing times compared for a 200 × 50,000 numeric matrix object in R. The file amounts to
182,037,446 bytes in CSV format, 76,782,060 bytes in RDS format, and 76,782,078 bytes in RDA format.
The three tests were performed in succession 100 times on an otherwise idle PC with Windows 10,
64.0 GB RAM, Intel Xeon E3-1245 @ 3.50 GHz, Seagate ST2000VX008 2TB SATA Hard Drive (with
speed = 7200 RPM). Error bars indicate the standard deviation.
Author Contributions: Conceptualization, F.M.G.; methodology, D.M., C.C. and F.M.G.; validation,
D.M., C.C. and F.M.G.; formal analysis, D.M., C.C. and F.M.G.; investigation, D.M., C.C. and F.M.G.;
resources, F.M.G. and D.M.; data curation, D.M.; writing—original draft preparation, D.M. and F.M.G.;
writing—review and editing, C.C.; visualization, F.M.G.; supervision, F.M.G.; project administration,
F.M.G.; funding acquisition, D.M. and F.M.G. All authors have read and agreed to the published
version of the manuscript.
Funding: This research was funded by the CARISBO Foundation, under the Bando Ricerca Medica e
Alta Tecnologia 2021 (project 2021.0167); University of Bologna, ALMArie CURIE 2021 initiative; Ital-
ian Ministry of University and Research, under the PON “Ricerca e Innovazione” 2014–2020 program;
CINECA consortium, project HP10CJH90B.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.
Acknowledgments: We would like to thank Francesco Licausi and Beatrice Giuntoli for their scientific
support, and Mariangela Santorsola for the fruitful discussions on code editors.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Ihaka, R.; Gentleman, R. R: A Language for Data Analysis and Graphics. J. Comput. Graph. Stat. 1996, 5, 299–314. [CrossRef]
2. Becker, R.A. A Brief History of S. In Computational Statistics; Dirschedl, P., Ostermann, R., Eds.; Contributions to Statistics;
Physica-Verlag HD: Heidelberg, Germany, 1994; pp. 81–110. ISBN 978-3-7908-0813-1.
3. Chambers, J.M. Programming with Data: A Guide to the S Language; Springer Science & Business Media: Berlin/Heidelberg,
Germany, 1998; ISBN 978-0-387-98503-9.
4. Becker, R.A. The New S Language; CRC Press: Boca Raton, FL, USA, 2018; ISBN 978-1-351-09188-6.
5. Ihaka, R. The R Project: A Brief History and Thoughts about the Future. Univ. Auckl. 2017, 4, 22.
6. Morandat, F.; Hill, B.; Osvald, L.; Vitek, J. Evaluating the Design of the R Language. In Proceedings of the ECOOP 2012—Object-
Oriented Programming; Noble, J., Ed.; Springer: Berlin/Heidelberg, Germany, 2012; pp. 104–131.
7. Ihaka, R. R: Past and Future History. Comput. Sci. Stat. 1998, 392396. Available online: https://cran.r-project.org/doc/html/
interface98-paper/paper.html (accessed on 21 April 2022).
8. Hornik, K. R Frequently Asked Questions. Available online: https://cran.r-project.org/doc/FAQ/R-FAQ.html#What-are-the-
differences-between-R-and-S_003f (accessed on 8 December 2021).
9. Carbonnelle, P. PYPL PopularitY of Programming Language Index. Available online: https://pypl.github.io/PYPL.html (accessed
on 9 December 2021).
10. Maechler, M. “R-Announce”, “R-Help”, “R-Devel”: 3 Mailing Lists for R. Available online: https://stat.ethz.ch/pipermail/r-
announce/1997/000000.html (accessed on 8 December 2021).
11. Hornik, K. Post from the R-Announce Mailing List: “ANNOUNCE: CRAN”. Available online: https://stat.ethz.ch/pipermail/r-
announce/1997/000001.html (accessed on 9 December 2021).
12. R: Contributors. Available online: https://www.r-project.org/contributors.html (accessed on 9 December 2021).
13. Bates, D. Post from the R-Announce Mailing List: “New Domain—r-Project.Org”. Available online: https://stat.ethz.ch/
pipermail/r-announce/1999/000103.html (accessed on 9 December 2021).
14. Dalgaard, P. Post from the R-Announce Mailing List: “R-1.0.0 Is Released”. Available online: https://stat.ethz.ch/pipermail/r-
announce/2000/000127.html (accessed on 9 December 2021).
15. Leisch, F. Post from the R-Announce Mailing List: “R Foundation for Statistical Computing”. Available online: https://stat.ethz.
ch/pipermail/r-announce/2003/000385.html (accessed on 9 December 2021).
16. The R Foundation Statute. Available online: https://www.r-project.org/foundation/Rfoundation-statutes.pdf (accessed on
9 December 2021).
17. Roh, S.W.; Abell, G.C.; Kim, K.-H.; Nam, Y.-D.; Bae, J.-W. Comparing Microarrays and Next-Generation Sequencing Technologies
for Microbial Ecology Research. Trends Biotechnol. 2010, 28, 291–299. [CrossRef]
18. Galili, T. R 3.0.0 Is Released! (What’s New, and How to Upgrade)|R-Statistics Blog. 2013. Available online: https://www.r-
statistics.com/2013/04/r-3-0-0-is-released-whats-new-and-how-to-upgrade/ (accessed on 21 April 2022).
19. Smith, D. R 4.0.0 Now Available, and a Look Back at R’s History. Available online: https://blog.revolutionanalytics.com/2020/0
4/r-400-is-released.html (accessed on 9 December 2021).
20. Lockstone, H.E. Exon Array Data Analysis Using Affymetrix Power Tools and R Statistical Software. Brief. Bioinform. 2011, 12,
634–644. [CrossRef]
21. Heather, J.M.; Chain, B. The Sequence of Sequencers: The History of Sequencing DNA. Genomics 2016, 107, 1–8. [CrossRef] [PubMed]
22. Gentleman, R. 2002 Annual Report for the Bioconductor Project. Available online: https://www.bioconductor.org/about/
annual-reports/AnnRep2002.pdf (accessed on 9 December 2021).
23. Gentleman, R.C.; Carey, V.J.; Bates, D.M.; Bolstad, B.; Dettling, M.; Dudoit, S.; Ellis, B.; Gautier, L.; Ge, Y.; Gentry, J.; et al.
Bioconductor: Open Software Development for Computational Biology and Bioinformatics. Genome Biol. 2004, 5, R80. [CrossRef]
24. Kopf, D. Ggplot2 Is 10 Years Old: The Program That Brought Data Visualization to the Masses. Available online: https://qz.com/10
07328/all-hail-ggplot2-the-code-powering-all-those-excellent-charts-is-10-years-old/ (accessed on 9 December 2021).
25. Villanueva, R.A.M.; Chen, Z.J. Ggplot2: Elegant Graphics for Data Analysis (2nd Ed.). Meas. Interdiscip. Res. Perspect. 2019, 17,
160–167. [CrossRef]
26. Wickham, H.; Averick, M.; Bryan, J.; Chang, W.; McGowan, L.; François, R.; Grolemund, G.; Hayes, A.; Henry, L.; Hester, J.; et al.
Welcome to the Tidyverse. J. Open Source Softw. 2019, 4, 1686. [CrossRef]
27. RStudio GitHub Repository. Available online: https://github.com/rstudio (accessed on 9 December 2021).
28. RStudio Team. RStudio, New Open-Source IDE for R. Available online: https://www.rstudio.com/blog/
rstudio-new-open-source-ide-for-r/ (accessed on 9 December 2021).
29. Smith, D. RStudio Releases Shiny|R-Bloggers. 2012. Available online: https://www.r-bloggers.com/2012/11/rstudio-releases-
shiny/ (accessed on 21 April 2022).
30. Mercatelli, D.; Holding, A.N.; Giorgi, F.M. Web Tools to Fight Pandemics: The COVID-19 Experience. Brief. Bioinform. 2021, 22,
690–700. [CrossRef]
31. Xie, Y.; Allaire, J.J.; Grolemund, G. R Markdown: The Definitive Guide, 1st ed.; Chapman and Hall/CRC: London, UK, 2018;
ISBN 978-1-138-35944-4.
32. Baumer, B.; Udwin, D. R Markdown. WIREs Comput. Stat. 2015, 7, 167–177. [CrossRef]
33. Xu, W.; Huang, R.; Zhang, H.; El-Khamra, Y.; Walling, D. Empowering R with High Performance Computing Resources for
Big Data Analytics. In Conquering Big Data with High Performance Computing; Arora, R., Ed.; Springer International Publishing:
Cham, Switzerland, 2016; pp. 191–217. ISBN 978-3-319-33742-5.
34. Schäfer, J.; Opgen-Rhein, R.; Strimmer, K. Reverse Engineering Genetic Networks Using the GeneNet Package. Newsl. R Proj.
2006, 6, 50.
35. Hornik, K. Are There Too Many R Packages? Austrian J. Stat. 2012, 41, 59–66. [CrossRef]
36. Love, M.I.; Huber, W.; Anders, S. Moderated Estimation of Fold Change and Dispersion for RNA-Seq Data with DESeq2. Genome
Biol. 2014, 15, 550. [CrossRef]
37. Smyth, G.K. Limma: Linear Models for Microarray Data. In Bioinformatics and Computational Biology Solutions Using R and
Bioconductor; Springer: Berlin/Heidelberg, Germany, 2005; pp. 397–420.
38. Lawrence, M.; Huber, W.; Pages, H.; Aboyoun, P.; Carlson, M.; Gentleman, R.; Morgan, M.T.; Carey, V.J. Software for Computing
and Annotating Genomic Ranges. PLoS Comput. Biol. 2013, 9, e1003118. [CrossRef]
39. Mercatelli, D.; Lopez-Garcia, G.; Giorgi, F.M. Corto: A Lightweight R Package for Gene Network Inference and Master Regulator
Analysis. Bioinformatics 2020, 36, 3916–3917. [CrossRef]
40. Satija, R.; Farrell, J.A.; Gennert, D.; Schier, A.F.; Regev, A. Spatial Reconstruction of Single-Cell Gene Expression Data. Nat.
Biotechnol. 2015, 33, 495–502. [CrossRef]
41. R-Forge Home Page. Available online: https://r-forge.r-project.org/ (accessed on 9 December 2021).
42. Zapponi, C. GitHut—Programming Languages and GitHub. Available online: https://githut.info/ (accessed on 9 December 2021).
43. Lopez, G.; Egolf, L.E.; Giorgi, F.M.; Diskin, S.J.; Margolin, A.A. Svpluscnv: Analysis and Visualization of Complex Structural
Variation Data. Bioinformatics 2021, 37, 1912–1914. [CrossRef]
44. Su, K.; Wu, Z.; Wu, H. Simulation, Power Evaluation and Sample Size Recommendation for Single-Cell RNA-Seq. Bioinformatics
2020, 36, 4860–4868. [CrossRef]
45. Gillespie, C. Understanding the Parquet File Format. Available online: https://www.jumpingrivers.com/blog/parquet-file-
format-big-data-r/ (accessed on 9 December 2021).
46. Royston, P. Approximating the Shapiro-Wilk W-Test for Non-Normality. Stat. Comput. 1992, 2, 117–119. [CrossRef]
47. Gosset, W.S. The Probable Error of a Mean. Biometrika 1908, 6, 1–25. [CrossRef]
48. Bonett, D.G.; Wright, T.A. Sample Size Requirements for Estimating Pearson, Kendall and Spearman Correlations. Psychometrika
2000, 65, 23–28. [CrossRef]
49. Wilcoxon, F. Individual Comparisons by Ranking Methods. Biom. Bull. 1945, 1, 80–83. [CrossRef]
50. Mercatelli, D.; Balboni, N.; Palma, A.; Aleo, E.; Sanna, P.P.; Perini, G.; Giorgi, F.M. Single-Cell Gene Network Analysis and
Transcriptional Landscape of MYCN-Amplified Neuroblastoma Cell Lines. Biomolecules 2021, 11, 177. [CrossRef] [PubMed]
51. Spitzer, M.; Wildenhain, J.; Rappsilber, J.; Tyers, M. BoxPlotR: A Web Tool for Generation of Box Plots. Nat. Methods 2014, 11,
121–122. [CrossRef]
52. Kenny, M.; Schoen, I. Violin SuperPlots: Visualizing Replicate Heterogeneity in Large Data Sets. MBoC 2021, 32, 1333–1334. [CrossRef]
53. Hintze, J.L.; Nelson, R.D. Violin Plots: A Box Plot-Density Trace Synergism. Am. Stat. 1998, 52, 181–184. [CrossRef]
54. Leisch, F. Sweave: Dynamic Generation of Statistical Reports Using Literate Data Analysis. In Proceedings of the Compstat; Härdle,
W., Rönz, B., Eds.; Physica-Verlag HD: Heidelberg, Germany, 2002; pp. 575–580.
55. Xie, Y. Dynamic Documents with R and Knitr; Chapman and Hall/CRC: London, UK, 2016; ISBN 978-0-429-17103-1.
56. Markowetz, F. Five Selfish Reasons to Work Reproducibly. Genome Biol. 2015, 16, 274. [CrossRef]
57. Murrell, P. R Graphics; Chapman and Hall/CRC: New York, NY, USA, 2005; ISBN 978-0-429-19610-2.
58. Stander, J.; Dalla Valle, L. On Enthusing Students About Big Data and Social Media Visualization and Analysis Using R, RStudio,
and RMarkdown. J. Stat. Educ. 2017, 25, 60–67. [CrossRef]
59. Rödiger, S.; Friedrichsmeier, T.; Kapat, P.; Michalke, M. RKWard: A Comprehensive Graphical User Interface and Integrated
Development Environment for Statistical Analysis with R. J. Stat. Softw. 2012, 49, 1–34. [CrossRef]
60. Lam, L. A Guide to Eclipse and the R Plug-in StatET. Available online: https://usermanual.wiki/Document/A20guide20to2
0Eclipse20and20the20R20plugin20StatET.1831954166 (accessed on 21 April 2022).
61. Wahlbrink, S.; Verbeke, T. An Open Source Visual R Debugger in StatET. In Proceedings of the R User Conference, Coventry, UK,
16–18 August 2011; University of Warwick: Coventry, UK; p. 71.
62. Nelson, M.J.; Hoover, A.K. Notes on Using Google Colaboratory in AI Education. In Proceedings of the 2020 ACM Conference on
Innovation and Technology in Computer Science Education, Trondheim, Norway, 15–19 June 2020; pp. 533–534.
63. Beard, B. Setup and Installation of R Tools for Visual Studio. In Beginning SQL Server R Services; Springer: Berlin/Heidelberg,
Germany, 2016; pp. 33–71.
64. Ueda, Y. R Extension for Visual Studio Code. Available online: https://marketplace.visualstudio.com/items?itemName=
Ikuyadeu.r (accessed on 9 December 2021).
65. Stack Overflow Developer Survey 2021—Most Popular Integrated Development Environments. Available online: https://
insights.stackoverflow.com/survey/2021#section-most-popular-technologies-integrated-development-environment (accessed
on 9 December 2021).
66. de Aquino, J.A. Jalvesaq/Nvim-R. Available online: https://github.com/jalvesaq/Nvim-R (accessed on 21 April 2022).
67. Bell, C.G.; Mudge, J.C.; McNamara, J.E. Digital Equipment Corporation. In Computer Engineering: A DEC View of Hardware Systems
Design; Digital Press: Bedford, MA, USA, 1978; ISBN 978-0-932376-00-8.
68. Kirkbride, P. Emacs and Vim. In Basic Linux Terminal Tips and Tricks; Springer: Berlin/Heidelberg, Germany, 2020; pp. 247–274.
69. Hallen, J. Text Editor Performance Comparison. Available online: https://github.com/jhallen/joes-sandbox/tree/master/editor-
perf (accessed on 9 December 2021).
70. Sparapani, R. Revolutions Blog—Emacs, ESS and R for Zombies. Available online: https://blog.revolutionanalytics.com/2014/0
3/emacs-ess-and-r-for-zombies.html (accessed on 9 December 2021).
71. Fourment, M.; Gillings, M.R. A Comparison of Common Programming Languages Used in Bioinformatics. BMC Bioinform. 2008,
9, 82. [CrossRef] [PubMed]
72. Eddelbuettel, D.; Francois, R. Rcpp: Seamless R and C++ Integration. J. Stat. Softw. 2011, 40, 1–18. [CrossRef]
73. Irizarry, R.A.; Wu, Z.; Jaffee, H.A. Comparison of Affymetrix GeneChip Expression Measures. Bioinformatics 2006, 22, 789–794.
[CrossRef] [PubMed]
74. Anders, S.; Huber, W. Differential Expression of RNA-Seq Data at the Gene Level–the DESeq Package. Heidelb. Ger. Eur. Mol. Biol.
Lab. (EMBL) 2012, 10, f1000research.
75. Eastwood, B. The 10 Most Popular Programming Languages to Learn in 2021. Available online: https://www.northeastern.edu/
graduate/blog/most-popular-programming-languages/ (accessed on 9 December 2021).
76. Yu, G.; Wang, L.-G.; Han, Y.; He, Q.-Y. ClusterProfiler: An R Package for Comparing Biological Themes among Gene Clusters.
Omics J. Integr. Biol. 2012, 16, 284–287. [CrossRef]
77. Durinck, S.; Spellman, P.T.; Birney, E.; Huber, W. Mapping Identifiers for the Integration of Genomic Datasets with the
R/Bioconductor Package BiomaRt. Nat. Protoc. 2009, 4, 1184–1191. [CrossRef]
78. Dowle, M. Benchmarks: Grouping · Rdatatable/Data.Table Wiki · GitHub. Available online: https://github.com/Rdatatable/
data.table/wiki/Benchmarks-%3A-Grouping (accessed on 9 December 2021).
79. James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction to Statistical Learning; Springer Texts in Statistics; Springer:
New York, NY, USA, 2013; Volume 103, ISBN 978-1-4614-7137-0.
80. Tibshirani, R. The Lasso Method for Variable Selection in the Cox Model. Stat. Med. 1997, 16, 385–395. [CrossRef]
81. Vasilevski, A.; Giorgi, F.M.; Bertinetti, L.; Usadel, B. LASSO Modeling of the Arabidopsis Thaliana Seed/Seedling Transcriptome:
A Model Case for Detection of Novel Mucilage and Pectin Metabolism Genes. Mol. BioSyst. 2012, 8, 2566. [CrossRef]
82. Rawi, R.; Mall, R.; Kunji, K.; Shen, C.-H.; Kwong, P.D.; Chuang, G.-Y. PaRSnIP: Sequence-Based Protein Solubility Prediction
Using Gradient Boosting Machine. Bioinformatics 2018, 34, 1092–1098. [CrossRef]
83. Mercatelli, D.; Ray, F.; Giorgi, F.M. Pan-Cancer and Single-Cell Modeling of Genomic Alterations Through Gene Expression. Front.
Genet. 2019, 10, 671. [CrossRef]
84. Barter, R. Tidymodels: Tidy Machine Learning in R. Available online: https://www.rebeccabarter.com/blog/2020-03-25
_machine_learning/ (accessed on 8 December 2021).
85. LeDell, E.; Gill, N.; Aiello, S.; Fu, A.; Candel, A.; Click, C.; Kraljevic, T.; Nykodym, T.; Aboyoun, P.; Kurka, M.; et al. H2O: R
Interface for the “H2O” Scalable Machine Learning Platform. Available online: https://docs.h2o.ai/h2o/latest-stable/h2o-r/
docs/index.html (accessed on 21 April 2022).
86. Lang, M.; Binder, M.; Richter, J.; Schratz, P.; Pfisterer, F.; Coors, S.; Au, Q.; Casalicchio, G.; Kotthoff, L.; Bischl, B. Mlr3: A Modern
Object-Oriented Machine Learning Framework in R. J. Open Source Softw. 2019, 4, 1903. [CrossRef]
87. Venables, W.N.; Ripley, B.D. Modern Applied Statistics with S, 4th ed.; Springer: New York, NY, USA, 2002.
88. Taylor, S.; Letham, B. Prophet: Automatic Forecasting Procedure. Available online: https://cran.r-project.org/web/packages/
prophet/index.html (accessed on 21 April 2022).
89. Papacharalampous, G.A.; Tyralis, H. Evaluation of Random Forests and Prophet for Daily Streamflow Forecasting. Adv. Geosci.
2018, 45, 201–208. [CrossRef]
90. Rahimi, I.; Chen, F.; Gandomi, A.H. A Review on COVID-19 Forecasting Models. Neural Comput. Appl. 2021, 1–11. [CrossRef]
[PubMed]
91. Berners-Lee, T.; Cailliau, R.; Groff, J.; Pollermann, B. World-Wide Web: The Information Universe. Internet Res. 1992, 2, 52–58.
[CrossRef]
92. Hendler, J. Web 3.0 Emerging. Computer 2009, 42, 111–113. [CrossRef]
93. Becoming A Data-Driven CEO|Domo. Available online: https://www.domo.com/solution/data-never-sleeps-6 (accessed on
7 November 2021).
94. Internet Users in the World. 2021. Available online: https://www.statista.com/statistics/617136/digital-population-worldwide/
(accessed on 7 November 2021).
95. Brusic, V. The Growth of Bioinformatics. Brief. Bioinform. 2006, 8, 69–70. [CrossRef]
96. Clough, E.; Barrett, T. The Gene Expression Omnibus Database. In Statistical Genomics: Methods and Protocols; Mathé, E., Davis,
S., Eds.; Methods in Molecular Biology; Springer: New York, NY, USA, 2016; pp. 93–110. ISBN 978-1-4939-3578-9.
97. Parkinson, H.; Kapushesky, M.; Shojatalab, M.; Abeygunawardena, N.; Coulson, R.; Farne, A.; Holloway, E.; Kolesnykov, N.; Lilja,
P.; Lukk, M.; et al. ArrayExpress—A Public Database of Microarray Experiments and Gene Expression Profiles. Nucleic Acids Res.
2007, 35, D747–D750. [CrossRef]
98. Hubbard, S.J.; Jones, A.R. (Eds.) Proteome Bioinformatics; Methods in Molecular Biology; Humana Press: Totowa, NJ, USA, 2010;
Volume 604, ISBN 978-1-60761-443-2.
99. Szklarczyk, D.; Gable, A.L.; Nastou, K.C.; Lyon, D.; Kirsch, R.; Pyysalo, S.; Doncheva, N.T.; Legeay, M.; Fang, T.; Bork, P.; et al.
The STRING Database in 2021: Customizable Protein–Protein Networks, and Functional Characterization of User-Uploaded
Gene/Measurement Sets. Nucleic Acids Res. 2021, 49, D605–D612. [CrossRef]
100. Stark, C.; Breitkreutz, B.-J.; Reguly, T.; Boucher, L.; Breitkreutz, A.; Tyers, M. BioGRID: A General Repository for Interaction
Datasets. Nucleic Acids Res. 2006, 34, D535–D539. [CrossRef]
101. Pal, S.; Mondal, S.; Das, G.; Khatua, S.; Ghosh, Z. Big Data in Biology: The Hope and Present-Day Challenges in It. Gene Rep. 2020,
21, 100869. [CrossRef]
102. Jia, L.; Yao, W.; Jiang, Y.; Li, Y.; Wang, Z.; Li, H.; Huang, F.; Li, J.; Chen, T.; Zhang, H. Development of Interactive Biological Web
Applications with R/Shiny. Brief. Bioinform. 2021, 23, bbab415. [CrossRef]
103. Greene, C.S.; Tan, J.; Ung, M.; Moore, J.H.; Cheng, C. Big Data Bioinformatics. J. Cell. Physiol. 2014, 229, 1896–1900. [CrossRef]
104. Mercatelli, D.; Triboli, L.; Fornasari, E.; Ray, F.; Giorgi, F.M. Coronapp: A Web Application to Annotate and Monitor SARS-CoV-
2 Mutations. J. Med. Virol. 2021, 93, 3238–3245. [CrossRef]
105. Menestrina, L.; Cabrelle, C.; Recanatini, M. COVIDrugNet: A Network-Based Web Tool to Investigate the Drugs Currently in
Clinical Trial to Contrast COVID-19. Sci. Rep. 2021, 11, 19426. [CrossRef]
106. Kasprzak, P.; Mitchell, L.; Kravchuk, O.; Timmins, A. Six Years of Shiny in Research—Collaborative Development of Web Tools in
R. arXiv 2021, arXiv:2101.10948. [CrossRef]
107. Salvaneschi, G.; Margara, A.; Tamburrelli, G. Reactive Programming: A Walkthrough. In Proceedings of the 2015 IEEE/ACM
37th IEEE International Conference on Software Engineering, Florence, Italy, 16–24 May 2015; Volume 2, pp. 953–954.