R Programming
Frame, Array, Matrix, Statistics Commands, Base Graphics, Data Manipulation with
data.table, Concept of Cluster, Concept of Prediction Model, Analysis of Real-World Problems.
The R environment:
R is an integrated suite of software facilities for data manipulation,
calculation and graphical display.
Installation of R
Installing R on Windows OS
To install R on Windows OS:
4. Run the R executable file to start installation, and allow the app to
make changes to your device.
Data Types in R
1. Logical Data Type
The logical data type in R is also known as the boolean data type. It can
have only two values: TRUE and FALSE. For example,
bool1 <- TRUE
print(bool1)
print(class(bool1))
bool2 <- FALSE
print(bool2)
print(class(bool2))
Output
[1] TRUE
[1] "logical"
[1] FALSE
[1] "logical"
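Logical values can also be written with the single letters T (for TRUE) and
F (for FALSE). For example,
is_weekend <- F
print(class(is_weekend)) # "logical"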
2. Numeric Data Type
In R, the numeric data type represents all real numbers with or without
decimal values. For example,
# floating point values
weight <- 63.5
print(weight)
print(class(weight))
# real numbers
height <- 182
print(height)
print(class(height))
Output
[1] 63.5
[1] "numeric"
[1] 182
[1] "numeric"
3. Integer Data Type
The integer data type specifies whole numbers without decimal points. We use
the suffix L to specify integer data. For example,
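The example code is not shown at this point. A minimal sketch consistent with the output and explanation that follow (the variable name integer_variable and the value 186L are taken from that explanation) could be:

```r
# the L suffix stores 186 as an integer instead of a numeric (double) value
integer_variable <- 186L
print(class(integer_variable))
```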
Output
[1] "integer"
Here, 186L is an integer value. So we get "integer" when we print the class
of integer_variable .
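4. Complex Data Type
The complex data type specifies values with an imaginary part, written with
the suffix i (for example, 3 + 2i). Printing the class of such a value gives:
[1] "complex"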
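5. Character Data Type
The character data type stores text, written inside single or double
quotes. For example,
fruit <- "Apple"  # illustrative value, not from the source
my_char <- 'A'    # illustrative value, not from the source
print(class(fruit))
print(class(my_char))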
Output
[1] "character"
[1] "character"
Here, both the variables - fruit and my_char - are of character data type.
6. Raw Data Type
A raw data type specifies values as raw bytes. You can use the following
methods to convert character data types to a raw data type and vice-versa:
charToRaw() - converts character data to raw data
rawToChar() - converts raw data to character data
For example,
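# convert character data to raw bytes
raw_variable <- charToRaw("Welcome to Programiz")
print(raw_variable)
print(class(raw_variable))
# convert raw bytes back to character data
char_variable <- rawToChar(raw_variable)
print(char_variable)
print(class(char_variable))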
Output
[1] 57 65 6c 63 6f 6d 65 20 74 6f 20 50 72 6f 67 72 61 6d 69 7a
[1] "raw"
[1] "Welcome to Programiz"
[1] "character"
In this program,
We have first used the charToRaw() function to convert the string "Welcome to
Programiz" to raw bytes.
This is why we get "raw" as output when we print the class of raw_variable .
Then, we have used the rawToChar() function to convert the data
in raw_variable back to character form.
R Vectors
A vector is the basic data structure in R that stores data of similar
types. For example,
Suppose we need to record the age of 5 employees. Instead of
creating 5 separate variables, we can simply create a vector.
Create a Vector in R
In R, we use the c() function to create a vector. For example,
employees <- c("Ana", "Ben", "Cara")  # illustrative names, not from the source
print(employees)
We can access elements of a vector using the index number (1, 2, 3 …).
For example,
[Figure: Vector indexing in R]
Here, we have used the vector index to access the vector elements.
Note: In R, the vector index always starts with 1. Hence, the first element of
a vector is present at index 1, second element at index 2 and so on.
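As a minimal sketch of index-based access (the vector name fruits and its values are assumptions made for illustration, not taken from the original example):

```r
# illustrative vector; the values are assumptions, not from the source
fruits <- c("apple", "banana", "cherry")

# indexing starts at 1, so fruits[1] is the first element
first_fruit <- fruits[1]
third_fruit <- fruits[3]
print(first_fruit)
print(third_fruit)

# an index beyond the length of the vector returns NA instead of an error
missing_fruit <- fruits[5]
print(missing_fruit)
```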
Modify Vector Element
To change a vector element, we can simply reassign a new value to the
specific index. For example,
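The original example is not shown here. A small sketch of the same idea, using a hypothetical vector named colors, might look like this:

```r
# illustrative vector; the values are assumptions, not from the source
colors <- c("red", "green", "blue")

# reassign the element at index 2
colors[2] <- "yellow"
print(colors)
```

Only the element at the given index changes; the other elements and the length of the vector stay the same.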
Numeric Vector in R
Similar to strings, we use the c() function to create a numeric vector. For
example,
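numbers <- c(1, 2, 3, 4, 5)
print(numbers)
# Output: [1] 1 2 3 4 5
Here, we have used the c() function to create a vector of numeric
sequence called numbers .
We can create the same sequence more compactly with the : operator. For
example,
numbers <- 1:5
print(numbers)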
Output
[1] 1 2 3 4 5
Here, we have used the : operator to create the vector named numbers with
numerical values in sequence i.e. 1 to 5.
Repeat Vectors in R
In R, we use the rep() function to repeat elements of vectors. For example,
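# repeat the whole vector two times
rep(numbers, times = 2)
# repeat each element two times
rep(numbers, each = 2)
Here,
times = 2 repeats the whole vector twice.
each = 2 repeats every element twice before moving on to the next one.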
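When the elements of numbers (1 to 5) are printed one at a time, for example
inside a for loop, each value appears on its own line.
Output
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5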
Length of Vector in R
We can use the length() function to find the number of elements present
inside the vector. For example,
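The example code is missing at this point. A sketch consistent with the output below, assuming a languages vector with four illustrative elements, could be:

```r
# four illustrative elements, assumed so that the count matches the output
languages <- c("R", "Swift", "Python", "Java")

# length() returns the number of elements in the vector
cat("Total Elements:", length(languages))
```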
Output
Total Elements: 4
Here, we have used length() to find the length of the languages vector.
List in R Programming:
A list in R is a generic object consisting of an ordered collection of
objects. Lists are one-dimensional, heterogeneous data structures.
The list can be a list of vectors, a list of matrices, a list of characters
and a list of functions, and so on. A list is a vector but with
heterogeneous data elements.
Matrices
Matrices are nothing more than a collection of data elements arranged in a
rectangular layout that is two-dimensional. An example matrix with 3x3
dimensions looks like this.
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
The most important thing you need to remember to get started with
matrices is the matrix() function. This function has the following skeleton.
matrix(
  c(),      # vector with the values
  nrow = ,  # number of rows
  ncol = ,  # number of columns
  byrow = ) # TRUE fills by rows, FALSE (the default) fills by columns
The first argument is a vector that defines which atomic values are present
in the matrix. The second argument defines how many rows that vector is
split into, and the third argument tells how many columns. Ideally the
number of elements in the vector equals nrow * ncol; a shorter vector is
recycled to fill the matrix, and the extra elements of a longer vector are
dropped. The last argument defines whether you want to fill up the matrix by
rows or by columns. By default, the argument for byrow is FALSE, which means
the matrix is filled up column by column.
Let's try this one out. Your matrix definition looks like this.
matrix(
  c(1,2,3,4,5,6,7,8),
  nrow = 4,
  ncol = 2,
  byrow = TRUE)
The output should look like this.
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
[4,]    7    8
If you instead ask for only two rows and two columns (nrow = 2, ncol = 2)
and omit the byrow = TRUE argument, the following output greets you.
     [,1] [,2]
[1,]    1    3
[2,]    2    4
Where did the rest of the elements go? The problem is that your vector is
bigger than the matrix size. If you want to get all the values from the
vector, change the column count to ncol = 4. That way you have the
following output.
     [,1] [,2] [,3] [,4]
[1,]    1    3    5    7
[2,]    2    4    6    8
Now that you have the foundations, how do you proceed?
Transpose
This concept comes from linear algebra. Basically, what happens is that the
matrix gets flipped over its diagonal. In order to do this in R, you can use
the t() function. Let's see how it looks given the matrix below.
a <- matrix(
  c(1,2,3,4,5,6,7,8),
  nrow = 4,
  ncol = 2,
  byrow = TRUE)
Before transposing, the output looks like this.
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
[4,]    7    8
After transposing with t(a), the output looks like this.
     [,1] [,2] [,3] [,4]
[1,]    1    3    5    7
[2,]    2    4    6    8
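The effect of t() can also be checked programmatically. The following sketch rebuilds the matrix a from the text and verifies that rows and columns have swapped; the name transposed is introduced here only for illustration.

```r
a <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8), nrow = 4, ncol = 2, byrow = TRUE)
transposed <- t(a)

# the dimensions swap: 4 x 2 becomes 2 x 4
print(dim(a))
print(dim(transposed))

# element [i, j] of the original equals element [j, i] of the transpose
print(a[3, 2] == transposed[2, 3])

# transposing twice gives back the original matrix
print(identical(t(transposed), a))
```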
Combine Matrices
In order to combine two matrices side by side, the cbind() function needs to
be used. It takes the matrices as arguments and binds their columns together.
When you are combining matrices, you need to make sure that they have the
same number of rows, otherwise an error is raised.
Given are two matrices B and D.
B <- matrix(
  c(2, 4, 3, 1, 5, 7),
  nrow=3,
  ncol=2)

D <- matrix(
  c(1, 3, 2),
  nrow=3,
  ncol=1)
Their combination can be created the following way.
cbind(B,D)
The output looks like this.
     [,1] [,2] [,3]
[1,]    2    1    1
[2,]    4    5    3
[3,]    3    7    2
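To illustrate the row-count requirement, the sketch below combines B and D as in the text and then tries a hypothetical matrix E that has only two rows. The names combined, E and outcome are assumptions made for this example; tryCatch() is used so that the error message is captured instead of stopping the script.

```r
B <- matrix(c(2, 4, 3, 1, 5, 7), nrow = 3, ncol = 2)
D <- matrix(c(1, 3, 2), nrow = 3, ncol = 1)

# same number of rows: the columns are bound side by side
combined <- cbind(B, D)
print(dim(combined))

# E is a hypothetical matrix with a different number of rows
E <- matrix(c(9, 8), nrow = 2, ncol = 1)
outcome <- tryCatch(
  cbind(B, E),
  error = function(e) conditionMessage(e)  # keep the message, do not stop
)
print(outcome)
```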
Deconstruction
This concept allows you to break down the matrix into its original vector,
which can come in handy in certain situations. Take the following matrix
called H.
H <- matrix(
  c(1,2,3,4,5,6,7,8,9,10),
  nrow=5,
  ncol=2)
You are able to deconstruct it with the c() function.
c(H)
The output looks like this.
 [1]  1  2  3  4  5  6  7  8  9 10
Lists
Lists are objects that, unlike atomic vectors, may contain elements of
different types. These elements can be strings, numbers, vectors, and
even another list. You can also have matrices as elements in your lists.
A list is a general-purpose container for such mixed data. The
function that allows you to create a list is called list().
An example list would look like this.
data <- list("Server","Network Device",c(1,2,3,4),
             FALSE, list(1,2,3,4,5,6))
The content of your list is now as follows.
[[1]]
[1] "Server"

[[2]]
[1] "Network Device"

[[3]]
[1] 1 2 3 4

[[4]]
[1] FALSE

[[5]]
[[5]][[1]]
[1] 1

[[5]][[2]]
[1] 2

[[5]][[3]]
[1] 3

[[5]][[4]]
[1] 4

[[5]][[5]]
[1] 5

[[5]][[6]]
[1] 6
You can see that there is not really any limit as to how many or what type
of elements you can store in the list. There is a special function
called names() that allows you to name the elements of your list, which
results in a dictionary-like data structure. A dictionary data structure
consists of key-value pairs. In this case, the keys are the names and the
values are the actual elements.
Let's give names to the list elements.
names(data) <- c("Hardware", "Network", "vector",
                 "boolean","nestedlist")
After the function is executed, we can refer to the elements in the list by
their names.
> data$Hardware
[1] "Server"

> data$Network
[1] "Network Device"

> data$nestedlist
[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3

[[4]]
[1] 4

[[5]]
[1] 5

[[6]]
[1] 6
This allows you to build more sophisticated functions and create
abstractions that allow users to understand and maintain the code more
efficiently. As with lists in other programming languages, you can access,
manipulate, and merge the lists. The indexing starts from 1.
In order to access the elements, refer to them with their indexes.
Let's retrieve the first and second elements.
> data[1]
$Hardware
[1] "Server"

> data[2]
$Network
[1] "Network Device"
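Note that single brackets return a smaller list, while double brackets return the element itself. The sketch below shows the difference on a shortened version of the list from the text; the names sub_list and element are introduced only for this illustration.

```r
# shortened version of the named list used in the text
data <- list(Hardware = "Server",
             Network = "Network Device",
             vector = c(1, 2, 3, 4))

sub_list <- data[1]    # single brackets: a list of length 1
element <- data[[1]]   # double brackets: the stored value itself

print(class(sub_list))
print(class(element))

# double brackets also accept a name, and the result can be indexed further
print(data[["vector"]][2])
```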
In order to remove a specific element, assign the NULL value to its index.
This will reduce the length of your list.
Let's remove the nested list. You can do this in two ways. The second one
will only work if you have named your list elements.
# remove the element by position
data[5] <- NULL

# or, alternatively, remove it by name
data$nestedlist <- NULL
Suppose you have two lists from different datasources, and you have a
function that needs data from both of them. You have the option to merge
these two lists.
monthids <- list(1,2,3,4,5,6,7,8,9,10,11,12)
months <- list("Jan","Feb","Mar","Apr","May","June","July","Aug",
               "Sep","Oct","Nov","Dec")
The way to achieve this is to use the c() function.
merged.list <- c(monthids,months)
This will produce the following results.
[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3

[[4]]
[1] 4

[[5]]
[1] 5

[[6]]
[1] 6

[[7]]
[1] 7

[[8]]
[1] 8

[[9]]
[1] 9

[[10]]
[1] 10

[[11]]
[1] 11

[[12]]
[1] 12

[[13]]
[1] "Jan"

[[14]]
[1] "Feb"

[[15]]
[1] "Mar"

[[16]]
[1] "Apr"

[[17]]
[1] "May"

[[18]]
[1] "June"

[[19]]
[1] "July"

[[20]]
[1] "Aug"

[[21]]
[1] "Sep"

[[22]]
[1] "Oct"

[[23]]
[1] "Nov"

[[24]]
[1] "Dec"
The unlist() function allows you to convert your lists to vectors.
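myvector <- unlist(merged.list)
Because merged.list mixes numbers and strings, unlist() converts every
element to character here. Unlisting a purely numeric list such as monthids
gives a numeric vector, to which all the usual arithmetic operators can be
applied.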
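The type of the vector returned by unlist() depends on what the list contains. The following sketch uses shortened versions of the two lists from the text to show this; numeric_vector is a name introduced here for illustration.

```r
# shortened versions of the lists from the text
monthids <- list(1, 2, 3)
months <- list("Jan", "Feb", "Mar")
merged.list <- c(monthids, months)

# a mixed list is coerced to the most general type, here character
myvector <- unlist(merged.list)
print(class(myvector))

# a purely numeric list stays numeric, so arithmetic works
numeric_vector <- unlist(monthids)
print(class(numeric_vector))
print(numeric_vector * 2)
```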
Arrays
An array is a vector with one or more dimensions. A one-dimensional array
can be considered a vector, and an array with two dimensions can be
considered a matrix. Behind the scenes, data is stored in the form of an
n-dimensional matrix. The array() function can be used to create your own
array. The only restriction is that an array can store only a single data
type.
You can create a simple array the following way.
v1 <- c(1,2,3)
v2 <- c(4,5,6,7,8,9)
result <- array(c(v1,v2),dim = c(3,3,2))
Now the result holds an array which has two matrices with three rows and
three columns.
, , 1

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

, , 2

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
The keyword here is dim. It defines the maximum number of indices in each
dimension.
There is a more general syntax that is a skeleton to keep in mind and
comes in handy most of the time.
my_array <- array(data, dim = c(rows, columns, matrices),
                  dimnames = list(row_names, column_names, matrix_names))
You have the option to name your rows, columns and matrices in an array.
Suppose you extend your above code with the following.
v1 <- c(1,2,3)
v2 <- c(4,5,6,7,8,9)
col.names <- c("Item","Serial","Size")
row.names <- c("Server","Network","Firewall")
matrix.names <- c("DataCenter EU","DataCenter US")
result <- array(c(v1,v2),dim = c(3,3,2),dimnames =
                list(row.names,col.names,matrix.names))
Now the result array holds a more meaningful name that makes the code
cleaner and easier to maintain.
, , DataCenter EU

         Item Serial Size
Server      1      4    7
Network     2      5    8
Firewall    3      6    9

, , DataCenter US

         Item Serial Size
Server      1      4    7
Network     2      5    8
Firewall    3      6    9
Accessing the elements is a bit more tricky, but once you get the hang of it,
it should become easy. The skeleton code you should keep in mind is the
following.
result[row,column,matrix]
There is a neat trick with this. If you leave any of the indices empty, all
values along that dimension (all rows, columns, or matrices) are returned.
For example, if you were to collect the serial of the first device (Server)
from each datacenter, all you would have to do is write the following.
result[1,2,]
The output should be the following.
DataCenter EU DataCenter US
            4             4
If you were to collect the size of each device from every datacenter, the
following code would do the job.
result[,3,]
The output should be the following.
         DataCenter EU DataCenter US
Server               7             7
Network              8             8
Firewall             9             9
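Because the array has dimnames, the same elements can also be selected by name instead of by position. This is a sketch that rebuilds the named array from the text; selecting by name is standard R indexing, but the particular selections shown are chosen only for illustration.

```r
v1 <- c(1, 2, 3)
v2 <- c(4, 5, 6, 7, 8, 9)
result <- array(c(v1, v2), dim = c(3, 3, 2),
                dimnames = list(c("Server", "Network", "Firewall"),
                                c("Item", "Serial", "Size"),
                                c("DataCenter EU", "DataCenter US")))

# serial of the Server in every datacenter, selected by name
print(result["Server", "Serial", ])

# size of every device in one datacenter
print(result[, "Size", "DataCenter EU"])

# name-based and position-based indexing return the same values
print(identical(result[1, 2, ], result["Server", "Serial", ]))
```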
Arrays allow you to create matrices from them with the following code. Let's
separate each datacenter to their own matrix.
DCEU <- result[,,1]
DCUS <- result[,,2]
The corresponding outputs will be as you expect.
#DCEU
         Item Serial Size
Server      1      4    7
Network     2      5    8
Firewall    3      6    9
#DCUS
         Item Serial Size
Server      1      4    7
Network     2      5    8
Firewall    3      6    9
Statistical Commands in R:
R is a statistical programming tool that's uniquely equipped to handle data, and lots of it.
Wrangling mass amounts of information and producing publication-ready graphics and
visualizations is easy with R. So are all sorts of data analysis, mining, and modeling tasks.
R Statistics
Mean, median and mode.
Minimum and maximum value.
Percentiles.
Variance and Standard Deviation.
Covariance and Correlation.
Probability distributions.
Mean in R Programming
Mean, often referred to as the average, is a fundamental statistical measure that represents the
central value of a dataset. It is calculated by summing up all the values in the dataset and
dividing the sum by the number of data points. The mean is commonly used to provide an
overall understanding of the dataset's typical value, making it a widely used measure in data
analysis and research.
Syntax
In R, calculating the mean is straightforward, thanks to the built-in mean() function. The
syntax for computing the mean in R is as follows:
mean(x, trim = 0, na.rm = FALSE)
Here,
x: This parameter represents the input vector (typically numeric) from which the mean is
to be calculated.
trim: An optional parameter giving the fraction (0 to 0.5) of observations to drop from each
end of the sorted data before calculating the mean. The default value is 0, meaning no
trimming is applied.
na.rm: Another optional parameter that determines whether to exclude missing values (NA)
from the calculation. The default value is FALSE, in which case the result is NA whenever
the data contains an NA. Set this parameter to TRUE to ignore NA values.
Calculate Mean in R
To calculate the mean of a dataset in R, simply use the mean() function with the numeric
input vector as the argument. Let's consider an example:
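The example code is not shown at this point. A minimal sketch consistent with the output below, assuming illustrative values for data_vec (chosen here so that the mean is 45), could be:

```r
# illustrative values (an assumption), chosen so that the mean equals 45
data_vec <- c(12, 23, 45, 67, 78)

mean_result <- mean(data_vec)
print(mean_result)

# with a missing value the result is NA unless na.rm = TRUE is set
with_na <- c(data_vec, NA)
mean_ignoring_na <- mean(with_na, na.rm = TRUE)
```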
Output:
[1] 45
Median in R Programming
The median is a statistical measure used to represent the central value of a dataset. Unlike the
mean, which is the average of all values, the median is the middle value when the dataset is
arranged in ascending order. It is a robust measure of central tendency, meaning it is not
affected by extreme values or outliers in the data. This makes the median particularly useful
when dealing with datasets that have skewed distributions or contain extreme values.
Syntax
Calculating the median in R is straightforward with the built-in median() function. The syntax
for computing the median is as follows:
median(x, na.rm = FALSE)
Here,
x: This parameter represents the numeric input vector for which you
want to calculate the median.
na.rm: An optional parameter that determines whether to exclude missing values
(NA) from the calculation. The default value is FALSE, in which case the result is NA
whenever the data contains an NA. Set this parameter to TRUE to ignore NA values.
Parameter
The median() function in R takes two parameters:
x: This is the main parameter that represents the input vector
for which you want to calculate the median. The vector can be of any length and
should contain numeric values.
na.rm: The na.rm parameter specifies whether to exclude missing values (NA) from
the calculation. By default, it is set to FALSE, so a single NA in the data makes the
result NA. Setting it to TRUE ensures that NA values are ignored during the
calculation.
Calculate Median in R
To calculate the median of a dataset in R, use the median() function with the numeric input
vector as the argument.
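The example code is missing here. A sketch consistent with the output and description that follow, assuming illustrative values for data_vec, might be:

```r
# illustrative values (an assumption) with an odd number of elements
data_vec <- c(12, 23, 45, 67, 78)

# the middle value of the sorted data is the median
median_result <- median(data_vec)
print(median_result)
```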
Output:
[1] 45
In this example, we have a numeric vector data_vec containing some values. We use
the median() function to calculate the median of this vector and store the result in
the median_result variable. Finally, we print the calculated median.
The median() function works seamlessly even if the dataset has an odd or even number of
elements. If the number of elements is odd, the median is the middle value. However, if the
number of elements is even, the median is the average of the two middle values.
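For the even case, a small sketch with assumed values (the vector name data_even and its contents are illustrative, chosen to match the output below) could be:

```r
# illustrative values (an assumption) with an even number of elements
data_even <- c(12, 23, 34, 45, 56, 67)

# the two middle values are 34 and 45, so the median is their average
median_even <- median(data_even)
print(median_even)
```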
Output:
[1] 39.5
Mode in R Programming
The mode is a statistical measure used to identify the value that occurs most frequently in a
dataset. Unlike the mean and median, which represent the central tendencies, the mode
highlights the most commonly recurring value, making it a useful measure for identifying
patterns and understanding the distribution of categorical or discrete data.
Syntax
R does not have a built-in function to directly calculate the mode. However, we can create a
user-defined function or use the mlv() function from the modeest package to find the mode.
The syntax for creating the user-defined function or using the mlv() function is as follows:
User-defined Function:
library(modeest)
mlv(x, method = "mfv")  # "mfv" asks for the most frequent value
Here,
data: This parameter represents the input vector for which you want to calculate the
mode. It can be a vector of any length and should contain categorical or discrete
numeric values.
x: This parameter is the input vector for which you want to find the mode using
the mlv() function from the modeest package.
Parameter
The mode calculation in R can be performed using a user-defined function or the mlv()
function from the modeest package. Both methods take a single parameter:
data: This is the main parameter that represents the input vector for which you want to
calculate the mode. The vector can contain either categorical values or discrete numeric
values.
Since R does not have a built-in function for finding the mode, we can create a user-defined
function to calculate it. The idea behind the user-defined function is to determine the unique
values in the dataset, count their occurrences, and identify the one with the highest frequency.
In this function, we use the unique() function to obtain the unique values in the dataset. Next,
we use the table() function to count the occurrences of each unique value. Finally, we find the
value with the highest frequency using which.max() and return it as the mode.
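The function itself is not shown in the text. The following is one possible sketch of the steps just described, not necessarily the original implementation; the values in data_vec are assumptions chosen so that 23 is the most frequent value.

```r
calculate_mode <- function(data) {
  # distinct values, in order of first appearance
  unique_values <- unique(data)
  # count how often each distinct value occurs
  frequencies <- table(match(data, unique_values))
  # position of the highest count selects the mode
  unique_values[which.max(frequencies)]
}

# illustrative data (an assumption): 23 appears three times
data_vec <- c(12, 23, 23, 45, 67, 23, 78)
print(calculate_mode(data_vec))
```

If several values share the highest frequency, which.max() picks the first of them, so this sketch returns only one mode.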
Output:
[1] 23
In this example, we have a numeric vector data_vec with multiple values, and we use our
user-defined function calculate_mode() to find the mode. The result will be the value that
appears most frequently in the dataset.
Here is a list of all graph types that are illustrated in this article:
Barplot
Boxplot
Density Plot
Heatmap
Histogram
Line Plot
Pairs Plot
Polygon Plot
QQplot
Scatterplot
Venn Diagram
Each type of graphic is illustrated with some basic example code. These codes are
based on the following data:
Barplot
Barplot Definition: A barplot (or barchart; bargraph) illustrates the association between
a numeric and a categorical variable. The barplot represents each category as a bar and
reflects the corresponding numeric value with the bar’s size.
The example data for a stacked barplot should be a matrix consisting of a column for
each bar and a row for each group.
We can draw a stacked barchart by passing such a matrix as input data to the barplot
function.
Furthermore, we should add a legend to our stacked bargraph to illustrate the meaning
of each color.
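The data and plotting code are not included in the text. As a hedged sketch with hypothetical data (the matrix bar_data, its values, and the colors are all assumptions), a stacked barplot with a legend can be drawn in base R like this:

```r
# hypothetical example data: one column per bar, one row per group
bar_data <- matrix(c(5, 3, 2, 4, 6, 1), nrow = 2,
                   dimnames = list(c("Group 1", "Group 2"),
                                   c("A", "B", "C")))
bar_colors <- c("steelblue", "orange")

# a matrix input gives stacked bars by default (beside = FALSE)
barplot(bar_data, col = bar_colors, main = "Stacked barplot")

# the legend maps each color to its group
legend("topright", legend = rownames(bar_data), fill = bar_colors)
```

Passing beside = TRUE to barplot() would draw the groups next to each other instead of stacking them.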
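# install the released version of data.table from CRAN
install.packages('data.table')
# or install the development version from the project repository
install.packages("data.table",
repos="https://Rdatatable.gitlab.io/data.table")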
Importing Datasets
In the R programming language, there are many built-in datasets that one may
use as demo data to demonstrate how R functions work.
One such popular built-in dataset is the "iris" dataset. It provides
measurements of four attributes for 150 flowers (50 from each of three
species).
Working with datasets in data.table is quite different from working with
them in data.frame. Let's look at this in more detail.
Example
Consider the below program that imports iris data stored as a CSV file on the
internet −
# Importing library
library(data.table)
# Creating a dataset
myDataset <-
fread("https://raw.githubusercontent.com/gexijin/learnR/master/datasets/iris.csv")
# print the iris dataset
print(myDataset)
Output
As you see from the above output, the imported data is directly stored as a
data.table.
Example
# Importing library
library(data.table)
# Creating a dataset
myDataset <- fread(
"https://raw.githubusercontent.com/gexijin/learnR/master/datasets/iris.csv")
# print the iris dataset
print(myDataset)
Output
There are 150 rows and 5 columns in the Iris data set.
head(myDataset)
Output
Inside the square brackets of a data.table, column names can be used
directly as variables. This makes it easy to filter rows by passing a
condition on one or more columns inside the square brackets.
myDataset[column_condition]
Example
# Importing library
library(data.table)
# Creating a dataset
myDataset <- fread(
"https://raw.githubusercontent.com/gexijin/learnR/master/datasets/iris.csv")
# datatable syntax to filter rows
# based on column condition
myDataset[Sepal.Length==5.1 & Petal.Length==1.4,]
Output
In the output, the two rows that match the column condition provided inside
the square brackets are returned.
Selecting Columns
We will now see how we can select columns of a dataset using data.table
package. The basic syntax of selecting columns is given below,
myDataset[, column_number, with = F]
Here, column_number must be the position of the column that you want to
select (columns are 1-based).
Example
library(data.table)
# Creating a dataset
myDataset <- fread(
"https://raw.githubusercontent.com/gexijin/learnR/master/datasets/iris.csv")
# data.table syntax to subset second column
myDataset[, 2, with = F]
Output
Sepal.Width
1: 3.5
2: 3.0
3: 3.2
4: 3.1
5: 3.6
---
146: 3.0
147: 2.5
148: 3.0
149: 3.4
150: 3.0
As you can see above in the output, the second column of the iris dataset is
selected.
Example
Now let’s select multiple columns. In the below example, we select two
columns, i.e., 'Petal.Length' and 'Species'.
# Importing library
library(data.table)
# Creating a dataset
myDataset <- fread(
"https://raw.githubusercontent.com/gexijin/learnR/master/datasets/iris.csv")
# data.table syntax to select two columns by name
myDataset[, .(Petal.Length, Species)]
Output
Petal.Length Species
1: 1.4 setosa
2: 1.4 setosa
3: 1.3 setosa
4: 1.5 setosa
5: 1.4 setosa
---
146: 5.2 virginica
147: 5.0 virginica
148: 5.2 virginica
149: 5.4 virginica
150: 5.1 virginica
Example:
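install.packages("factoextra")
library(factoextra)
# Loading dataset
df <- mtcars
df <- na.omit(df)
# Scaling dataset
df <- scale(df)
# k-means with 4 clusters (nstart = 25 random starts is an assumed setting)
km4 <- kmeans(df, centers = 4, nstart = 25)
png(file = "KMeansExample.png")
print(fviz_cluster(km4, data = df))
dev.off()
# k-means with 5 clusters; the second file name is an assumption
km5 <- kmeans(df, centers = 5, nstart = 25)
png(file = "KMeansExample2.png")
print(fviz_cluster(km5, data = df))
dev.off()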
Output:
[Figure: K-means cluster plot when k = 4]
[Figure: K-means cluster plot when k = 5]
Using the fviz_cluster function from the factoextra package, this code applies
k-means clustering to the mtcars dataset using two different values of
centers (4 and 5) and then saves the cluster visualizations as PNG files.
The clustering technique uses the information from the mtcars dataset, which
includes details on various automobile models including the number of
cylinders, horsepower, and miles per gallon. The scale function is used to
scale the data to have a zero mean and unit variance, and the na.omit
function is used to delete any rows with missing values.
Then the kmeans function is used to execute k-means clustering with 4 and 5
clusters; running it from several random starting points improves the chance
of discovering the global optimum. The fviz_cluster function, which plots
the data points coloured by cluster membership and also displays the cluster
centers, is then used to see the resulting cluster assignments.
The png and dev.off functions are then used to save the generated plots as
PNG files.
To save the files to the correct location on your computer, you might need to
modify the file paths or file names used in the png function.
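The clustering step itself needs only base R. The sketch below repeats it without factoextra so that the results can be inspected directly; the seed and nstart = 25 are assumed settings, not values taken from the text.

```r
set.seed(123)  # assumed seed so that the random starts are reproducible

# same preparation as above: drop missing rows, then scale each column
df <- scale(na.omit(mtcars))

# nstart = 25 (assumed) keeps the best result of 25 random starts
km4 <- kmeans(df, centers = 4, nstart = 25)
km5 <- kmeans(df, centers = 5, nstart = 25)

# number of cars assigned to each of the 4 clusters
print(table(km4$cluster))

# total within-cluster sum of squares: lower means tighter clusters
print(c(k4 = km4$tot.withinss, k5 = km5$tot.withinss))
```

Comparing tot.withinss across several values of k is one common way to judge how many clusters are worth keeping.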
Clustering by Similarity Aggregation
Clustering by similarity aggregation is also known as relational clustering or
the Condorcet method. It compares each data point with all other data points
in pairs. For a pair of records A and B, every attribute is counted in one of
two tallies: m(A, B) counts the attributes on which A and B have the same
value, and d(A, B) counts the attributes on which they differ.
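As a small sketch of the pairwise comparison, assuming two hypothetical records described by four categorical attributes (the attribute names and values are invented for illustration), m(A, B) and d(A, B) can be counted like this:

```r
# two hypothetical records with four categorical attributes each
A <- c(colour = "red", size = "large", shape = "round", origin = "EU")
B <- c(colour = "red", size = "small", shape = "round", origin = "US")

# m(A, B): number of attributes on which the two records agree
m_AB <- sum(A == B)

# d(A, B): number of attributes on which the two records differ
d_AB <- sum(A != B)

print(c(m = m_AB, d = d_AB))
```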