R Programming

Uploaded by

Meghana Barad
Copyright
© © All Rights Reserved
Introduction to R Programming (15 Lectures): Concept of R, Installation of R,
Data Types, Vector, List, Frame, Array, Matrix, Statistics Commands, Base
Graphics, Data Manipulation with data.table, Concept of Cluster, Concept of
Prediction Model, Analysis of Real-World Problem.

R is a language and environment for statistical computing and graphics. It is
a GNU project which is similar to the S language and environment which was
developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by
John Chambers and colleagues. R can be considered as a different
implementation of S. There are some important differences, but much code
written for S runs unaltered under R.

R provides a wide variety of statistical (linear and nonlinear modelling,
classical statistical tests, time-series analysis, classification, clustering, …)
and graphical techniques, and is highly extensible. The S language is often
the vehicle of choice for research in statistical methodology, and R provides
an Open Source route to participation in that activity.

One of R’s strengths is the ease with which well-designed publication-quality
plots can be produced, including mathematical symbols and formulae where
needed. Great care has been taken over the defaults for the minor design
choices in graphics, but the user retains full control.

R is available as Free Software under the terms of the Free Software
Foundation’s GNU General Public License in source code form. It compiles
and runs on a wide variety of UNIX platforms and similar systems (including
FreeBSD and Linux), Windows and macOS.

The R environment:
R is an integrated suite of software facilities for data manipulation,
calculation and graphical display. It includes

• an effective data handling and storage facility,
• a suite of operators for calculations on arrays, in particular matrices,
• a large, coherent, integrated collection of intermediate tools for data
  analysis,
• graphical facilities for data analysis and display either on-screen or on
  hardcopy, and
• a well-developed, simple and effective programming language which
  includes conditionals, loops, user-defined recursive functions and input
  and output facilities.

The term “environment” is intended to characterize it as a fully planned and
coherent system, rather than an incremental accretion of very specific and
inflexible tools, as is frequently the case with other data analysis software.
R, like S, is designed around a true computer language, and it allows users to
add additional functionality by defining new functions. Much of the system is
itself written in the R dialect of S, which makes it easy for users to follow the
algorithmic choices made. For computationally-intensive tasks, C, C++ and
Fortran code can be linked and called at run time. Advanced users can write
C code to manipulate R objects directly.

Many users think of R as a statistics system. We prefer to think of it as an
environment within which statistical techniques are implemented. R can be
extended (easily) via packages. There are about eight packages supplied
with the R distribution and many more are available through the CRAN family
of Internet sites covering a very wide range of modern statistics.

R has its own LaTeX-like documentation format, which is used to supply
comprehensive documentation, both on-line in a number of formats and in
hardcopy.

Installation of R

Installing R on Windows OS
To install R on Windows OS:

1. Go to the CRAN website.

2. Click on "Download R for Windows".

3. Click on the "install R for the first time" link to download the R
executable (.exe) file.

4. Run the R executable file to start the installation, and allow the app to
make changes to your device.

Data Types in R

1. Logical Data Type

The logical data type in R is also known as boolean data type. It can only
have two values: TRUE and FALSE . For example,

bool1 <- TRUE

print(bool1)
print(class(bool1))

bool2 <- FALSE

print(bool2)
print(class(bool2))

Output

[1] TRUE
[1] "logical"
[1] FALSE
[1] "logical"

In the above example,

• bool1 has the value TRUE,
• bool2 has the value FALSE.

Here, we get "logical" when we check the type of both variables.

Note: You can also define logical variables with a single letter:
T for TRUE or F for FALSE. For example,

is_weekend <- F
print(class(is_weekend)) # "logical"
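
One caveat about the single-letter shorthand: TRUE and FALSE are reserved words, whereas T and F are ordinary variables that merely start out holding those values, so they can be overwritten by accident. A minimal sketch of the pitfall (the variable names are illustrative only):

```r
# T and F are regular variables pre-set to TRUE and FALSE,
# so an accidental assignment silently changes their meaning.
F <- TRUE                  # legal, but F no longer means FALSE here
is_weekend <- F
print(is_weekend)          # prints TRUE, not FALSE

rm(F)                      # remove our copy; the built-in F is visible again
print(F)                   # prints FALSE

# TRUE and FALSE cannot be reassigned, which makes them the safer spelling.
is_weekday <- TRUE
print(class(is_weekday))
```

For this reason, spelling out TRUE and FALSE is usually the more robust habit in scripts.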

2. Numeric Data Type

In R, the numeric data type represents all real numbers, with or without
decimal values. For example,

# value with a decimal part
weight <- 63.5

print(weight)
print(class(weight))

# whole-number value (still stored as numeric, not integer)
height <- 182

print(height)
print(class(height))

Output

[1] 63.5
[1] "numeric"
[1] 182
[1] "numeric"

Here, both weight and height are variables of numeric type.

3. Integer Data Type

The integer data type specifies whole-number values without decimal points.
We use the suffix L to specify integer data. For example,

integer_variable <- 186L

print(class(integer_variable))

Output

[1] "integer"

Here, 186L is an integer value, so we get "integer" when we print the class
of integer_variable.
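
The difference between numeric and integer values can be checked directly. The following sketch uses typeof(), is.integer() and as.integer(), all base R functions; the variable names are chosen for illustration:

```r
# A plain number literal is stored as a double-precision numeric value.
height <- 182
print(typeof(height))                  # "double"
print(is.integer(height))              # FALSE

# The L suffix (or as.integer()) produces a true integer.
height_int <- 182L
print(typeof(height_int))              # "integer"
print(is.integer(as.integer(height)))  # TRUE

# The two compare equal in value even though the storage types differ.
print(height == height_int)            # TRUE
```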

4. Complex Data Type

The complex data type is used to specify complex numbers, which have a real
part and an imaginary part. We use the suffix i to specify the imaginary
part. For example,

# 3 is the real part, 2i is the imaginary part
complex_value <- 3 + 2i

# print class of complex_value
print(class(complex_value))

Output

[1] "complex"

Here, 3 + 2i is of complex data type because it has an imaginary part 2i.
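
To make the two parts of a complex value concrete, base R provides Re(), Im() and Mod(). A short sketch (the variable pure_imaginary is added here only for illustration):

```r
complex_value <- 3 + 2i

print(Re(complex_value))    # real part: 3
print(Im(complex_value))    # imaginary part: 2
print(Mod(complex_value))   # modulus: sqrt(3^2 + 2^2)

# A purely imaginary number is simply a complex value whose real part is 0.
pure_imaginary <- 2i
print(Re(pure_imaginary))   # 0
print(class(pure_imaginary))
```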

5. Character Data Type

The character data type is used to store text (string) values in a variable.
In programming, a string is a sequence of characters. For example, 'A' is a
single character and "Apple" is a string.
You can use single quotes '' or double quotes "" to represent strings. R
treats both the same way and has no separate type for a single character.
By convention, this tutorial uses:
• '' for single characters
• "" for longer strings
For example,

# create a string variable
fruit <- "Apple"

print(class(fruit))

# create a single-character variable
my_char <- 'A'

print(class(my_char))

Output

[1] "character"
[1] "character"

Here, both the variables - fruit and my_char - are of character data type.

6. Raw Data Type

A raw data type specifies values as raw bytes. You can use the following
functions to convert character data to raw data and vice versa:
• charToRaw() - converts character data to raw data
• rawToChar() - converts raw data to character data
For example,

# convert character to raw


raw_variable <- charToRaw("Welcome to Programiz")

print(raw_variable)
print(class(raw_variable))

# convert raw to character


char_variable <- rawToChar(raw_variable)

print(char_variable)
print(class(char_variable))

Output

[1] 57 65 6c 63 6f 6d 65 20 74 6f 20 50 72 6f 67 72 61 6d 69 7a
[1] "raw"
[1] "Welcome to Programiz"
[1] "character"

In this program,

• We first used the charToRaw() function to convert the string "Welcome to
  Programiz" to raw bytes. This is why we get "raw" as output when we print
  the class of raw_variable.
• Then, we used the rawToChar() function to convert the data in raw_variable
  back to character form. This is why we get "character" as output when we
  print the class of char_variable.

R Vectors

A vector is the basic data structure in R that stores data of the same
type. For example, suppose we need to record the ages of 5 employees.
Instead of creating 5 separate variables, we can simply create a vector.

Create a Vector in R
In R, we use the c() function to create a vector. For example,

# create vector of string types


employees <- c("Sabby", "Cathy", "Lucy")

print(employees)

# Output: [1] "Sabby" "Cathy" "Lucy"

In the above example, we have created a vector named employees with
elements: Sabby, Cathy, and Lucy.

Here, the c() function creates a vector by combining three different
elements of employees together.
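
Because a vector holds a single type, c() converts mixed inputs to one common type. The sketch below illustrates this coercion; the vector names are made up for the example:

```r
# Mixing a number, a string and a logical: everything becomes a string.
mixed <- c(1, "two", TRUE)
print(mixed)
print(class(mixed))            # "character"

# Mixing numbers and logicals: TRUE becomes 1 and FALSE becomes 0.
num_and_logical <- c(1.5, TRUE, FALSE)
print(num_and_logical)
print(class(num_and_logical))  # "numeric"
```

When elements of genuinely different types must be kept as they are, a list is the appropriate structure instead.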

Access Vector Elements in R


In R, each element in a vector is associated with a number. The number is
known as a vector index.

We can access elements of a vector using the index number (1, 2, 3 …).
For example,

# a vector of string type


languages <- c("Swift", "Java", "R")

# access first element of languages


print(languages[1]) # "Swift"

# access third element of languages


print(languages[3]) # "R"

In the above example, we have created a vector named languages. Each
element of the vector is associated with an integer number.

[Figure: Vector Indexing in R]

Here, we have used the vector index to access the vector elements:

• languages[1] - accesses the first element "Swift"
• languages[3] - accesses the third element "R"

Note: In R, the vector index always starts with 1. Hence, the first element of
a vector is present at index 1, the second element at index 2, and so on.
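
Indexing is not limited to a single position. The following sketch shows a few other common forms of base R indexing on the same languages vector:

```r
languages <- c("Swift", "Java", "R")

# Several positions at once
print(languages[c(1, 3)])             # "Swift" "R"

# A negative index drops that position instead of selecting it
print(languages[-2])                  # "Swift" "R"

# An index past the end returns NA rather than raising an error
print(languages[5])                   # NA

# The last element, using length()
print(languages[length(languages)])   # "R"
```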
Modify Vector Element
To change a vector element, we can simply reassign a new value to the
specific index. For example,

dailyActivities <- c("Eat","Repeat")


cat("Initial Vector:", dailyActivities)

# change element at index 2


dailyActivities[2] <- "Sleep"

cat("\nUpdated Vector:", dailyActivities)

Output

Initial Vector: Eat Repeat
Updated Vector: Eat Sleep

Here, we have changed the vector element at index 2 from "Repeat" to "Sleep"
by simply assigning a new value.

Numeric Vector in R
Similar to strings, we use the c() function to create a numeric vector. For
example,

# a vector with number sequence from 1 to 5


numbers <- c(1, 2, 3, 4, 5)

print(numbers)

# Output: [1] 1 2 3 4 5

Here, we have used the c() function to create a vector of numeric
sequence called numbers.

However, there is a more concise way to create a numeric sequence. We can
use the : operator instead of c().

Create a Sequence of Numbers in R

In R, we use the : operator to create a vector with numerical values in
sequence. For example,

# a vector with number sequence from 1 to 5


numbers <- 1:5

print(numbers)

Output

[1] 1 2 3 4 5

Here, we have used the : operator to create the vector named numbers with
numerical values in sequence i.e. 1 to 5.
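
The : operator always steps by 1 (or by -1 when counting down). For other step sizes, the base R seq() function can be used; the sketch below is illustrative and the variable names are assumptions:

```r
# Step of 2 from 1 to 9
odds <- seq(1, 9, by = 2)
print(odds)        # 1 3 5 7 9

# Counting down with the : operator
countdown <- 5:1
print(countdown)   # 5 4 3 2 1

# Five evenly spaced values between 0 and 1
fractions <- seq(0, 1, length.out = 5)
print(fractions)   # 0.00 0.25 0.50 0.75 1.00
```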

Repeat Vectors in R
In R, we use the rep() function to repeat elements of vectors. For example,

# repeat sequence of vector 2 times


numbers <- rep(c(2,4,6), times = 2)

cat("Using times argument:", numbers)

Output

Using times argument: 2 4 6 2 4 6


In the above example, we have created a numeric vector with elements 2,
4, 6. Notice the code,

rep(c(2,4,6), times = 2)

Here,

• c(2,4,6) - the vector whose elements are to be repeated
• times = 2 - repeat the vector two times

We can see that we have repeated the whole vector two times. However,
we can also repeat each element of the vector. For this we use
the each parameter.
Let's see an example.

# repeat each element of vector 2 times


numbers <- rep(c(2,4,6), each = 2)

cat("\nUsing each argument:", numbers)

Output

Using each argument: 2 2 4 4 6 6

In the above example, we have created a numeric vector with elements 2,
4, 6. Notice the code,

rep(c(2,4,6), each = 2)

Here, each = 2 repeats each element of the vector two times.

Loop Over an R Vector

We can also access all elements of a vector by using a for loop. For
example,

numbers <- c(1, 2, 3, 4, 5)

# iterate through each element of numbers
for (number in numbers) {
  print(number)
}

Output

[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
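
When the position of each element is needed as well as its value, one common approach is to loop over the indexes with seq_along(). This is a sketch with an illustrative vector, not part of the original example:

```r
languages <- c("R", "Swift", "Python")

# seq_along() yields 1, 2, ..., length(languages)
for (i in seq_along(languages)) {
  cat("Element", i, "is", languages[i], "\n")
}

# seq_along() is safe for empty vectors: the loop body never runs,
# whereas 1:length(x) would count 1, 0 for an empty x.
empty <- character(0)
for (i in seq_along(empty)) {
  print("never reached")
}
```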

Length of Vector in R
We can use the length() function to find the number of elements present
inside the vector. For example,

languages <- c("R", "Swift", "Python", "Java")

# find total elements in languages using length()


cat("Total Elements:", length(languages))

Output

Total Elements: 4

Here, we have used length() to find the length of the languages vector.

List in R Programming:
A list in R is a generic object consisting of an ordered collection of
objects. Lists are one-dimensional, heterogeneous data structures.
A list can hold vectors, matrices, characters, functions, and so on.
In other words, a list is a vector with heterogeneous data elements.

Matrices
Matrices are nothing more than a collection of data elements arranged in a
rectangular layout that is two-dimensional. An example matrix with 3x3
dimensions looks like this.
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
The most important thing you need to remember to get started with
matrices is the matrix() function. This function has the following skeleton.
matrix(
  c(),
  nrow = ,
  ncol = ,
  byrow = )
The first argument is a vector that defines which atomic values are present
in the matrix. The second argument defines how many rows the vector is
split into, and the third argument defines how many columns. The number of
elements in the vector should normally equal nrow * ncol; a shorter vector
is recycled to fill the matrix, and extra elements of a longer vector are
dropped.
The last argument defines whether you want to fill up the matrix by rows or
columns. By default, the argument for byrow is FALSE, which means the
matrix is filled up column by column.
Let's try this one out. Your matrix definition looks like this.
matrix(
  c(1,2,3,4,5,6,7,8),
  nrow = 4,
  ncol = 2,
  byrow = TRUE)
The output should look like this.
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
[4,]    7    8
If you instead ask for a 2 x 2 matrix (nrow = 2, ncol = 2) and omit the
byrow = TRUE argument, the following output greets you.
     [,1] [,2]
[1,]    1    3
[2,]    2    4
Where did the rest of the elements go? The problem is that your vector is
bigger than the matrix size (newer versions of R may also warn about the
mismatch). If you want to get all the values from the vector, change the
column count to ncol = 4. That way you have the following output.
     [,1] [,2] [,3] [,4]
[1,]    1    3    5    7
[2,]    2    4    6    8
Now that you have the foundations, how do you proceed?
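
Before moving on, the effect of byrow can be checked by building the same values both ways and reading individual elements with [row, column] indexing. A minimal sketch; the names values, by_row and by_col are illustrative:

```r
values <- 1:8

by_row <- matrix(values, nrow = 4, ncol = 2, byrow = TRUE)
by_col <- matrix(values, nrow = 4, ncol = 2)   # byrow defaults to FALSE

print(dim(by_row))    # 4 2
print(by_row[2, ])    # second row when filled by rows: 3 4
print(by_col[2, ])    # second row when filled by columns: 2 6
print(by_row[3, 2])   # single element at row 3, column 2: 6
```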

Transpose
This concept comes from linear algebra. Basically, what happens is that the
matrix gets flipped over its diagonal, so rows become columns. In order to
do this in R, you can use the t() function. Let's see how it looks given
the matrix below.
a <- matrix(
  c(1,2,3,4,5,6,7,8),
  nrow = 4,
  ncol = 2,
  byrow = TRUE)
Before transposing, the output looks like this.
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
[4,]    7    8
After transposing with t(a), the output looks like this.
     [,1] [,2] [,3] [,4]
[1,]    1    3    5    7
[2,]    2    4    6    8


Combine Matrices
In order to combine two matrices column-wise, the cbind() function needs to
be used. It takes two matrices as arguments and produces their combination.
When you are combining matrices, you need to make sure that they have the
same number of rows, otherwise R raises an error.
Given are two matrices B and D.
B <- matrix(
  c(2, 4, 3, 1, 5, 7),
  nrow = 3,
  ncol = 2)

D <- matrix(
  c(1, 3, 2),
  nrow = 3,
  ncol = 1)
Their combination can be created the following way.
cbind(B, D)
The output looks like this.
     [,1] [,2] [,3]
[1,]    2    1    1
[2,]    4    5    3
[3,]    3    7    2
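
The row-wise counterpart of cbind() is rbind(), which requires matching column counts instead. The sketch below also shows what happens when the dimensions do not match; the matrices E and bad are hypothetical additions, and tryCatch() is used only so the script keeps running after the error:

```r
B <- matrix(c(2, 4, 3, 1, 5, 7), nrow = 3, ncol = 2)

# rbind() stacks matrices vertically; the column counts must agree.
E <- matrix(c(9, 8), nrow = 1, ncol = 2)
stacked <- rbind(B, E)
print(dim(stacked))   # 4 2

# cbind() with a different number of rows fails; capture the error message.
bad <- matrix(1:2, nrow = 2, ncol = 1)
message_text <- tryCatch(
  cbind(B, bad),
  error = function(e) conditionMessage(e)
)
print(message_text)
```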

Deconstruction
This concept allows you to break down the matrix into its original vector,
which can come in handy in certain situations. Take the following matrix
called H.
H <- matrix(
  c(1,2,3,4,5,6,7,8,9,10),
  nrow = 5,
  ncol = 2)
You are able to deconstruct it with the c() function.
c(H)
The output looks like this.
 [1]  1  2  3  4  5  6  7  8  9 10
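
Note that c() reads the matrix column by column, which is why the original fill order comes back. If the values are wanted row by row instead, one option is to transpose first. A short sketch on the same matrix H:

```r
H <- matrix(1:10, nrow = 5, ncol = 2)

column_order <- c(H)      # column by column: 1 2 3 ... 10
row_order <- c(t(H))      # row by row: 1 6 2 7 3 8 4 9 5 10

print(column_order)
print(row_order)

# as.vector() is an equivalent, more explicit spelling of c(H)
print(identical(as.vector(H), column_order))   # TRUE
```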

Lists
Lists are objects that may contain elements of different types, unlike
atomic vectors, which hold a single type. These elements can be strings,
numbers, vectors, and even another list. You can also have matrices as
elements in your lists. The concept is a general container for special use
cases. The function that allows you to create a list is called list().
An example list would look like this.
data <- list("Server", "Network Device", c(1,2,3,4),
             FALSE, list(1,2,3,4,5,6))
The content of your list is now as follows.
[[1]]
[1] "Server"

[[2]]
[1] "Network Device"

[[3]]
[1] 1 2 3 4

[[4]]
[1] FALSE

[[5]]
[[5]][[1]]
[1] 1

[[5]][[2]]
[1] 2

[[5]][[3]]
[1] 3

[[5]][[4]]
[1] 4

[[5]][[5]]
[1] 5

[[5]][[6]]
[1] 6
You can see that there is not really any limit as to how many or what type
of elements you can store in the list. There is a special function
called names() that allows you to name your list elements, which results in
a dictionary-like data structure. A dictionary data structure consists of
key-value pairs. In this case, the keys are the names and the values are the
actual elements.
Let's give names to the list elements.
names(data) <- c("Hardware", "Network", "vector",
                 "boolean", "nestedlist")
After the function is executed, we can refer to the elements in the list by
their names.
> data$Hardware
[1] "Server"

> data$Network
[1] "Network Device"

> data$nestedlist
[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3

[[4]]
[1] 4

[[5]]
[1] 5

[[6]]
[1] 6
This allows you to build more sophisticated functions and create
abstractions that allow users to understand and maintain the code more
efficiently. As with lists in other programming languages, you can access,
manipulate, and merge the lists. The indexing starts from 1.
In order to access the elements, refer to them by their indexes.
Let's retrieve the first and second elements.
> data[1]
$Hardware
[1] "Server"

> data[2]
$Network
[1] "Network Device"
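
Single brackets return a smaller list, as the $Hardware label in the output shows. To get the element itself, double brackets are used. A minimal sketch with a shortened, hypothetical version of the named list:

```r
data <- list(Hardware = "Server",
             Network = "Network Device",
             vector = c(1, 2, 3, 4))

sub_list <- data[1]     # still a list, of length 1
element <- data[[1]]    # the value stored in the first slot

print(class(sub_list))  # "list"
print(class(element))   # "character"

# Double brackets also accept a name, and the result can be indexed further.
print(data[["vector"]][2])   # 2
```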
In order to remove a specific element, assign the NULL value to its index.
This will reduce the length of your list.
Let's remove the nested list. You can do this in either of two ways. The
second one will only work if you have named your list elements.
# by position
data[5] <- NULL

# or by name
data$nestedlist <- NULL
Suppose you have two lists from different data sources, and you have a
function that needs data from both of them. You have the option to merge
these two lists.
monthids <- list(1,2,3,4,5,6,7,8,9,10,11,12)
months <- list("Jan","Feb","Mar","Apr","May","June","July","Aug",
               "Sep","Oct","Nov","Dec")
The way to achieve this is to use the c() function.
merged.list <- c(monthids, months)
This will produce the following results.
[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3

[[4]]
[1] 4

[[5]]
[1] 5

[[6]]
[1] 6

[[7]]
[1] 7

[[8]]
[1] 8

[[9]]
[1] 9

[[10]]
[1] 10

[[11]]
[1] 11

[[12]]
[1] 12

[[13]]
[1] "Jan"

[[14]]
[1] "Feb"

[[15]]
[1] "Mar"

[[16]]
[1] "Apr"

[[17]]
[1] "May"

[[18]]
[1] "June"

[[19]]
[1] "July"

[[20]]
[1] "Aug"

[[21]]
[1] "Sep"

[[22]]
[1] "Oct"

[[23]]
[1] "Nov"

[[24]]
[1] "Dec"
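The unlist() function allows you to convert your lists to vectors.
myvector <- unlist(merged.list)
Because merged.list mixes numbers and strings, unlist() converts every
element to character. The usual arithmetic operators can be applied only
when all elements are numeric, for example to unlist(monthids).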
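
The type of the vector that unlist() returns depends on what the list contains. The sketch below uses shortened, illustrative versions of the two lists to show when arithmetic is possible:

```r
monthids <- list(1, 2, 3)
months <- list("Jan", "Feb", "Mar")

# A list that mixes numbers and strings flattens to a character vector.
merged_vector <- unlist(c(monthids, months))
print(merged_vector)
print(class(merged_vector))   # "character"

# A list that holds only numbers flattens to a numeric vector,
# so arithmetic works on it.
id_vector <- unlist(monthids)
print(class(id_vector))       # "numeric"
print(sum(id_vector))         # 6
print(id_vector * 10)         # 10 20 30
```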

Arrays
An array is a vector with one or more dimensions. A one-dimensional array
can be considered a vector, and an array with two dimensions can be
considered a matrix. Behind the scenes, data is stored in the form of an
n-dimensional matrix. The array() function can be used to create your own
array. The only restriction is that an array can only store values of a
single data type.
You can create a simple array the following way.
v1 <- c(1,2,3)
v2 <- c(4,5,6,7,8,9)
result <- array(c(v1,v2), dim = c(3,3,2))
Now the result holds an array which has two matrices with three rows and
three columns. The nine values are reused to fill the second matrix.
, , 1

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

, , 2

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
The keyword here is dim. It defines the size of each dimension, that is, the
maximum index along it.
There is a more general syntax that is a skeleton to keep in mind and
comes in handy most of the time.
my_array <- array(data, dim = c(rows, columns, matrices),
                  dimnames = list(row_names, column_names, matrix_names))
You have the option to name your rows, columns and matrices in an array.
Suppose you extend your above code with the following.
v1 <- c(1,2,3)
v2 <- c(4,5,6,7,8,9)
col.names <- c("Item","Serial","Size")
row.names <- c("Server","Network","Firewall")
matrix.names <- c("DataCenter EU","DataCenter US")
result <- array(c(v1,v2), dim = c(3,3,2),
                dimnames = list(row.names, col.names, matrix.names))
Now the result array holds a more meaningful name that makes the code
cleaner and easier to maintain.
, , DataCenter EU

         Item Serial Size
Server      1      4    7
Network     2      5    8
Firewall    3      6    9

, , DataCenter US

         Item Serial Size
Server      1      4    7
Network     2      5    8
Firewall    3      6    9
Accessing the elements is a bit more tricky, but once you get the hang of it,
it should become easy. The skeleton code you should keep in mind is the
following.
result[row, column, matrix]
There is a neat trick with this. If you leave any of the indices empty, all
rows, columns, or matrices along that dimension are returned.
For example, if you were to collect the serial of the first device (Server)
from each datacenter, all you would have to do is write the following.
result[1,2,]
The output should be the following.
DataCenter EU DataCenter US
            4             4
If you were to collect the size of each device from every datacenter, the
following code would do the job.
result[,3,]
The output should be the following.
         DataCenter EU DataCenter US
Server               7             7
Network              8             8
Firewall             9             9
Arrays allow you to extract matrices from them with the following code. Let's
separate each datacenter into its own matrix.
DCEU <- result[,,1]
DCUS <- result[,,2]
The corresponding outputs will be as you expect.
# DCEU
         Item Serial Size
Server      1      4    7
Network     2      5    8
Firewall    3      6    9
# DCUS
         Item Serial Size
Server      1      4    7
Network     2      5    8
Firewall    3      6    9
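
Beyond slicing, an array can be summarised along one of its dimensions with the base R apply() function. The following is a sketch on the unnamed version of the same array; the choice of sum as the summary is only an example:

```r
v1 <- c(1, 2, 3)
v2 <- c(4, 5, 6, 7, 8, 9)
result <- array(c(v1, v2), dim = c(3, 3, 2))

print(dim(result))        # 3 3 2
print(result[2, 3, 1])    # row 2, column 3 of the first matrix: 8

# MARGIN = 3 applies the function once per matrix (third dimension).
matrix_totals <- apply(result, 3, sum)
print(matrix_totals)      # 45 45

# MARGIN = 1 applies it once per row, across all columns and matrices.
row_totals <- apply(result, 1, sum)
print(row_totals)         # 24 30 36
```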

Statistical Commands in R:

R is a statistical programming tool that's uniquely equipped to handle data, and lots of it.
Wrangling large amounts of information and producing publication-ready graphics and
visualizations is easy with R. So are all sorts of data analysis, mining, and modeling tasks.

R Statistics
• Mean, median and mode.
• Minimum and maximum value.
• Percentiles.
• Variance and Standard Deviation.
• Covariance and Correlation.
• Probability distributions.
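
Most of the measures in this list map directly onto base R functions. The sketch below runs them on two small made-up vectors (scores and hours are hypothetical data, not from the original notes):

```r
scores <- c(23, 12, 45, 67, 34, 56, 78)
hours  <- c(2, 1, 5, 6, 3, 4, 8)      # hypothetical second variable

# Minimum and maximum value
print(min(scores))      # 12
print(max(scores))      # 78
print(range(scores))    # 12 78

# Percentiles (here the quartiles)
print(quantile(scores, probs = c(0.25, 0.5, 0.75)))

# Variance and standard deviation
print(var(scores))
print(sd(scores))

# Covariance and correlation between the two variables
print(cov(scores, hours))
print(cor(scores, hours))

# Probability distributions: P(Z <= 0) for a standard normal variable
print(pnorm(0))         # 0.5
```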

Mean in R Programming

Mean, often referred to as the average, is a fundamental statistical measure that represents the
central value of a dataset. It is calculated by summing up all the values in the dataset and
dividing the sum by the number of data points. The mean is commonly used to provide an
overall understanding of the dataset's typical value, making it a widely used measure in data
analysis and research.

Syntax

In R, calculating the mean is straightforward, thanks to the built-in mean() function. The
syntax for computing the mean in R is as follows:

mean(x, trim = 0, na.rm = FALSE)

Here,

• x: The input for which the mean is to be calculated, typically a numeric vector. (Calling
  mean() on a whole data frame returns NA with a warning; use colMeans() or apply mean() to
  each column instead.)
• trim: An optional parameter giving the fraction (0 to 0.5) of observations to drop from
  each end of the sorted data before the mean is calculated. The default value is 0, meaning
  no trimming is applied.
• na.rm: Another optional parameter that determines whether to exclude missing values (NA)
  from the calculation. The default value is FALSE, which keeps NA values, so the result is
  NA if any value is missing. Set this parameter to TRUE to ignore NA values.

Calculate Mean in R

To calculate the mean of a dataset in R, simply use the mean() function with the input vector
as the argument. Let's consider an example:

# Sample data vector


data_vec <- c(23, 12, 45, 67, 34, 56, 78)

# Calculate the mean


mean_result <- mean(data_vec)

# Print the result


print(mean_result)

Output:

[1] 45
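
The optional arguments become important once the data contains missing values or outliers. The sketch below uses small made-up vectors and a hypothetical data frame df to illustrate na.rm, trim and column-wise means:

```r
# Missing values: the default propagates NA
with_na <- c(23, 12, NA, 67)
print(mean(with_na))                 # NA
print(mean(with_na, na.rm = TRUE))   # 34

# Trimming: drop 20% of the observations from each end before averaging
with_outlier <- c(1, 2, 3, 4, 100)
print(mean(with_outlier))               # 22
print(mean(with_outlier, trim = 0.2))   # mean of 2, 3, 4 = 3

# Data frames: compute the mean of each numeric column with colMeans()
df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))
print(colMeans(df))                  # a = 2, b = 20
```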

Median in R Programming
The median is a statistical measure used to represent the central value of a dataset. Unlike the
mean, which is the average of all values, the median is the middle value when the dataset is
arranged in ascending order. It is a robust measure of central tendency, meaning it is not
affected by extreme values or outliers in the data. This makes the median particularly useful
when dealing with datasets that have skewed distributions or contain extreme values.

Syntax

Calculating the median in R is straightforward with the built-in median() function. The syntax
for computing the median is as follows:

median(x, na.rm = FALSE)

Here,

• x: The input for which you want to calculate the median, typically a numeric vector. A
  numeric matrix is treated as one long vector; a data frame is not accepted directly, so
  apply median() to each column instead.
• na.rm: An optional parameter that determines whether to exclude missing values
  (NA) from the calculation. The default value is FALSE, which keeps NA values, so the
  result is NA if any value is missing. Set this parameter to TRUE to ignore NA values.

Parameter
The median() function in R takes two parameters:

• x: The main parameter. The vector can be of any length and should contain numeric values.
• na.rm: Specifies whether to exclude missing values (NA) from the calculation. By default,
  it is set to FALSE. Setting it to TRUE ensures that NA values are ignored during the
  calculation.

Calculate Median in R

To calculate the median of a dataset in R, use the median() function with the input vector
as the argument:

# Sample data vector


data_vec <- c(23, 12, 45, 67, 34, 56, 78)

# Calculate the median


median_result <- median(data_vec)

# Print the result


print(median_result)

Output:

[1] 45

In this example, we have a numeric vector data_vec containing some values. We use
the median() function to calculate the median of this vector and store the result in
the median_result variable. Finally, we print the calculated median.

The median() function works whether the dataset has an odd or an even number of
elements. If the number of elements is odd, the median is the middle value. If the
number of elements is even, the median is the average of the two middle values.

# Sample data vector with an even number of elements


data_vec_even <- c(23, 12, 45, 67, 34, 56)

# Calculate the median for the even-sized dataset


median_result_even <- median(data_vec_even)

# Print the result


print(median_result_even)

Output:

[1] 39.5

Mode in R Programming
The mode is a statistical measure used to identify the value that occurs most frequently in a
dataset. Unlike the mean and median, which represent the central tendencies, the mode
highlights the most commonly recurring value, making it a useful measure for identifying
patterns and understanding the distribution of categorical or discrete data.

Syntax

R does not have a built-in function to directly calculate the statistical mode (the built-in
mode() function reports an object's internal type instead). However, we can create a
user-defined function or use the mlv() function from the modeest package to find the mode.
The syntax for creating the user-defined function or using the mlv() function is as follows:

User-defined Function:

calculate_mode <- function(data) {
  # Function implementation to find the mode
}

Using modeest Package:

library(modeest)
mlv(x, method = "mfv")

Here,

• data: This parameter represents the input vector for which you want to calculate the
  mode. It can be a vector of any length and should contain categorical or discrete
  numeric values.
• x: This parameter is the input vector for which you want to find the mode using
  the mlv() function from the modeest package. The method = "mfv" argument requests the
  most frequent value.

Parameter

The mode calculation in R can be performed using a user-defined function or the mlv()
function from the modeest package. Both methods take the input vector as their main
parameter:

data: This is the main parameter that represents the input vector for which you want to
calculate the mode. The vector can contain either categorical values or discrete numeric
values.

Calculate Mode Using User-defined Function

Since R does not have a built-in function for finding the mode, we can create a user-defined
function to calculate it. The idea behind the user-defined function is to determine the unique
values in the dataset, count their occurrences, and identify the one with the highest frequency.

Let's create a user-defined function named calculate_mode to find the mode:

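calculate_mode <- function(data) {
  # distinct values, in order of first appearance
  uniq_vals <- unique(data)
  # number of occurrences of each distinct value, in the same order
  uniq_counts <- tabulate(match(data, uniq_vals))
  mode_value <- uniq_vals[which.max(uniq_counts)]
  return(mode_value)
}

In this function, we use the unique() function to obtain the unique values in the dataset. Next,
we use match() and tabulate() to count the occurrences of each unique value, keeping the counts
in the same order as the unique values. (Counting with table() instead would sort the counts by
value, so they would no longer line up with the output of unique().) Finally, we find the
value with the highest frequency using which.max() and return it as the mode.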

Let's calculate the mode using the calculate_mode() function:

# Sample data vector


data_vec <- c(23, 12, 45, 67, 34, 56, 78, 23, 56, 34, 23)

# Calculate the mode using the user-defined function


mode_result <- calculate_mode(data_vec)

# Print the mode


print(mode_result)

Output:

[1] 23

In this example, we have a numeric vector data_vec with multiple values, and we use our
user-defined function calculate_mode() to find the mode. The result will be the value that
appears most frequently in the dataset.
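
Because which.max() reports only the first maximum, a function built on it returns a single value even when several values are tied for the highest frequency. One possible variant that returns every tied value is sketched below; the name calculate_modes is hypothetical:

```r
calculate_modes <- function(data) {
  uniq_vals <- unique(data)
  # counts are aligned with uniq_vals (order of first appearance)
  uniq_counts <- tabulate(match(data, uniq_vals))
  # keep every value whose count equals the highest count
  uniq_vals[uniq_counts == max(uniq_counts)]
}

print(calculate_modes(c(1, 2, 2, 3, 3)))    # 2 3
print(calculate_modes(c("a", "b", "a")))    # "a"
print(calculate_modes(c(23, 12, 45, 67, 34, 56, 78, 23, 56, 34, 23)))   # 23
```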

Graphics in R (Gallery with Examples)

This page shows an overview of (almost all) different types of graphics, plots, charts,
diagrams, and figures of the R programming language.

Here is a list of all graph types that are illustrated in this article:

• Barplot
• Boxplot
• Density Plot
• Heatmap
• Histogram
• Line Plot
• Pairs Plot
• Polygon Plot
• QQplot
• Scatterplot
• Venn Diagram

Each type of graphic is illustrated with some basic example code. The code assumes
example data stored in an object named x.

Barplot
Barplot Definition: A barplot (or barchart; bargraph) illustrates the association between
a numeric and a categorical variable. The barplot represents each category as a bar and
reflects the corresponding numeric value with the bar’s size.

The following R syntax shows how to draw a basic barplot in R:

barplot(x) # Draw barplot in R
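
The example data behind x is not reproduced in these notes, so the following sketch substitutes a small made-up named vector; the values, the title and the axis label are all assumptions for illustration:

```r
# Hypothetical data: one numeric value per category
x <- c(A = 3, B = 7, C = 5)

# barplot() draws one bar per element and uses the names as bar labels.
# It returns the x-axis midpoints of the bars (invisibly).
bar_midpoints <- barplot(x,
                         main = "Example barplot",
                         ylab = "Value")
print(bar_midpoints)
```

The returned midpoints can be useful later, for instance to place text labels above the bars.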

Our example barplot looks as follows:

[Figure: example barplot]

Stacked Barplot with Legend

When we have data with several subgroups (e.g. male and female), it is often useful to
plot a stacked barplot in R. For this task, we need to create some new example data:

data <- as.matrix(data.frame(A = c(0.2, 0.4),   # Create matrix for stacked barchart
                             B = c(0.3, 0.1),
                             C = c(0.7, 0.1),
                             D = c(0.1, 0.2),
                             E = c(0.3, 0.3)))
rownames(data) <- c("Group 1", "Group 2")
data                                            # Print matrix to console
#           A   B   C   D   E
# Group 1 0.2 0.3 0.7 0.1 0.3
# Group 2 0.4 0.1 0.1 0.2 0.3

Based on the previous output of the RStudio console, you can see what our example
data should look like: It’s a matrix consisting of a column for each bar and a row for each
group.

Now, we can draw a stacked barchart by specifying our previously created matrix as
input data for the barplot function:

barplot(data,                                   # Create stacked barchart
        col = c("#1b98e0", "#353436"))

Furthermore, we should add a legend to our stacked bargraph to illustrate the meaning
of each color:

legend("topright",                              # Add legend to barplot
       legend = c("Group 1", "Group 2"),
       fill = c("#1b98e0", "#353436"))

Data Manipulation in R with data.table

Data manipulation is a crucial step in the data analysis process, as it allows us


to prepare and organize our data in a way that is suitable for the specific
analysis or visualization. There are many different tools and techniques for
data manipulation, depending on the type and structure of the data, as well
as the specific goals of the manipulation.

The data.table package is an R package that provides an enhanced version of


the data.frame class in R. It’s syntax and features make it easier and faster to
manipulate and work with large datasets.

The date.table is one of the most downloaded packages by developers and an


ideal choice for Data Scientists.

Installing the data.table package

Installing the data.table package is as simple as installing any other package. You can run the commands below in the R console to install it:

Installing the ‘data.table’ package from CRAN

install.packages('data.table')

Installing the development version from GitLab

install.packages("data.table",
repos="https://Rdatatable.gitlab.io/data.table")

Importing Datasets
R ships with many built-in datasets that can be used as demo data to show how R functions work.

One popular built-in dataset is the “Iris” dataset. It provides measurements of four attributes for 150 flowers (50 from each of three species).
The way we work with datasets in data.table is quite different from the way we work with them in data.frame. Let’s look at this in more detail.

data.table provides the fread() function (fast read), which is essentially data.table’s version of read.csv(). Like read.csv(), it can read a file stored locally as well as a file hosted on a website.

Example

Consider the program below, which imports the iris data stored as a CSV file on the internet and checks the class of the result:

# Importing library
library(data.table)
# Creating a dataset
myDataset <- fread("https://raw.githubusercontent.com/gexijin/learnR/master/datasets/iris.csv")
# Print the class of the imported dataset
print(class(myDataset))

Output

[1] "data.table" "data.frame"

As the output shows, the imported data is stored directly as a data.table.

The data.table class inherits from the data.frame class, so a data.table is also a data.frame. Functions that accept a data.frame will therefore generally work on a data.table as well.
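
The inheritance described above can be checked directly. This is a minimal sketch that assumes the data.table package is installed and, to avoid a network download, converts R's built-in iris data frame with as.data.table() instead of reading the CSV; the variable name dt is chosen only for illustration.

```r
library(data.table)

# Assumption: convert the built-in iris data frame instead of downloading a CSV
dt <- as.data.table(iris)

class(dt)           # both "data.table" and "data.frame"
is.data.frame(dt)   # TRUE, because data.table inherits from data.frame

# Base R functions written for data frames accept the data.table directly
nrow(dt)
colMeans(dt[, 1:4])
```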

Displaying IRIS Dataset

Example

# Importing library
library(data.table)
# Creating a dataset
myDataset <- fread(
"https://raw.githubusercontent.com/gexijin/learnR/master/datasets/iris.csv")
# print the iris dataset
print(myDataset)

Output

     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
  1:          5.1         3.5          1.4         0.2    setosa
  2:          4.9         3.0          1.4         0.2    setosa
  3:          4.7         3.2          1.3         0.2    setosa
  4:          4.6         3.1          1.5         0.2    setosa
  5:          5.0         3.6          1.4         0.2    setosa
 ---
146:          6.7         3.0          5.2         2.3 virginica
147:          6.3         2.5          5.0         1.9 virginica
148:          6.5         3.0          5.2         2.0 virginica
149:          6.2         3.4          5.4         2.3 virginica
150:          5.9         3.0          5.1         1.8 virginica

There are 150 rows and 5 columns in the Iris data set.

Let’s print the first six rows of the iris dataset:

head(myDataset)

Output

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1:          5.1         3.5          1.4         0.2  setosa
2:          4.9         3.0          1.4         0.2  setosa
3:          4.7         3.2          1.3         0.2  setosa
4:          4.6         3.1          1.5         0.2  setosa
5:          5.0         3.6          1.4         0.2  setosa
6:          5.4         3.9          1.7         0.4  setosa

Filtering Rows Based on a Condition


With a plain data.frame, column names are not visible inside the square brackets, so each column has to be referenced through the data frame itself (for example, df$column). This makes selecting or filtering rows on column conditions verbose.

data.table evaluates the expression inside the square brackets within the scope of the table, so it knows its own column names. Using data.table, we can filter rows simply by passing column conditions inside the square brackets.

myDataset[column_condition]

Here column_condition specifies the condition on the columns that determines which rows are selected.

Let us consider an example that filters the dataset with the condition "Sepal.Length==5.1 & Petal.Length==1.4".

Example

# Importing library
library(data.table)
# Creating a dataset
myDataset <- fread(
"https://raw.githubusercontent.com/gexijin/learnR/master/datasets/iris.csv")
# datatable syntax to filter rows
# based on column condition
myDataset[Sepal.Length==5.1 & Petal.Length==1.4,]

Output

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1:          5.1         3.5          1.4         0.2  setosa
2:          5.1         3.5          1.4         0.3  setosa

As the output shows, two rows match the column condition given inside the square brackets and are returned.
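
To make the difference from a plain data.frame concrete, the same filter can be written both ways. This is a sketch under the assumption that the built-in iris data frame holds the same values as the CSV used above; df, dt, df_rows and dt_rows are names introduced only for this comparison.

```r
library(data.table)

# Assumption: use the built-in iris data frame rather than the downloaded CSV
df <- iris
dt <- as.data.table(iris)

# data.frame: columns must be referenced through the data frame
df_rows <- df[df$Sepal.Length == 5.1 & df$Petal.Length == 1.4, ]

# data.table: column names can be used directly inside the brackets
dt_rows <- dt[Sepal.Length == 5.1 & Petal.Length == 1.4]

nrow(df_rows)
nrow(dt_rows)
```

Both forms select the same rows; the data.table form is shorter because the condition is evaluated with the table's columns in scope.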

Selecting Columns
We will now see how to select columns of a dataset using the data.table package. The basic syntax for selecting columns is given below:
myDataset[, column_number, with = FALSE]

Here column_number is the position of the column that you want to select (columns are 1-based).

Example

Let’s consider an example in which we select the second column of the iris dataset:

library(data.table)

# Creating a dataset
myDataset <- fread(
"https://raw.githubusercontent.com/gexijin/learnR/master/datasets/iris.csv")
# data.table syntax to subset the second column
myDataset[, 2, with = FALSE]

Output

Sepal.Width
1: 3.5
2: 3.0
3: 3.2
4: 3.1
5: 3.6
---
146: 3.0
147: 2.5
148: 3.0
149: 3.4
150: 3.0

As you can see above in the output, the second column of the iris dataset is
selected.

Example

Now let’s select multiple columns. In the example below, we select two columns, 'Petal.Length' and 'Species'.

# Importing library
library(data.table)

# Creating a dataset
myDataset <- fread(
"https://raw.githubusercontent.com/gexijin/learnR/master/datasets/iris.csv")

columns <- c('Petal.Length', 'Species')

# Selecting two columns: 'Petal.Length' and 'Species'
myDataset[, columns, with = FALSE]

Output

Petal.Length Species
1: 1.4 setosa
2: 1.4 setosa
3: 1.3 setosa
4: 1.5 setosa
5: 1.4 setosa
---
146: 5.2 virginica
147: 5.0 virginica
148: 5.2 virginica
149: 5.4 virginica
150: 5.1 virginica

Here we selected two columns, 'Petal.Length' and 'Species'.
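
data.table also offers other ways to write the same column selection. The sketch below assumes a reasonably recent data.table version (one that supports the ".." prefix) and uses the built-in iris data in place of the CSV; by_with, by_prefix and by_list are illustrative names only.

```r
library(data.table)

dt <- as.data.table(iris)   # assumption: built-in iris instead of the CSV
columns <- c("Petal.Length", "Species")

by_with   <- dt[, columns, with = FALSE]      # form used in the text
by_prefix <- dt[, ..columns]                  # ".." looks up `columns` outside the table
by_list   <- dt[, .(Petal.Length, Species)]   # unquoted names inside .()

identical(names(by_with), names(by_prefix))
identical(names(by_with), names(by_list))
```

The .() form is convenient when the column names are known in advance, while with = FALSE or the ".." prefix suits column names held in a variable.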


Concept of cluster:
Clustering in the R programming language is an unsupervised learning technique in which the data set is partitioned into several groups, called clusters, based on similarity. Segmenting the data in this way produces several clusters, and the objects within a cluster share common characteristics. In data mining and analysis, clustering is used to find groups of similar observations.
Types of Clustering in R Programming
In R, there are numerous clustering algorithms to choose from. Here are a
few of the most popular clustering techniques in R:
1. K-means clustering: a data-partitioning technique that divides the data into k clusters and assigns each observation to the cluster with the closest mean.
2. Hierarchical clustering: a technique that builds a hierarchy of clusters by repeatedly splitting or merging clusters according to their similarity.
3. DBSCAN clustering: a density-based technique that groups together data points that are close to one another and separates the groups by regions of lower density.
4. Spectral clustering: a technique that turns the clustering problem into a graph-partitioning problem by using the eigenvectors of the similarity matrix.
5. Fuzzy clustering: instead of allocating each data point to a single cluster, fuzzy clustering allows points to belong to several clusters with varying degrees of membership.
6. Density-based clustering: a class of techniques that groups data points based on density rather than distance.
7. Ensemble clustering: a technique for improving clustering performance by combining several clustering methods or several runs of the same algorithm.
Each clustering technique has its own advantages and disadvantages and suits different kinds of data and clustering problems. The properties of the data and the objectives of the analysis determine which technique is best.
Applications of Clustering in R Programming Language
• Marketing: Clustering helps identify market patterns and therefore likely buyers. Grouping customers by their interests and showing them products that match those interests can increase the chance of a purchase.
• Medical Science: New medicines and treatments are developed all the time, and researchers sometimes discover new species. A clustering algorithm can help categorize them based on their similarities.
• Games: A clustering algorithm can be used to recommend games to a user based on their interests.
• Internet: A user browses many websites based on their interests. Browsing history can be aggregated and clustered, and the clustering results used to build a profile of the user.
Methods of Clustering
There are two types of clustering in R programming:
• Hard clustering: Each data point either belongs to a cluster completely or not at all, so every point is assigned to exactly one cluster. K-means is a typical hard clustering algorithm.
• Soft clustering: Instead of placing each data point in a single cluster, soft clustering assigns each point a probability or likelihood of belonging to every cluster. Fuzzy clustering (soft k-means) is a typical soft clustering algorithm.
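
The difference between the two kinds of output can be seen by running one hard and one soft method on the same data. This is only a sketch: it assumes the cluster package (which provides fanny(), a fuzzy clustering function) is installed, and the choice of the iris measurements, k = 3 and the seed are arbitrary illustration values not taken from the text.

```r
library(cluster)   # provides fanny(); assumed to be installed

x <- scale(iris[, 1:4])   # numeric columns only, standardized

set.seed(123)             # arbitrary seed for repeatable k-means starts
hard <- kmeans(x, centers = 3, nstart = 25)
soft <- fanny(x, k = 3)

head(hard$cluster)                # hard: one cluster label per observation
head(round(soft$membership, 2))   # soft: one membership degree per cluster
range(rowSums(soft$membership))   # memberships of each observation sum to 1
```

A hard result is a single vector of labels, whereas a soft result is a matrix with one row per observation and one column per cluster.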
K-Means Clustering in R Programming language
K-Means is an iterative hard clustering technique based on an unsupervised learning algorithm. The total number of clusters is defined in advance by the user, and the data points are clustered based on their similarity. The algorithm also finds the centroid of each cluster.
Algorithm:
• Specify the number of clusters (K): Let us take an example with k = 2 and 5 data points.
• Randomly assign each data point to a cluster: In the example, red and green mark the 2 clusters with the data points randomly assigned to them.
• Calculate the cluster centroids: The cross mark represents the centroid of the corresponding cluster.
• Re-allocate each data point to its nearest cluster centroid: A green data point is reassigned to the red cluster if it is nearer to the centroid of the red cluster.
• Recompute the cluster centroids, and repeat the last two steps until the assignments stop changing.
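
The steps above can be sketched in a few lines of R. This is an illustrative implementation, not the one used by R's built-in kmeans(): the function name simple_kmeans, the max_iter argument, the handling of empty clusters and the five example points are all assumptions made for this sketch.

```r
# simple_kmeans() is a hypothetical helper written for illustration;
# for real work use stats::kmeans().
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  stopifnot(k >= 1, nrow(x) >= k)

  # Randomly assign each point to a cluster (every cluster gets at least one point)
  cluster <- sample(rep_len(seq_len(k), nrow(x)))

  for (iter in seq_len(max_iter)) {
    # Calculate cluster centroids (column means of the member points)
    centers <- do.call(rbind, lapply(seq_len(k), function(j) {
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) == 0) {
        x[sample(nrow(x), 1), ]   # assumption: re-seed an empty cluster at a random point
      } else {
        colMeans(members)
      }
    }))

    # Squared Euclidean distance from every point (rows) to every centroid (columns)
    dist_sq <- matrix(
      sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2)),
      nrow = nrow(x)
    )

    # Re-allocate each point to its nearest centroid
    new_cluster <- max.col(-dist_sq, ties.method = "first")

    if (all(new_cluster == cluster)) break   # assignments stable: stop
    cluster <- new_cluster
  }
  list(cluster = cluster, centers = centers)
}

# Usage with k = 2 and 5 made-up data points
set.seed(42)
pts <- rbind(c(1, 1), c(1.5, 2), c(1, 0.5), c(8, 8), c(9, 9.5))
res <- simple_kmeans(pts, k = 2)
res$cluster
res$centers
```

The cluster labels themselves (1 or 2) depend on the random initial assignment; only the grouping of the points is meaningful.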
Syntax: kmeans(x, centers, nstart)
where,
• x is a numeric matrix or data frame
• centers is either the number of clusters K or a set of initial (distinct) cluster centers
• nstart is the number of random starting sets to try when centers is a number
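
The object returned by kmeans() holds the cluster assignments and summary statistics. The short sketch below uses the mtcars data with centers = 4 and nstart = 25; the fixed seed is an assumption added only to make the random starts repeatable.

```r
df <- scale(na.omit(mtcars))   # standardize the variables before clustering

set.seed(1)   # assumption: fixed seed so the random starts are repeatable
km <- kmeans(df, centers = 4, nstart = 25)

km$size            # number of observations in each cluster
km$centers         # one row of (scaled) variable means per cluster
head(km$cluster)   # cluster label of the first few car models
km$tot.withinss    # total within-cluster sum of squares
```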

Example:

# Library required for the fviz_cluster function
install.packages("factoextra")
library(factoextra)

# Loading dataset
df <- mtcars

# Omitting any NA values
df <- na.omit(df)

# Scaling dataset
df <- scale(df)

# Output to be saved as a PNG file
png(file = "KMeansExample.png")

km <- kmeans(df, centers = 4, nstart = 25)

# Visualize the clusters
fviz_cluster(km, data = df)

# Saving the file
dev.off()

# Output to be saved as a PNG file
png(file = "KMeansExample2.png")

km <- kmeans(df, centers = 5, nstart = 25)

# Visualize the clusters
fviz_cluster(km, data = df)

# Saving the file
dev.off()

Output:
When k = 4
[Figure: K-means clustering of mtcars with 4 clusters]

When k = 5
[Figure: K-means clustering of mtcars with 5 clusters]

This code applies k-means clustering to the mtcars dataset with two different values of centers (4 and 5), visualizes each result with the fviz_cluster function from the factoextra package, and saves the plots as PNG files.
The mtcars dataset contains details on various automobile models, including the number of cylinders, horsepower, and miles per gallon. The na.omit function removes any rows with missing values, and the scale function standardizes each variable to zero mean and unit variance.
The kmeans function then performs k-means clustering with 4 and 5 clusters; nstart = 25 runs the algorithm from 25 random starts and keeps the best result, which improves the chance of finding the global optimum. The fviz_cluster function plots the data points coloured by cluster membership, together with the cluster centers, so that the resulting assignments can be inspected.
The png and dev.off functions save the generated plots as PNG files.
To save the files to the right location on your computer, you might need to modify the file paths or file names passed to png.
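
The effect of nstart can be examined by comparing the total within-cluster sum of squares of a single-start run with that of a 25-start run. This is a sketch only; the seed value and the variable names one_start and many_start are assumptions, and the size of the difference depends on the data and the seed.

```r
df <- scale(na.omit(mtcars))

set.seed(7)   # assumption: arbitrary seed for repeatability
one_start <- kmeans(df, centers = 4, nstart = 1)

set.seed(7)
many_start <- kmeans(df, centers = 4, nstart = 25)

# Lower total within-cluster sum of squares means a tighter clustering
c(nstart_1 = one_start$tot.withinss, nstart_25 = many_start$tot.withinss)
```

With more random starts, the reported value is usually no larger than with a single start, because kmeans keeps the best of the runs.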
Clustering by Similarity Aggregation
Clustering by similarity aggregation, also known as relational clustering or the Condorcet method, compares each data point with all other data points in pairs. For a pair of data points A and B, two quantities are formed: m(A, B), which collects the attributes on which A and B have the same value, and d(A, B), which collects the attributes on which they differ.

where, S is the cluster


The first condition is used to construct the cluster, and the second is used to calculate the global Condorcet criterion. The procedure iterates until the specified number of iterations is completed or the global Condorcet criterion no longer improves.
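
The pairwise comparison behind m(A, B) and d(A, B) can be illustrated on two made-up records. The sketch below rests on assumptions that the text above does not spell out: m and d are treated as counts of agreeing and differing attributes, and the pairwise score is taken as their difference. The records A and B and their attributes are hypothetical.

```r
# Two hypothetical records described by four categorical attributes
A <- c(color = "red", size = "small", shape = "round", origin = "EU")
B <- c(color = "red", size = "large", shape = "round", origin = "US")

m_AB <- sum(A == B)   # attributes on which A and B agree
d_AB <- sum(A != B)   # attributes on which A and B differ

# Assumption: pairwise score taken as agreements minus disagreements
c_AB <- m_AB - d_AB

c(m = m_AB, d = d_AB, c = c_AB)
```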


Concept of Prediction Model

Analysis of Real world Problem
