Lab Manual DAR
Landran, Mohali-140307
LAB MANUAL
Name:
Roll No.
Index
Sr. No. Experiment Name Remarks
R -BASICS
1 Downloading, installing and setting path for R
2 Give an idea of R Data Types.
3 R as a calculator: Perform some arithmetic operations in R
Installation of R
R is a very popular programming language, and to work with it we have to install two things: R and RStudio. R and RStudio work together when creating an R project.
Installing R on the local computer is very easy. First, we must know which operating system we are using so that we can download the appropriate build.
The official site https://cloud.r-project.org provides binary files for major operating systems
including Windows, Linux, and Mac OS. In some Linux distributions, R is installed by default,
which we can verify from the console by entering R.
To install R, either we can get it from the site https://cloud.r-project.org or can use commands from
the terminal.
Install R in Windows
Step 1:
When we click on Download R 3.6.1 for Windows, the download of the R setup begins.
Once the download is finished, we have to run the R setup in the following way:
1) Select the path where we want to install R and proceed to Next.
2) Select all components which we want to install, and then we will proceed to Next.
3) In the next step, we have to select either customized startup or accept the default, and then
we proceed to Next.
4) When we proceed to next, our installation of R in our system will get started:
Install R in Linux (Ubuntu/Debian)
Step 1:
In the first step, we have to update all the required files in our system using sudo apt-get update
command as:
Step 2:
In the second step, we will install R file in our system with the help of sudo apt-get install r-base
as:
Step 3:
Once the installation finishes, we can verify it by entering R in the terminal, which starts the R console.

Task 2
Give an idea of R Data Types
In R, the data type of a variable is determined by the value assigned to it. For example,
x <- 123L
Here, 123L is an integer data. So the data type of the variable x is integer.
x <- 123L
# print value of x
print(x)
# print type of x
print(class(x))
Output
[1] 123
[1] "integer"
R has the following basic data types:
logical
numeric
integer
complex
character
raw
The logical data type in R is also known as boolean data type. It can only have two values: TRUE
and FALSE. For example,
bool1 <- TRUE
print(bool1)
print(class(bool1))
bool2 <- FALSE
print(bool2)
print(class(bool2))
Output:
[1] TRUE
[1] "logical"
[1] FALSE
[1] "logical"
Note: You can also define logical variables with a single letter - T for TRUE or F for FALSE. For
example,
is_weekend <- F
print(class(is_weekend)) # "logical"
In R, the numeric data type represents all real numbers with or without decimal values. For
example,
# decimal values
weight <- 63.5
print(weight)
print(class(weight))
# real numbers
height <- 182
print(height)
print(class(height))
Output
[1] 63.5
[1] "numeric"
[1] 182
[1] "numeric"
The integer data type specifies whole numbers, i.e. values without decimal points. We use the suffix L to specify integer data. For example,
integer_variable <- 186L
print(class(integer_variable))
Output
[1] "integer"
Here, 186L is an integer data. So we get "integer" when we print the class of integer_variable.
The complex data type is used to specify values with an imaginary component in R. We use the suffix i to specify the imaginary part. For example,
# an illustrative complex value
complex_value <- 3 + 2i
print(class(complex_value))
Output
[1] "complex"
The character data type is used to specify character or string values in a variable.
In programming, a string is a set of characters. For example, 'A' is a single character and "Apple"
is a string.
You can use single quotes '' or double quotes "" to represent strings. For example,
fruit <- "Apple"   # a string, in double quotes
my_char <- 'A'     # a single character, in single quotes
print(class(fruit))
print(class(my_char))
Output
[1] "character"
[1] "character"
Here, both the variables - fruit and my_char - are of character data type.
A raw data type specifies values as raw bytes. You can use the following methods to convert character data to raw data and vice-versa:
charToRaw() - converts character data to raw data
rawToChar() - converts raw data to character data
For example,
# convert character data to raw bytes
raw_variable <- charToRaw("Welcome to Programiz")
print(raw_variable)
print(class(raw_variable))
# convert the raw bytes back to character data
char_variable <- rawToChar(raw_variable)
print(char_variable)
print(class(char_variable))
Output
[1] 57 65 6c 63 6f 6d 65 20 74 6f 20 50 72 6f 67 72 61 6d 69 7a
[1] "raw"
[1] "Welcome to Programiz"
[1] "character"
Task 3
R as a calculator: Perform some arithmetic operations in R
R can be used as a powerful calculator by entering equations directly at the prompt in the command
console. Simply type your arithmetic expression and press ENTER. R will evaluate the expressions
and respond with the result. While this is a simple interaction interface, there could be problems if
you are not careful. R will normally execute your arithmetic expression by evaluating each item
from left to right, but some operators have precedence in the order of evaluation. Let's start with
some simple expressions as examples.
+ Addition
- Subtraction
* Multiplication
/ Division
^ Exponentiation
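For example, typing the following expressions at the prompt (each result is shown as a comment and is easy to verify by hand):
7 + 3 * 2      # multiplication binds tighter than addition: 13
(7 + 3) * 2    # parentheses change the evaluation order: 20
10 / 4         # division gives 2.5
2 ^ 5          # exponentiation gives 32
17 - 9         # subtraction gives 8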
We can create user-defined functions in R. They are specific to what a user wants and once created
they can be used like the built-in functions. Below is an example of how a function is created and
used.
Calling a Function
# Create a function to print squares of numbers in sequence.
new.function <- function(a) {
  for (i in 1:a) {
    b <- i^2
    print(b)
  }
}

# Call the function, supplying 6 as an argument.
new.function(6)
Output
[1] 1
[1] 4
[1] 9
[1] 16
[1] 25
[1] 36
Calling the function with 5 instead, for example new.function(5), prints:
[1] 1
[1] 4
[1] 9
[1] 16
[1] 25
The arguments to a function call can be supplied in the same sequence as defined in the function or
they can be supplied in a different sequence but assigned to the names of the arguments.
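The function used in this example is not shown in the excerpt; a minimal sketch consistent with the outputs below:
# Create a function with three arguments.
new.function <- function(a, b, c) {
  result <- a * b + c
  print(result)
}

# Call the function by position: 5*3 + 11 = 26
new.function(5, 3, 11)

# Call the function by argument name: 11*5 + 3 = 58
new.function(a = 11, b = 5, c = 3)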
Output
[1] 26
[1] 58
We can define default values for the arguments in the function definition and then call the function without supplying any argument to get the default result. But we can also call such functions by supplying new values for the arguments to get a non-default result.
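Again, the function definition is not included in the excerpt; a sketch consistent with the outputs below:
# Create a function with default values for its arguments.
new.function <- function(a = 3, b = 6) {
  result <- a * b
  print(result)
}

# Call without arguments: uses the defaults, 3*6 = 18
new.function()

# Call with new values: 9*5 = 45
new.function(9, 5)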
Output
[1] 18
[1] 45
Task 5
Logical operators are used to carry out Boolean operations like AND, OR etc.
Operator Description
! Logical NOT
& Element-wise logical AND
&& Logical AND
| Element-wise logical OR
|| Logical OR
The operators & and | perform element-wise operations, producing a result with the length of the longer operand.
The operators && and || examine only the first element of each operand, producing a single logical value (in R 4.3 and later they require length-one operands).
Zero is considered FALSE and non-zero numbers are taken as TRUE. Let's see an example for this:
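The example vectors are not shown in the excerpt; a sketch consistent with the outputs below (the single-value results for && and || come from older versions of R, which silently used only the first elements):
x <- c(TRUE, FALSE, 0, 6)
y <- c(FALSE, TRUE, FALSE, TRUE)

!x        # element-wise NOT
x & y     # element-wise AND
x && y    # AND on the first elements only (older R)
x | y     # element-wise OR
x || y    # OR on the first elements only (older R)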
Output
[1] FALSE TRUE TRUE FALSE
[1] FALSE FALSE FALSE TRUE
[1] FALSE
[1] TRUE TRUE FALSE TRUE
[1] TRUE
Task 6
# Remove a column
data <- data[, -3]
# Reshape the data frame (convert wide to long format using tidyr)
library(tidyr)
data_long <- gather(data, key = "Variable", value = "Value", -Name)
The syntax of a for loop is:
for (val in sequence) {
  statement
}
Here, sequence is a vector and val takes on each of its values during the loop. In each iteration, statement is evaluated.
x <- c(2, 5, 3, 9, 8, 11, 6)
count <- 0
for (val in x) {
  if (val %% 2 == 0)
    count <- count + 1
}
print(count)
Output
[1] 3
The syntax of a while loop is:
while (test_expression) {
  statement
}
Here, test_expression is evaluated and the body of the loop is entered if the result is TRUE.
The statements inside the loop are executed and the flow returns to evaluate the test_expression again.
This is repeated until test_expression evaluates to FALSE, at which point the loop exits.
i <- 1
while (i < 6) {
print(i)
i = i+1
}
Output
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
A repeat loop has no condition of its own, so inside the statement block we must use the break statement to exit the loop.
x <- 1
repeat {
print(x)
x = x+1
if (x == 6){
break
}
}
Output
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
Task 8
R if statement
The syntax of if statement is:
if (test_expression) {
statement
}
If test_expression is TRUE, the statement gets executed. But if it's FALSE, nothing happens.
Here, test_expression can be a logical or numeric vector, but only the first element is taken into
consideration.
Example 1: if statement
x <- 5
if (x > 0) {
  print("Positive number")
}
Output
[1] "Positive number"
if else statement
The syntax of the if...else statement is:
if (test_expression) {
statement1
} else {
statement2
}
The if...else statement can also be written as a single expression that returns a value. For example,
x <- -5
y <- if(x > 0) 5 else 6
y
Output
[1] 6
Task 9
Syntax: as.factor(object)
Parameters:
Object: Vector to be converted
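The example vector is not shown in the excerpt; a sketch consistent with the output below:
# character vector to be converted into a factor
gender <- c("female", "male", "male", "female")
gender_factor <- as.factor(gender)
print(gender_factor)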
Output:
[1] female male male female
Levels: female male
Task 10
# Creating vectors
x1 <- c("abc", "cde", "def")
x2 <- c(1, 2, 3)
x3 <- c("M", "F")
Output:
Var1 Var2
1 abc M
2 cde M
3 def M
4 abc F
5 cde F
6 def F
Task 11
# Scalar (numeric)
scalar <- 42
# Vector
numeric_vector <- c(1, 2, 3, 4, 5)
character_vector <- c("apple", "banana", "orange")
# Matrix
matrix_example <- matrix(1:9, nrow = 3, ncol = 3)
# Data Frame
data_frame_example <- data.frame(
Name = c("John", "Jane", "Bob"),
Age = c(25, 30, 22),
Score = c(85, 92, 78)
)
# List
list_example <- list(
"Numeric Vector" = numeric_vector,
"Character Vector" = character_vector,
"Matrix" = matrix_example,
"Data Frame" = data_frame_example
)
print("\nNumeric Vector:")
print(numeric_vector)
print("\nCharacter Vector:")
print(character_vector)
print("\nMatrix:")
print(matrix_example)
print("\nData Frame:")
print(data_frame_example)
print("\nList:")
print(list_example)
Task 12
# Creating a Subset
# ('df' is assumed to be an existing data frame that contains a column named row2)
df1 <- subset(df, select = row2)
print("Modified Data Frame")
print(df1)
Task 15
In data science, one of the most common tasks is dealing with missing data. If we have missing data in our dataset, there are several ways to handle it in R. One way is to simply remove
any rows or columns that contain missing data. Another way to handle missing data is to impute
the missing values using a statistical method. This means replacing the missing values with
estimates based on the other values in the dataset. For example, we can replace missing values with
the mean or median value of the variable in which the missing values are found.
Missing Data
In R, the NA symbol is used to represent missing values, and the NaN symbol (which stands for "not a number") is used for undefined arithmetic results such as 0/0. In simple words, both NA and NaN represent missing or undefined values in R.
Let us consider a scenario in which a teacher is entering the marks of all the students in her class into a spreadsheet. By mistake, she forgets to enter the marks of one student. Missing values therefore arise naturally in real data.
R provides us with inbuilt functions using which we can find the missing values. Such inbuilt
functions are explained in detail below −
We can use the built-in is.na() function in R to check for NA values. This function returns a logical vector (TRUE or FALSE). For every NA value in the original data, the corresponding element of the result is TRUE; otherwise it is FALSE.
Example
# vector with some data
myVector <- c(NA, "TP", 4, 6.7, 'c', NA, 12)
myVector
Output
[1] NA "TP" "4" "6.7" "c" NA "12"
# finding NAs
myVector <- c(NA, "TP", 4, 6.7, 'c', NA, 12)
is.na(myVector)
Output
[1] TRUE FALSE FALSE FALSE FALSE TRUE FALSE
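The positions of the NA values can be obtained with which(); a sketch consistent with the output below:
# indices of the NA values
which(is.na(myVector))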
Output
[1] 1 6
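Likewise, the number of NA values can be counted by summing the logical vector; a sketch consistent with the output below:
# count of the NA values
sum(is.na(myVector))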
Output
[1] 2
As you can see in the output, this function produces a vector with TRUE at the positions where myVector holds an NA value.
We can apply the is.nan() function to check for NaN values. This function returns a logical vector (TRUE or FALSE). If NaN values are present in the vector, it returns TRUE at the corresponding positions, otherwise FALSE.
Example
myVector <- c(NA, 100, 241, NA, 0 / 0, 101, 0 / 0)
is.nan(myVector)
Output
[1] FALSE FALSE FALSE FALSE TRUE FALSE TRUE
As you can see in the output, this function produces a vector with TRUE at the positions where myVector holds a NaN value (here, the two 0/0 entries).
Let us consider a scenario in which we want to filter values except for missing values. In R, we
have two ways to remove missing values. These methods are explained below −
The first way to remove missing values from a dataset is through R's modelling functions. These functions accept an na.action parameter that tells the function what to do when an NA value is encountered, causing it to invoke one of the missing-value filter functions. These filter functions replace the original dataset with a new one in which the NA values have been handled. The default setting is na.omit, which completely removes a row if it contains any missing value. An alternative, na.fail, simply stops execution whenever a missing value is encountered. The filter functions are −
na.omit − It simply rules out any rows that contain any missing value and forgets
those rows forever.
na.exclude − This argument ignores rows having at least one missing value.
na.pass − Take no action.
na.fail − It terminates the execution if any of the missing values are found.
Example
myVector <- c(NA, "TP", 4, 6.7, 'c', NA, 12)
na.exclude(myVector)
Output
[1] "TP" "4" "6.7" "c" "12"
attr(,"na.action")
[1] 1 6
attr(,"class")
[1] "exclude"
Example
myVector <- c(NA, "TP", 4, 6.7, 'c', NA, 12)
na.omit(myVector)
Output
[1] "TP" "4" "6.7" "c" "12"
attr(,"na.action")
[1] 1 6
attr(,"class")
[1] "omit"
Example
myVector <- c(NA, "TP", 4, 6.7, 'c', NA, 12)
na.fail(myVector)
Output
Error in na.fail.default(myVector) : missing values in object
As you can see in the output, execution halted because the object contains missing values.
In order to select only the values which are not missing, we first produce a logical vector whose elements are TRUE for NA or NaN values and FALSE for the other values in the given vector.
Example
Let logicalVector be such a vector (we can easily get this vector by applying is.na() function).
myVector1 <- c(200, 112, NA, NA, NA, 49, NA, 190)
logicalVector1 <- is.na(myVector1)
newVector1 = myVector1[! logicalVector1]
print(newVector1)
Output
[1] 200 112 49 190
As you can see in the output, the missing values have been successfully removed from myVector1. The same approach works for NaN values by using is.nan() to build the logical vector.
In this section, we will see how we can fill or populate missing values in a dataset using mean and
median. We will use the apply method to get the mean and median of missing columns.
Step 1 − The very first step is to get the list of columns that contain at least one missing (NA) value.
Example
# Create a data frame
dataframe <- data.frame( Name = c("Bhuwanesh", "Anil", "Jai", "Naveen"),
Physics = c(98, 87, 91, 94),
Chemistry = c(NA, 84, 93, 87),
Mathematics = c(91, 86, NA, NA) )
#Print dataframe
print(dataframe)
Output
Name Physics Chemistry Mathematics
1 Bhuwanesh 98 NA 91
2 Anil 87 84 86
3 Jai 91 93 NA
4 Naveen 94 87 NA
Step 2 − Now we are required to compute the mean and median of the corresponding columns. Since we need to omit NA values in those columns, we pass the na.rm = TRUE argument to the apply() function.
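The computation itself is not shown in the excerpt; a sketch consistent with the outputs below (the columns containing NAs are Chemistry and Mathematics):
# columns that contain at least one NA
missingCols <- colnames(dataframe)[colSums(is.na(dataframe)) > 0]

# mean and median of those columns, ignoring the NA values
meanMissing   <- apply(dataframe[, missingCols], 2, mean,   na.rm = TRUE)
medianMissing <- apply(dataframe[, missingCols], 2, median, na.rm = TRUE)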
print(medianMissing)
Output
Chemistry Mathematics
87.0 88.5
The median of Column Chemistry is 87.0 and that of Mathematics is 88.5.
Step 3 − Now our mean and median values of corresponding columns are ready. In this step, we
will replace NA values with mean and median using mutate() function which is defined under
“dplyr” package.
Example
# Importing library
library(dplyr)
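# The construction of newDataFrameMean is not shown in the original excerpt.
# A sketch using mutate() and the column means computed earlier
# (column names taken from the data frame created in Step 1):
newDataFrameMean <- dataframe %>%
  mutate(Chemistry   = ifelse(is.na(Chemistry),   meanMissing["Chemistry"],   Chemistry),
         Mathematics = ifelse(is.na(Mathematics), meanMissing["Mathematics"], Mathematics))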
newDataFrameMean
Output
Name Physics Chemistry Mathematics
1 Bhuwanesh 98 88 91.0
2 Anil 87 84 86.0
3 Jai 91 93 88.5
4 Naveen 94 87 88.5
Notice the missing values are filled with the mean of the corresponding column.
Example
Now let’s fill the NA values with the median of the corresponding column.
# Importing library
library(dplyr)
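# As above, the construction of newDataFrameMedian is not shown in the excerpt;
# a sketch that replaces each NA with the column median computed earlier:
newDataFrameMedian <- dataframe %>%
  mutate(Chemistry   = ifelse(is.na(Chemistry),   medianMissing["Chemistry"],   Chemistry),
         Mathematics = ifelse(is.na(Mathematics), medianMissing["Mathematics"], Mathematics))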
print(newDataFrameMedian)
Output
Name Physics Chemistry Mathematics
1 Bhuwanesh 98 87 91.0
2 Anil 87 84 86.0
3 Jai 91 93 88.5
4 Naveen 94 87 88.5
The missing values are filled with the median of the corresponding column.
Task 17
Write an R script to handle outliers
Handling outliers in R typically involves identifying and dealing with data points that are
significantly different from the majority of the observations. There are various approaches to
handle outliers, such as removing them, transforming the data, or applying robust statistical
methods. Here's a simple example script that demonstrates how to identify and handle outliers
using the summary(), boxplot(), and winsorize() functions:
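A minimal sketch of such a script in base R (the sample data and the 1.5 * IQR bounds are illustrative; packages such as DescTools provide a ready-made Winsorize() helper):
# Sample data with two obvious outliers (illustrative)
set.seed(42)
values <- c(rnorm(50, mean = 50, sd = 5), 120, -30)

# Inspect the distribution before handling outliers
print(summary(values))
boxplot(values, main = "Before handling outliers")

# Identify outlier bounds using the 1.5 * IQR rule
q     <- quantile(values, c(0.25, 0.75))
lower <- q[1] - 1.5 * IQR(values)
upper <- q[2] + 1.5 * IQR(values)

# Winsorize: cap extreme values at the lower and upper bounds
winsorized <- pmin(pmax(values, lower), upper)

print(summary(winsorized))
boxplot(winsorized, main = "After winsorizing")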
Note: In practice, the approach to handling outliers depends on the specific characteristics of your
data and the goals of your analysis. Winsorization is just one of many possible methods. Other
methods include trimming, transformation, or using robust statistical techniques. Choose the
method that best fits your data and analysis requirements.
Task 18
Handling invalid values in R involves identifying and addressing missing or incorrect data.
Common strategies include imputation for missing values, cleaning or transforming data, or
removing observations with invalid values. Here's a basic example script that demonstrates
handling missing values using the complete.cases() function for removal and na.omit()
function, and imputation using the na.mean() function:
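A minimal sketch of such a script (the sample data are illustrative; the mean imputation is written in base R, which does the same job as the na.mean()/na_mean() helper from the imputeTS package mentioned above):
# Sample data with missing values (illustrative)
data <- data.frame(x = c(1, 2, NA, 4, 5),
                   y = c(10, NA, 30, 40, 50))

cat("Original Data:\n")
print(data)

# Remove observations with missing values using logical indexing
cat("\nComplete cases only:\n")
print(data[complete.cases(data), ])

# ... or using na.omit()
cat("\nAfter na.omit():\n")
print(na.omit(data))

# Impute missing values with the column mean
imputed_data <- data
for (col in names(imputed_data)) {
  col_mean <- mean(imputed_data[[col]], na.rm = TRUE)
  imputed_data[[col]][is.na(imputed_data[[col]])] <- col_mean
}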
# Display the data after imputing missing values with the mean
cat("\nData after Imputing Missing Values with Mean:\n")
print(imputed_data)
This script first displays the original data, then removes observations with missing values using both logical indexing with complete.cases() and the na.omit() function. Finally, it imputes missing values with the column mean (done here in base R; the imputeTS package provides the na.mean()/na_mean() helper for the same purpose).
Task 19
A mosaic plot can be used for plotting categorical data very effectively, with the area of each tile showing the relative proportion of observations.
data(HairEyeColor)
mosaicplot(HairEyeColor)
Task 20
Visualize correlation between sepal length and petal length in iris data set using
scatter plot.
# Load the Iris dataset
data(iris)
# Create a scatter plot for Sepal.Length vs. Petal.Length
SL=iris$Sepal.Length
PL=iris$Petal.Length
plot(SL, PL)
Task 21
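The task description and the construction of summary_stats are not included in this excerpt; a minimal sketch, assuming the iris data set used elsewhere in this manual:
# summary statistics (min, quartiles, mean, max) for each column of iris
data(iris)
summary_stats <- summary(iris)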
cat("\n\nSummary Statistics:\n")
print(summary_stats)
Task 22
Consider the above data and predict the weight of a mouse for a given height
and plot the results using a graph
# Use the existing 'mice_data' and 'model' from the previous code
Logistic Regression: Analyse the iris data set using logistic regression. Note: create a subset of the iris dataset with two species.
# Mice data
height <- c(140, 142, 150, 147, 139, 152, 154, 135, 148, 147)
weight <- c(59, 61, 66, 62, 57, 68, 69, 58, 63, 62)
# Linear regression
linear_model <- lm(weight ~ height)
# Logistic regression
# 'outcome' is not defined in this excerpt; an assumed binary label for illustration
outcome <- ifelse(weight >= 63, 1, 0)  # 1 = obese, 0 = not obese
logistic_model <- glm(outcome ~ height + weight, family = binomial)
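A sketch of the prediction and plotting steps described in the task (the new height value is illustrative):
# Predict the weight of a mouse for a given height
new_height <- data.frame(height = 145)
predicted_weight <- predict(linear_model, newdata = new_height)
print(predicted_weight)

# Plot the data, the fitted line and the prediction
plot(height, weight, xlab = "Height", ylab = "Weight",
     main = "Mouse weight vs. height")
abline(linear_model, col = "blue")
points(new_height$height, predicted_weight, col = "red", pch = 19)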
[Figure: scatter of mouse Weight (roughly 58-68) against Height, with points labelled Obese / Not Obese]
Task 25
library(data.tree)
# Split the attribute into groups by its possible values (sunny, overcast, rainy).
attributeValues <- split(data, data[,attributeCol])
# The information gain is the remainder from the attribute entropy minus the attribute value gains.
systemEntropy - gains
}
for(i in 1:length(childObs)) {
# Construct a child having the name of that attribute value.
child <- node$AddChild(names(childObs)[i])
# Read dataset.
data <- read.table('weather.tsv', header=T)
# Convert the last column to a boolean.
data[, ncol(data)] <- ifelse(tolower(data[, ncol(data)]) == 'yes', T, F)
The C4.5 algorithm, created by Ross Quinlan, implements decision trees. The algorithm starts with
all instances in the same group, then repeatedly splits the data based on attributes until each item is
classified. To avoid overfitting, sometimes the tree is pruned back. C4.5 attempts this
automatically. C4.5 handles both continuous and discrete attributes.
Load packages
J48 is an open-source Java implementation of the C4.5 algorithm, available in the Weka package.
The caret package (Classification And REgression Training) is a set of functions that streamline
learning by providing functions for data splitting, feature selection, model tuning, and more.
library(RWeka)
library(caret)
## Warning: package 'caret' was built under R version 3.2.5
## Loading required package: lattice
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.2.4
First, use caret to create a 10-fold training set. Then train the model. We are using the well-known
iris data set here that is automatically included in R.
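The training call itself is not shown in the excerpt; a sketch consistent with the printed C45Fit results below (method "J48" requires the RWeka package loaded above; the seed value is illustrative):
set.seed(1958)
# 10-fold cross-validation
train_control <- trainControl(method = "cv", number = 10)
# train a C4.5-like tree (Weka's J48) on the iris data
C45Fit <- train(Species ~ ., data = iris,
                method = "J48", trControl = train_control)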
150 samples
4 predictor attributes
3 classes: setosa, versicolor, virginica
C45Fit
## C4.5-like Trees
##
## 150 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 135, 135, 135, 135, 135, 135, ...
## Resampling results:
##
## Accuracy Kappa
## 0.96 0.94
##
## Tuning parameter 'C' was held constant at a value of 0.25
##
Looking at the tree we see that it first split on whether or not the petal width was 0.6 or less. Each
indentation is the next split in the tree.
C45Fit$finalModel
## J48 pruned tree
##
##
## Petal.Width <= 0.6: setosa (50.0)
## Petal.Width > 0.6
## | Petal.Width <= 1.7
## | | Petal.Length <= 4.9: versicolor (48.0/1.0)
## | | Petal.Length > 4.9
## | | | Petal.Width <= 1.5: virginica (3.0)
## | | | Petal.Width > 1.5: versicolor (3.0/1.0)
## | Petal.Width > 1.7: virginica (46.0/1.0)
##
## Number of Leaves : 5
##
## Size of the tree : 9
Task 27
Time Series: Write R script to decompose time series data into random, trend
and seasonal data..
library(astsa, quietly=TRUE, warn.conflicts=FALSE)
library(ggplot2)
library(knitr)
library(printr)
library(plyr)
library(dplyr)
library(lubridate)
library(gridExtra)
library(reshape2)
library(TTR)
kings <- scan('http://robjhyndman.com/tsdldata/misc/kings.dat', skip=3)
head(kings)
## [1] 60 43 67 50 56 42
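The conversion of this vector into a time-series object is not shown; a sketch consistent with the printed output below (the variable name is assumed):
# store the data as a time series object
kingstimeseries <- ts(kings)
kingstimeseries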
## Time Series:
## Start = 1
## End = 42
## Frequency = 1
## [1] 60 43 67 50 56 42 50 65 68 43 65 34 47 34 49 41 13 35 53 56 16 43 69
## [24] 59 48 59 86 55 68 51 33 49 67 77 81 67 71 81 68 70 77 56
births <- scan("http://robjhyndman.com/tsdldata/data/nybirths.dat")
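The data are then stored as a monthly time series starting in January 1946 (the conversion step is not shown in the excerpt; a sketch consistent with the printed series below):
births <- ts(births, frequency = 12, start = c(1946, 1))
births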
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct
## 1946 26.663 23.598 26.931 24.740 25.806 24.364 24.477 23.901 23.175 23.227
## 1947 21.439 21.089 23.709 21.669 21.752 20.761 23.479 23.824 23.105 23.110
## 1948 21.937 20.035 23.590 21.672 22.222 22.123 23.950 23.504 22.238 23.142
## 1949 21.548 20.000 22.424 20.615 21.761 22.874 24.104 23.748 23.262 22.907
## 1950 22.604 20.894 24.677 23.673 25.320 23.583 24.671 24.454 24.122 24.252
## 1951 23.287 23.049 25.076 24.037 24.430 24.667 26.451 25.618 25.014 25.110
## 1952 23.798 22.270 24.775 22.646 23.988 24.737 26.276 25.816 25.210 25.199
## 1953 24.364 22.644 25.565 24.062 25.431 24.635 27.009 26.606 26.268 26.462
## 1954 24.657 23.304 26.982 26.199 27.210 26.122 26.706 26.878 26.152 26.379
## 1955 24.990 24.239 26.721 23.475 24.767 26.219 28.361 28.599 27.914 27.784
## 1956 26.217 24.218 27.914 26.975 28.527 27.139 28.982 28.169 28.056 29.136
## 1957 26.589 24.848 27.543 26.896 28.878 27.390 28.065 28.141 29.048 28.484
## 1958 27.132 24.924 28.963 26.589 27.931 28.009 29.229 28.759 28.405 27.945
## 1959 26.076 25.286 27.660 25.951 26.398 25.565 28.865 30.000 29.261 29.012
## Nov Dec
## 1946 21.672 21.870
## 1947 21.759 22.073
## 1948 21.059 21.573
## 1949 21.519 22.025
## 1950 22.084 22.991
## 1951 22.964 23.981
## 1952 23.162 24.707
## 1953 25.246 25.180
## 1954 24.712 25.688
## 1955 25.693 26.881
## 1956 26.291 26.987
## 1957 26.634 27.735
## 1958 25.912 26.619
## 1959 26.992 27.897
plot.ts(kings)
At this point we could guess that this time series could be described using an additive model, since
the random fluctuations in the data are roughly constant in size over time.
plot.ts(births)
We can see from this time series that there is certainly some seasonal variation in the number of births per month: there is a peak every summer and a trough every winter. Again, it seems like this could be described using an additive model, as the seasonal fluctuations are roughly constant in size over time, do not seem to depend on the level of the time series, and the random fluctuations also seem constant over time.
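The gift series itself is not created in the excerpt; a sketch, assuming the monthly souvenir-shop sales series (fancy.dat) from the same time series data library:
gift <- scan("http://robjhyndman.com/tsdldata/data/fancy.dat")
gift <- ts(gift, frequency = 12, start = c(1987, 1))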
plot.ts(gift)
In this case, an additive model is not appropriate, since the size of the seasonal and random fluctuations changes over time and with the level of the time series. It is then appropriate to transform the time series so that we can model the data with a classic additive model.
Decomposing a time series means separating it into its constituent components, which are usually a trend component and a random component, and, if the data is seasonal, a seasonal component.
Recall that non-seasonal time series consist of a trend component and a random component. Decomposing such a series involves trying to separate the time series into these individual components.
One way to do this is with a smoothing method, such as a simple moving average. The SMA() function in the TTR package can be used to smooth time series data using a moving average. The SMA() function takes the order of the moving average as its n argument; to calculate a moving average of order 5, we set n = 5.
Let's start with n = 3 to see a clearer picture of the trend component of the kings dataset.
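A sketch of the smoothing call (the variable name is assumed):
# smooth the kings series with a simple moving average of order 3
kingsSMA3 <- SMA(kings, n = 3)
plot.ts(kingsSMA3)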
This is better: we can see that the age at death of English kings declined from roughly 55 years to roughly 40 years for a brief period, followed by an increase over the next 20 or so reigns to ages in the 70s.
A seasonal time series, in addition to the trend and random components, also has a seasonal
component. Decomposing a seasonal time series means separating the time series into these three
components. In R we can use the decompose() function to estimate the three components of the
time series.
Let's estimate the trend, seasonal, and random components of the New York births dataset.
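A sketch of the call that produces the object printed below (births is the monthly series created earlier):
birthsComp <- decompose(births)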
birthsComp
## $x
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct
## 1946 26.663 23.598 26.931 24.740 25.806 24.364 24.477 23.901 23.175 23.227
## 1947 21.439 21.089 23.709 21.669 21.752 20.761 23.479 23.824 23.105 23.110
## 1948 21.937 20.035 23.590 21.672 22.222 22.123 23.950 23.504 22.238 23.142
## 1949 21.548 20.000 22.424 20.615 21.761 22.874 24.104 23.748 23.262 22.907
## 1950 22.604 20.894 24.677 23.673 25.320 23.583 24.671 24.454 24.122 24.252
## 1951 23.287 23.049 25.076 24.037 24.430 24.667 26.451 25.618 25.014 25.110
## 1952 23.798 22.270 24.775 22.646 23.988 24.737 26.276 25.816 25.210 25.199
## 1953 24.364 22.644 25.565 24.062 25.431 24.635 27.009 26.606 26.268 26.462
## 1954 24.657 23.304 26.982 26.199 27.210 26.122 26.706 26.878 26.152 26.379
## 1955 24.990 24.239 26.721 23.475 24.767 26.219 28.361 28.599 27.914 27.784
## 1956 26.217 24.218 27.914 26.975 28.527 27.139 28.982 28.169 28.056 29.136
## 1957 26.589 24.848 27.543 26.896 28.878 27.390 28.065 28.141 29.048 28.484
## 1958 27.132 24.924 28.963 26.589 27.931 28.009 29.229 28.759 28.405 27.945
## 1959 26.076 25.286 27.660 25.951 26.398 25.565 28.865 30.000 29.261 29.012
## Nov Dec
## 1946 21.672 21.870
## 1947 21.759 22.073
## 1948 21.059 21.573
## 1949 21.519 22.025
## 1950 22.084 22.991
## 1951 22.964 23.981
## 1952 23.162 24.707
## 1953 25.246 25.180
## 1954 24.712 25.688
## 1955 25.693 26.881
## 1956 26.291 26.987
## 1957 26.634 27.735
## 1958 25.912 26.619
## 1959 26.992 27.897
##
## $seasonal
## Jan Feb Mar Apr May Jun
## 1946 -0.6771947 -2.0829607 0.8625232 -0.8016787 0.2516514 -0.1532556
## 1947 -0.6771947 -2.0829607 0.8625232 -0.8016787 0.2516514 -0.1532556
## 1948 -0.6771947 -2.0829607 0.8625232 -0.8016787 0.2516514 -0.1532556
## 1949 -0.6771947 -2.0829607 0.8625232 -0.8016787 0.2516514 -0.1532556
## 1950 -0.6771947 -2.0829607 0.8625232 -0.8016787 0.2516514 -0.1532556
## 1951 -0.6771947 -2.0829607 0.8625232 -0.8016787 0.2516514 -0.1532556
## 1952 -0.6771947 -2.0829607 0.8625232 -0.8016787 0.2516514 -0.1532556
## 1953 -0.6771947 -2.0829607 0.8625232 -0.8016787 0.2516514 -0.1532556
## 1954 -0.6771947 -2.0829607 0.8625232 -0.8016787 0.2516514 -0.1532556
## 1955 -0.6771947 -2.0829607 0.8625232 -0.8016787 0.2516514 -0.1532556
## 1956 -0.6771947 -2.0829607 0.8625232 -0.8016787 0.2516514 -0.1532556
## 1957 -0.6771947 -2.0829607 0.8625232 -0.8016787 0.2516514 -0.1532556
## 1958 -0.6771947 -2.0829607 0.8625232 -0.8016787 0.2516514 -0.1532556
## 1959 -0.6771947 -2.0829607 0.8625232 -0.8016787 0.2516514 -0.1532556
## Jul Aug Sep Oct Nov Dec
## 1946 1.4560457 1.1645938 0.6916162 0.7752444 -1.1097652 -0.3768197
## 1947 1.4560457 1.1645938 0.6916162 0.7752444 -1.1097652 -0.3768197
## 1948 1.4560457 1.1645938 0.6916162 0.7752444 -1.1097652 -0.3768197
## 1949 1.4560457 1.1645938 0.6916162 0.7752444 -1.1097652 -0.3768197
## 1950 1.4560457 1.1645938 0.6916162 0.7752444 -1.1097652 -0.3768197
## 1951 1.4560457 1.1645938 0.6916162 0.7752444 -1.1097652 -0.3768197
## 1952 1.4560457 1.1645938 0.6916162 0.7752444 -1.1097652 -0.3768197
## 1953 1.4560457 1.1645938 0.6916162 0.7752444 -1.1097652 -0.3768197
## 1954 1.4560457 1.1645938 0.6916162 0.7752444 -1.1097652 -0.3768197
## 1955 1.4560457 1.1645938 0.6916162 0.7752444 -1.1097652 -0.3768197
## 1956 1.4560457 1.1645938 0.6916162 0.7752444 -1.1097652 -0.3768197
## 1957 1.4560457 1.1645938 0.6916162 0.7752444 -1.1097652 -0.3768197
## 1958 1.4560457 1.1645938 0.6916162 0.7752444 -1.1097652 -0.3768197
## 1959 1.4560457 1.1645938 0.6916162 0.7752444 -1.1097652 -0.3768197
##
## $trend
## Jan Feb Mar Apr May Jun Jul
## 1946 NA NA NA NA NA NA 23.98433
## 1947 22.35350 22.30871 22.30258 22.29479 22.29354 22.30562 22.33483
## 1948 22.43038 22.43667 22.38721 22.35242 22.32458 22.27458 22.23754
## 1949 22.06375 22.08033 22.13317 22.16604 22.17542 22.21342 22.27625
## 1950 23.21663 23.26967 23.33492 23.42679 23.50638 23.57017 23.63888
## 1951 24.00083 24.12350 24.20917 24.28208 24.35450 24.43242 24.49496
## 1952 24.27204 24.27300 24.28942 24.30129 24.31325 24.35175 24.40558
## 1953 24.78646 24.84992 24.92692 25.02362 25.16308 25.26963 25.30154
## 1954 25.92446 25.92317 25.92967 25.92137 25.89567 25.89458 25.92963
## 1955 25.64612 25.78679 25.93192 26.06388 26.16329 26.25388 26.35471
## 1956 27.21104 27.21900 27.20700 27.26925 27.35050 27.37983 27.39975
## 1957 27.44221 27.40283 27.44300 27.45717 27.44429 27.48975 27.54354
## 1958 27.68642 27.76067 27.75963 27.71037 27.65783 27.58125 27.49075
## 1959 26.96858 27.00512 27.09250 27.17263 27.26208 27.36033 NA
##              Aug      Sep      Oct      Nov      Dec
##
## $random
## Jan Feb Mar Apr May
## 1946 NA NA NA NA NA
## 1947 -0.237305288 0.863252404 0.543893429 0.175887019 -0.793193109
## 1948 0.183819712 -0.318705929 0.340268429 0.121262019 -0.354234776
## 1949 0.161444712 0.002627404 -0.571689904 -0.749362981 -0.666068109
## 1950 0.064569712 -0.292705929 0.479560096 1.047887019 1.561973558
## 1951 -0.036638622 1.008460737 0.004310096 0.556595353 -0.176151442
## 1952 0.203153045 0.079960737 -0.376939904 -0.853612981 -0.576901442
## 1953 0.254736378 -0.122955929 -0.224439904 -0.159946314 0.016265224
## 1954 -0.590263622 -0.536205929 0.189810096 1.079303686 1.062681891
## 1955 0.021069712 0.535169071 -0.073439904 -1.787196314 -1.647943109
## 1956 -0.316846955 -0.918039263 -0.155523237 0.507428686 0.924848558
## 1957 -0.176013622 -0.471872596 -0.762523237 0.240512019 1.182056891
## 1958 0.122778045 -0.753705929 0.340851763 -0.319696314 0.021515224
## 1959 -0.215388622 0.363835737 -0.295023237 -0.419946314 -1.115734776
## Jun Jul Aug Sep Oct
## 1946 NA -0.963379006 -0.925718750 -0.939949519 -0.709369391
## 1947 -1.391369391 -0.311879006 0.347739583 0.150592147 0.076797276
## 1948 0.001672276 0.256412660 0.119531250 -0.623449519 0.289547276
## 1949 0.813838942 0.371704327 0.225906250 0.081758814 -0.578161058
## 1950 0.166088942 -0.423920673 -0.467718750 -0.433157853 -0.418577724
## 1951 0.387838942 0.499995994 -0.030385417 -0.116407853 -0.033536058
## 1952 0.538505609 0.414370994 0.206656250 0.025133814 -0.161411058
## 1953 -0.481369391 0.251412660 0.100156250 0.148592147 0.110880609
## 1954 0.380672276 -0.679670673 -0.269052083 -0.550157853 -0.282411058
## 1955 0.118380609 0.550245994 1.029447917 0.768592147 0.359422276
## 1956 -0.087577724 0.126204327 -0.437093750 -0.087907853 0.927213942
## 1957 0.053505609 -0.934587340 -0.592927083 0.724717147 0.030713942
## 1958 0.581005609 0.282204327 0.132572917 0.290758814 -0.171994391
## 1959 -1.642077724 NA NA NA NA
## Nov Dec
## 1946 -0.082484776 -0.298388622
## 1947 0.591098558 0.095819712
## 1948 0.154806891 -0.076221955
## 1949 -0.356859776 -0.761638622
## 1950 -0.679651442 -0.513680288
## 1959 NA NA
##
## $figure
## [1] -0.6771947 -2.0829607 0.8625232 -0.8016787 0.2516514 -0.1532556
## [7] 1.4560457 1.1645938 0.6916162 0.7752444 -1.1097652 -0.3768197
##
## $type
## [1] "additive"
##
## attr(,"class")
## [1] "decomposed.ts"
plot(birthsComp)
Seasonally Adjusting
If you have a seasonal time series, you can seasonally adjust the series by estimating the seasonal component and subtracting it from the original time series. We can see below that the seasonally adjusted series consists of just the trend and random components.
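A sketch of the adjustment, subtracting the seasonal component estimated above:
birthsSeasonallyAdjusted <- births - birthsComp$seasonal
plot.ts(birthsSeasonallyAdjusted)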
Task 28:
This method is quite intuitive and is applied to a huge range of time series data. There are many types of exponential smoothing, but we are only going to discuss simple exponential smoothing in this recipe.
The simple exponential smoothing technique is used when the time series data has no trend or seasonal variation. The weight given to each observation is determined by the smoothing parameter alpha:
forecast(t+1) = alpha * y(t) + (1 - alpha) * forecast(t)
where:
alpha lies between 0 and 1; values close to 1 give most of the weight to recent observations, while values close to 0 spread the weight over many past observations.
In this recipe, we will carry out simple exponential smoothing on Google stock data.
# load fpp2, which provides ses() (the package load is inferred from the startup message below)
library(fpp2)
## -- Conflicts ------------------------------------------------ fpp2_conflicts --
## x purrr::flatten() masks jsonlite::flatten()
We will use the ses(train_data, alpha = ..., h = number of steps to forecast ahead) function to carry out the task.
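The training series is not constructed in the excerpt; a sketch, assuming the goog daily closing-price series that ships with the fpp2 package:
# use the first 900 observations as the training window
google.train <- window(goog, end = 900)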
ses_google = ses(google.train,
alpha = 0.3, # alpha value to be 0.3
h = 100)
autoplot(ses_google)
[Figure: autoplot(ses_google) - forecasts from simple exponential smoothing of google.train, with values roughly 400-800 on the y-axis]
# Installing Packages
install.packages("ClusterR")
install.packages("cluster")
# Loading package
library(ClusterR)
library(cluster)
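# The clustering step is not shown in the original excerpt; a sketch
# consistent with the objects used below (iris_1, kmeans.re):
# remove the Species column and fit k-means with 3 clusters
iris_1 <- iris[, -5]
set.seed(240)
kmeans.re <- kmeans(iris_1, centers = 3, nstart = 20)
kmeans.re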
# Confusion Matrix
cm <- table(iris$Species, kmeans.re$cluster)
cm
## Visualizing clusters
y_kmeans <- kmeans.re$cluster
clusplot(iris_1[, c("Sepal.Length", "Sepal.Width")],
y_kmeans,
lines = 0,
shade = TRUE,
color = TRUE,
labels = 2,
plotchar = FALSE,
span = TRUE,
main = paste("Cluster iris"),
xlab = 'Sepal.Length',
ylab = 'Sepal.Width')
Output:
Model kmeans_re:
Three clusters of sizes 50, 62 and 38 were formed. The ratio of the between-cluster sum of squares to the total sum of squares is 88.4%, so the clusters account for most of the variation.
Cluster identification:
Each observation is assigned to the cluster whose centre is nearest. Comparing these cluster assignments with the true species (the confusion matrix below) shows how well the clusters recover the species.
Confusion Matrix:
So, all 50 Setosa are grouped into a single cluster. The second cluster (62 observations) contains 48 Versicolor together with 14 Virginica, and the third cluster (38 observations) contains 36 Virginica together with 2 Versicolor.
The cluster plot shows the three clusters in three different colours, plotted against Sepal.Length and Sepal.Width.
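The helper euclidean_distance() used below is not defined in the excerpt; a minimal definition:
# Euclidean distance between two numeric points (vectors of equal length)
euclidean_distance <- function(p1, p2) {
  sqrt(sum((p1 - p2)^2))
}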
# Function to find the index of the closest point to a given point in a set of points
find_closest_point <- function(point, points) {
distances <- apply(points, 1, function(p) euclidean_distance(point, p))
return(which.min(distances))
}
return(assignments)
}
# Example usage
set.seed(123) # for reproducibility
data <- matrix(rnorm(200), ncol = 2)
k <- 3
num_representatives <- 5
lm(iris$Sepal.Length~(iris$Sepal.Width+iris$Petal.Length+iris$Petal.Width))
Output:
Call:
lm(formula = iris$Sepal.Length ~ (iris$Sepal.Width + iris$Petal.Length + iris$Petal.Width))

Coefficients:
(Intercept)  iris$Sepal.Width  iris$Petal.Length  iris$Petal.Width
        ...               ...                ...           -0.5565
2. Random Forest for Classification
install.packages("randomForest")
library(randomForest)
data(iris)
set.seed(123)
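The model-fitting step is not included in the excerpt; a minimal sketch using the randomForest package loaded above (the number of trees is illustrative):
# Fit a random forest classifier on the iris data
rf_model <- randomForest(Species ~ ., data = iris, ntree = 100)
print(rf_model)

# Out-of-bag predictions compared with the true species
predictions <- predict(rf_model)
table(Predicted = predictions, Actual = iris$Species)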
install.packages("rpart")
library(rpart)
# Load a sample dataset (e.g., Iris dataset)
data(iris)
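The tree-fitting step is not included in the excerpt; a minimal sketch using the rpart package loaded above:
# Fit a classification tree predicting Species from the other variables
tree_model <- rpart(Species ~ ., data = iris, method = "class")
print(tree_model)

# Plot the tree
plot(tree_model, margin = 0.1)
text(tree_model, use.n = TRUE)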
5. Explain the concept of missing values in R. How can you handle them?
- Answer: Missing values in R are represented by `NA`. Handling methods include removing missing values using `na.omit()`, filling them with a specific value using `na.fill()`, or interpolating values using methods like `na.approx()`.
11. Question: What is R and why is it commonly used for data analytics?
12. Question: Explain the difference between data frames and matrices in R.
Answer: In R, matrices can only hold one data type, while data frames can store
different data types. Data frames are more flexible for handling heterogeneous data
as they can accommodate both numeric and character data, making them suitable
for real-world datasets.
13. Question: What is the purpose of the "tidyverse" in R, and name some of its
core packages.
14. Question: How can you handle missing values in a dataset using R?
Answer: In R, you can handle missing values using functions like `na.omit()` to
remove observations with missing values, `complete.cases()` to identify
complete cases, or `na.fill()` to replace missing values with a specified value.
15. Question: What is the ggplot2 package and why is it useful for data visualization in R?
Answer: The ggplot2 package is a powerful tool for creating a wide variety of static
and dynamic visualizations in R. It follows the Grammar of Graphics principles,
providing a consistent and intuitive syntax for creating complex plots, making it
highly effective for exploratory data analysis.