
MADHAV INSTITUTE OF TECHNOLOGY & SCIENCE, GWALIOR

(A Govt. Aided UGC Autonomous & NAAC Accredited Institute Affiliated to RGPV, Bhopal)

Lab Report

on

R Programming

150701

Submitted By:
Mohd Asif Farhan
Satyam Khare
Pushpendra Rajput
0901CS201070
0901CS201092
0901CS201112

Faculty Mentor:
Prof. Khushboo Agrawal, Assistant Professor

Submitted to:

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING


MADHAV INSTITUTE OF TECHNOLOGY & SCIENCE
GWALIOR - 474005 (MP) est. 1957

JULY-DEC 2023
INDEX

S. No. Experiment No. Remarks

1. Install R and RStudio, familiarize with the interface.

2. Practice basic operations, variable creation, and types of operators.

3. Create and manipulate different data structures.

4. Install and load R packages.

5. Perform data manipulation using dplyr.

6. Clean data using tidyr.

7. Create different types of plots using ggplot2.

8. Perform descriptive statistics and hypothesis testing.

9. Import and export data from CSV, Excel, SQL Databases.

10. Practice advanced data manipulation.

11. Create a report with R Markdown.

12. Understand the concept of machine learning and its types.

13. Apply supervised learning algorithm: Linear and Logistic Regression.

14. Apply supervised learning algorithm: Decision Trees and Random Forest.

15. Apply unsupervised learning algorithms: K-Means Clustering and Hierarchical Clustering.

16. Perform Principal Component Analysis.

17. Perform Time Series Analysis in R.

18. Manipulate text data in R for Sentiment Analysis.

19. Implement topic modelling in R.

20. Review and final project using a combination of the techniques learned.
Q1. Install R and RStudio, familiarize with the interface.

Installing R:

● Visit the Comprehensive R Archive Network (CRAN) website: https://cran.r-project.org/.


● Choose a CRAN mirror near you.
● Download the R installation file for your operating system (e.g., Windows, macOS, or Linux).
● Follow the installation instructions for your specific operating system.

Installing RStudio:

● Visit the RStudio download page: https://www.rstudio.com/products/rstudio/download/.


● Download the RStudio Desktop version that matches your operating system (RStudio Desktop is
free and open-source).
● Install RStudio by following the installation instructions for your specific operating system.

Familiarizing with R and RStudio Interfaces:

● Open RStudio after installation.


● You'll see a layout with several panes, including the script editor, console, environment, and plots.
Familiarize yourself with these panes as follows:

● Script Editor: This is where you can write and edit R scripts or code. You can create a new script
by clicking File > New File > R Script.
● Console: The console is where you can interact with R by running individual commands or
scripts. You can type R code directly into the console and press Enter to execute it.
● Environment: This pane displays information about the current R environment, including loaded
datasets and variables. You can view your data, objects, and their properties here.
● Plots: This pane is for displaying plots and charts generated from your R code. You can use it to
visualize data.

Working with R:

● Start by entering simple R commands in the console to get a feel for how it works. For example,
you can try basic arithmetic operations like 2 + 2 or create a vector using my_vector <- c(1, 2, 3).
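
As a quick illustration, here are a few commands you might type at the console (each line runs when you press Enter):

2 + 2                      # basic arithmetic, prints [1] 4
my_vector <- c(1, 2, 3)    # create a numeric vector
my_vector * 10             # vectorized arithmetic, prints [1] 10 20 30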

Creating and Running an R Script:

● In the script editor, write some R code. For instance, you can create a script that prints "Hello, R!"
to the console.

# This is a comment
print("Hello, R!")

● To run the script, you can either select the code and click the "Run" button or use the shortcut
(e.g., Ctrl+Enter or Command+Enter).

Accessing Help:

● R and RStudio have extensive help resources. You can access documentation and packages
through the "Help" menu.
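
Help pages can also be opened directly from the console, for example:

?mean                      # open the help page for mean()
help("plot")               # equivalent to ?plot
help.search("regression")  # search help pages by keyword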
Q2. Practice basic operations, variable creation, and types of
operators.

Basic Arithmetic Operations:

a <- 5
b <- 3
sum_ab <- a + b
diff_ab <- a - b
product_ab <- a * b
quotient_ab <- a / b

print(sum_ab)
print(diff_ab)
print(product_ab)
print(quotient_ab)

Output:

[1] 8
[1] 2
[1] 15
[1] 1.666667

Data Types and Variable Creation:

my_integer <- 42
my_character <- "Hello, R!"
my_logical <- TRUE
my_numeric_vector <- c(1.2, 2.3, 3.4, 4.5)

print(my_integer)
print(my_character)
print(my_logical)
print(my_numeric_vector)

Output:

[1] 42
[1] "Hello, R!"
[1] TRUE
[1] 1.2 2.3 3.4 4.5
Comparison and Logical Operators:

x <- 10
y <- 20
is_equal <- x == y
is_not_equal <- x != y
is_greater_than <- x > y
is_less_than <- x < y
logical_and <- is_equal && is_greater_than
logical_or <- is_not_equal || is_less_than

print(is_equal)
print(is_not_equal)
print(is_greater_than)
print(is_less_than)
print(logical_and)
print(logical_or)

Output:

[1] FALSE
[1] TRUE
[1] FALSE
[1] TRUE
[1] FALSE
[1] TRUE

Conditional Statements:

x <- 10
y <- 20

if (x < y) {
cat("x is less than y\n")
} else if (x > y) {
cat("x is greater than y\n")
} else {
cat("x is equal to y\n")
}

Output:

x is less than y
Functions:

add_numbers <- function(a, b) {
  result <- a + b
  return(result)
}

sum_result <- add_numbers(7, 8)
print(sum_result)

Output:

[1] 15
Q3. Create and manipulate different data structures.

Vectors:

numeric_vector <- c(1, 2, 3, 4, 5)


character_vector <- c("apple", "banana", "cherry")
logical_vector <- c(TRUE, FALSE, TRUE)

second_element <- numeric_vector[2]


vector_sum <- numeric_vector + 10

print(second_element)
print(vector_sum)

Output:

[1] 2
[1] 11 12 13 14 15

Lists:

my_list <- list(name = "John", age = 30, city = "New York")

# Access elements
name_value <- my_list$name

# Modify an element
my_list$age <- 31

print(name_value)
print(my_list)

Output:

[1] "John"
$name
[1] "John"

$age
[1] 31

$city
[1] "New York"
Matrices:

my_matrix <- matrix(1:6, nrow = 2, ncol = 3)

# Matrix operations
matrix_sum <- my_matrix + 5
matrix_product <- my_matrix %*% t(my_matrix)

print(matrix_sum)
print(matrix_product)

Output:

     [,1] [,2] [,3]
[1,]    6    8   10
[2,]    7    9   11

     [,1] [,2]
[1,]   35   44
[2,]   44   56

Data Frames:

my_df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 32, 28),
  City = c("New York", "Los Angeles", "Chicago")
)

# Access columns
name_column <- my_df$Name

# Filter rows
chicago_residents <- my_df[my_df$City == "Chicago", ]

print(name_column)
print(chicago_residents)

Output:

[1] "Alice" "Bob" "Charlie"

     Name Age    City
3 Charlie  28 Chicago
Arrays:

my_array <- array(1:24, dim = c(2, 3, 4))

# Access an element
first_element <- my_array[1, 1, 1]

print(first_element)

Output:

[1] 1
Q4. Install and load R packages.

Installing R Packages:

● Open RStudio.
● To install a package, you can use the install.packages() function. For example, to install the
ggplot2 package for data visualization:
● install.packages("ggplot2")
● R will prompt you to choose a CRAN mirror; select one that is geographically close to your
location.
● R will download and install the package and its dependencies. Once the installation is complete,
you can use the package in your R session.

Loading R Packages:

● To use a package in your R script or session, you need to load it using the library() or require()
function. For example, to load the ggplot2 package:
● library(ggplot2)
● You can also use require() instead of library(), which returns a logical value indicating whether
the package is available. If it's not, require() will not stop the execution of your script.
● Once a package is loaded, you can use its functions and capabilities in your R code.

Checking Loaded Packages:

● To check which packages are currently loaded in your R session, you can use the search()
function:
● search()
● This will display a list of loaded packages and their order of precedence.

Detaching or Unloading Packages:

● To unload a package and remove it from your current R session, you can use the detach()
function. For example, to detach the ggplot2 package:
● detach("package:ggplot2", unload = TRUE)
● Unloading a package can be useful if you want to free up memory or avoid conflicts between
package functions.
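
A minimal sketch tying these steps together, using ggplot2 as the example package (install only if missing, load it, confirm it is attached, then detach it):

# Install ggplot2 only if it is not already installed
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Load the package, confirm it is attached, then detach it
library(ggplot2)
search()
detach("package:ggplot2", unload = TRUE)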
Q5. Perform data manipulation using dplyr.

# Load the dplyr package


library(dplyr)

# Create a data frame


df <- data.frame(
Name = c("Alice", "Bob", "Charlie", "David"),
Age = c(28, 35, 22, 29),
City = c("New York", "Los Angeles", "Chicago", "Houston")
)

selected_df <- select(df, Name, Age)


filtered_df <- filter(df, Age > 30)
mutated_df <- mutate(df, BirthYear = 2023 - Age)
sorted_df <- arrange(df, Age)
grouped_df <- group_by(df, City)

result <- df %>%
  filter(Age > 25) %>%
  select(Name, City) %>%
  arrange(City)

# Save the result to a new data frame


new_df <- result

# View the result in the viewer


View(result)

# Print the result in the console


print(result)

Output:

# Output for selected_df (select)


# A tibble: 4 × 2
Name Age
<chr> <dbl>
1 Alice 28
2 Bob 35
3 Charlie 22
4 David 29
# Output for filtered_df (filter)
# A tibble: 1 × 3
Name Age City
<chr> <dbl> <chr>
1 Bob 35 Los Angeles

# Output for mutated_df (mutate)


# A tibble: 4 × 4
Name Age City BirthYear
<chr> <dbl> <chr> <dbl>
1 Alice 28 New York 1995
2 Bob 35 Los Angeles 1988
3 Charlie 22 Chicago 2001
4 David 29 Houston 1994

# Output for sorted_df (arrange)


# A tibble: 4 × 3
Name Age City
<chr> <dbl> <chr>
1 Charlie 22 Chicago
2 Alice 28 New York
3 David 29 Houston
4 Bob 35 Los Angeles

# Output for grouped_df (group_by)


# A tibble: 4 × 3
# Groups: City [4]
Name Age City
<chr> <dbl> <chr>
1 Alice 28 New York
2 Bob 35 Los Angeles
3 Charlie 22 Chicago
4 David 29 Houston

# Output for result (chained operations)


# A tibble: 3 × 2
Name City
<chr> <chr>
1 David Houston
2 Bob Los Angeles
3 Alice New York

# View of the result is displayed in the viewer


# Output for result (printing in the console)
# A tibble: 3 × 2
Name City
<chr> <chr>
1 David Houston
2 Bob Los Angeles
3 Alice New York
Q6. Clean data using tidyr.

# Load the tidyr package


library(tidyr)

# Create a data frame


df <- data.frame(
  Name = c("Alice, 28, New York", "Bob, 35, Los Angeles", "Charlie, 22, Chicago", "David, 29, Houston")
)

# Separate the single column into Name, Age, and City columns
separated_df <- separate(df, col = Name, into = c("Name", "Age", "City"), sep = ", ")

# Convert long data to wide format


wide_df <- spread(separated_df, key = Age, value = City)

# Convert wide data to long format


long_df <- gather(wide_df, Age, City, -Name)

# Replace missing values with "Unknown"


cleaned_df <- replace_na(long_df, list(City = "Unknown"))

# Save the cleaned data frame


cleaned_data <- cleaned_df

# View the cleaned data in the viewer


View(cleaned_df)

# Print the cleaned data in the console


print(cleaned_df)

Output:

# Separating the single column into three columns


separated_df
# Output:
# Name Age City
# 1 Alice 28 New York
# 2 Bob 35 Los Angeles
# 3 Charlie 22 Chicago
# 4 David 29 Houston
# Converting long data to wide format
wide_df
# Output:
#      Name      22       28      29          35
# 1   Alice      NA New York      NA          NA
# 2     Bob      NA       NA      NA Los Angeles
# 3 Charlie Chicago       NA      NA          NA
# 4   David      NA       NA Houston          NA

# Converting wide data to long format


long_df
# Output:
# Name Age City
# 1 Alice 22 NA
# 2 Bob 22 NA
# 3 Charlie 22 Chicago
# 4 David 22 NA
# 5 Alice 28 New York
# 6 Bob 28 NA
# 7 Charlie 28 NA
# 8 David 28 NA
# 9 Alice 29 NA
# 10 Bob 29 NA
# 11 Charlie 29 NA
# 12 David 29 Houston
# 13 Alice 35 NA
# 14 Bob 35 Los Angeles
# 15 Charlie 35 NA
# 16 David 35 NA

# Replacing missing values with "Unknown"


cleaned_df
# Output:
# Name Age City
# 1 Alice 22 Unknown
# 2 Bob 22 Unknown
# 3 Charlie 22 Chicago
# 4 David 22 Unknown
# 5 Alice 28 New York
# 6 Bob 28 Unknown
# 7 Charlie 28 Unknown
# 8 David 28 Unknown
# 9 Alice 29 Unknown
# 10 Bob 29 Unknown
# 11 Charlie 29 Unknown
# 12 David 29 Houston
# 13 Alice 35 Unknown
# 14 Bob 35 Los Angeles
# 15 Charlie 35 Unknown
# 16 David 35 Unknown
Q7. Create different types of plots using ggplot2.

# Load the ggplot2 package


library(ggplot2)

# Create a data frame


df <- data.frame(
Category = c("A", "B", "C", "D", "E"),
Value = c(25, 50, 30, 40, 55)
)

# Create a bar plot


bar_plot <- ggplot(df, aes(x = Category, y = Value)) +
geom_bar(stat = "identity") +
labs(title = "Bar Plot", x = "Categories", y = "Values")

# Display the plot


print(bar_plot)

# Create a scatter plot


scatter_plot <- ggplot(df, aes(x = Category, y = Value)) +
geom_point() +
labs(title = "Scatter Plot", x = "Categories", y = "Values")

# Display the plot


print(scatter_plot)

# Create a line plot


line_plot <- ggplot(df, aes(x = Category, y = Value, group = 1)) +
geom_line() +
labs(title = "Line Plot", x = "Categories", y = "Values")

# Display the plot


print(line_plot)

# Create a box plot


box_plot <- ggplot(df, aes(x = "Data", y = Value)) +
geom_boxplot() +
labs(title = "Box Plot", x = "", y = "Values")

# Display the plot


print(box_plot)
# Create a histogram
histogram <- ggplot(df, aes(x = Value)) +
geom_histogram(binwidth = 10, fill = "blue", color = "black") +
labs(title = "Histogram", x = "Values", y = "Frequency")

# Display the plot


print(histogram)

Output:
Q8. Perform descriptive statistics and hypothesis testing.

# Load the stats package


library(stats)

# Create a data frame (example data)


df <- data.frame(
Height = c(170, 175, 160, 185, 168),
Weight = c(70, 75, 60, 80, 72)
)

# Generate summary statistics


summary(df)

# Mean
mean_height <- mean(df$Height)
mean_weight <- mean(df$Weight)

# Median
median_height <- median(df$Height)
median_weight <- median(df$Weight)

# Mode (requires additional package)


library(dplyr)
mode_height <- df %>%
group_by(Height) %>%
count() %>%
filter(n == max(n)) %>%
pull(Height)
mode_weight <- df %>%
group_by(Weight) %>%
count() %>%
filter(n == max(n)) %>%
pull(Weight)

# Variance
var_height <- var(df$Height)
var_weight <- var(df$Weight)

# Standard Deviation
sd_height <- sd(df$Height)
sd_weight <- sd(df$Weight)
# Range
range_height <- range(df$Height)
range_weight <- range(df$Weight)

Output:

Height Weight
Min. :160.0 Min. :60.0
1st Qu.:168.0 1st Qu.:70.0
Median :170.0 Median :72.0
Mean :171.6 Mean :71.4
3rd Qu.:175.0 3rd Qu.:75.0
Max. :185.0 Max. :80.0

Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

filter, lag

The following objects are masked from 'package:base':

intersect, setdiff, setequal, union

[Execution complete with exit code 0]

# Load the car package (for some advanced tests)


library(car)

group_a <- c(23, 28, 32, 25, 30)


group_b <- c(19, 22, 27, 20, 24)

# Perform a two-sample t-test


t_test_result <- t.test(group_a, group_b)

# View test results


t_test_result

Output:

Welch Two Sample t-test

data: group_a and group_b


t = 2.3935, df = 7.8728, p-value = 0.0441
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.1759501 10.2240499
sample estimates:
mean of x mean of y
27.6 22.4

[Execution complete with exit code 0]


Q9. Import and export data from CSV, Excel, SQL
Databases.
Import:

# Import data from a CSV file


data <- read.csv("your_data.csv")

# Install and load the readxl package


install.packages("readxl")
library(readxl)

# Import data from an Excel file


data <- read_excel("your_data.xlsx")

# Install and load the required packages


install.packages("DBI")
install.packages("RSQLite")
library(DBI)
library(RSQLite)

# Connect to the SQLite database


con <- dbConnect(RSQLite::SQLite(), dbname = "your_database.db")

# Retrieve data using an SQL query


data <- dbGetQuery(con, "SELECT * FROM your_table")

# Close the database connection


dbDisconnect(con)

Export:

# Export data to a CSV file


write.csv(data, file = "output_data.csv", row.names = FALSE)

# Install and load the required package


install.packages("openxlsx")
library(openxlsx)

# Export data to an Excel file


write.xlsx(data, file = "output_data.xlsx")

# Establish a connection to the SQLite database


con <- dbConnect(RSQLite::SQLite(), dbname = "your_database.db")

# Insert data into a table using SQL


dbWriteTable(con, "output_table", data, overwrite = TRUE)

# Close the database connection


dbDisconnect(con)
Q10. Practice advanced data manipulation.
# Sample data frames
data_frame1 <- data.frame(
ID = c(1, 2, 3, 4, 5),
Value1 = c(10, 20, 30, 40, 50)
)

data_frame2 <- data.frame(
  ID = c(3, 4, 5, 6, 7),
  Value2 = c(100, 200, 300, 400, 500)
)

wide_data <- data.frame(
  ID = 1:5,
  Var1 = c(10, 20, 30, 40, 50),
  Var2 = c(15, 25, 35, 45, 55)
)

df <- data.frame(
Category = c("A", "B", "A", "C", "B"),
Value = c(25, 50, 30, 40, 55)
)

# Merging Data Frames


merged_data <- merge(data_frame1, data_frame2, by = "ID")
print(merged_data)

# Joining Data Frames using dplyr


library(dplyr)
joined_data <- left_join(data_frame1, data_frame2, by = "ID")
print(joined_data)

# Reshaping Data from Wide to Long


library(tidyr)
long_data <- pivot_longer(wide_data, cols = -ID, names_to = "variable", values_to = "value")
print(long_data)

# Reshaping Data from Long to Wide


wide_data <- pivot_wider(long_data, names_from = "variable", values_from = "value")
print(wide_data)

# Aggregating Data using Base R


aggregated_data <- aggregate(Value ~ Category, data = df, FUN = sum)
print(aggregated_data)

# Aggregating Data using dplyr


aggregated_data <- df %>%
group_by(Category) %>%
summarize(Sum = sum(Value))
print(aggregated_data)

# Removing Rows with Missing Values


raw_data <- data.frame(
Name = c("Alice", NA, "Bob", "Charlie", "David"),
Age = c(28, 35, 22, NA, 29)
)

cleaned_data <- na.omit(raw_data)


print(cleaned_data)

# Dropping Rows with Missing Values using tidyr


cleaned_data <- raw_data %>%
drop_na()
print(cleaned_data)

# Create a New Variable


old_data <- data.frame(
ID = 1:5,
Value = c(10, 20, 30, 40, 50)
)

new_data <- old_data %>%
  mutate(NewVariable = Value * 2)
print(new_data)

# Calculate a Simple Moving Average


library(xts)

data <- data.frame(
  Date = as.Date(c("2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05")),
  Value = c(10, 20, 30, 40, 50)
)

data_xts <- xts(data$Value, order.by = data$Date)


moving_average <- rollmean(data_xts, k = 3, fill = NA)
print(moving_average)
Output:

ID Value1 Value2
1 3 30 100
2 4 40 200
3 5 50 300

Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

filter, lag

The following objects are masked from 'package:base':

intersect, setdiff, setequal, union

ID Value1 Value2
1 1 10 NA
2 2 20 NA
3 3 30 100
4 4 40 200
5 5 50 300

# A tibble: 10 x 3

ID variable value
<int> <chr> <dbl>
1 1 Var1 10
2 1 Var2 15
3 2 Var1 20
4 2 Var2 25
5 3 Var1 30
6 3 Var2 35
7 4 Var1 40
8 4 Var2 45
9 5 Var1 50
10 5 Var2 55

# A tibble: 5 x 3

ID Var1 Var2
<int> <dbl> <dbl>
1 1 10 15
2 2 20 25
3 3 30 35
4 4 40 45
5 5 50 55

Category Value
1 A 55
2 B 105
3 C 40

# A tibble: 3 x 2

Category Sum
<chr> <dbl>
1 A           55
2 B          105
3 C           40

Name Age
1 Alice 28
3 Bob 22
5 David 29

Name Age
1 Alice 28
2 Bob 22
3 David 29

ID Value NewVariable
1 1 10 20
2 2 20 40
3 3 30 60
4 4 40 80
5 5 50 100

Error in library(xts) : there is no package called 'xts'


Execution halted

[Execution complete with exit code 1]


Q11. Create a report with R Markdown.

1. Install and Load Required Packages:

You need to have the rmarkdown package installed and loaded to create R Markdown documents.
# Install and load the rmarkdown package
install.packages("rmarkdown")
library(rmarkdown)

2. Create a New R Markdown Document:

In RStudio, you can create a new R Markdown document by going to File > New File > R Markdown....
This will open a dialog where you can configure your R Markdown document.

3. Author Your R Markdown Document:

In the R Markdown document, you can include a mix of text, code chunks, and Markdown formatting.
Here's an example of a simple R Markdown document:

---
title: "Sample R Markdown Report"
output:
  html_document:
    toc: true
---

# Introduction

This is a sample R Markdown report. We'll include some code and plots in this report.

# Data Loading

```{r}
# Load data
data <- read.csv("data.csv")
head(data)
```

# Data Summary

```{r}
# Summary statistics
summary(data)
```

# Data Visualization

```{r}
# Create a scatter plot
library(ggplot2)
ggplot(data, aes(x = X, y = Y)) +
  geom_point()
```

# Conclusion

In this example:

- The `title` field in the YAML header sets the title of your report.
- Under the "Data Loading" section, there's an R code chunk that loads data from a CSV file.
- In the "Data Summary" section, another R code chunk provides summary statistics of the data.
- The "Data Visualization" section contains R code for creating a scatter plot using the `ggplot2` package.
- The report includes text sections along with code chunks that can be executed to generate results and
visualizations.

4. Knit Your R Markdown Document:

To generate the report, click the "Knit" button in RStudio, or call `rmarkdown::render()` with your R
Markdown file as an argument. This will run the code chunks and produce an HTML (or other format)
report.
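
For example, assuming the document above is saved as report.Rmd (a hypothetical file name), it can be rendered from the console:

# Render the R Markdown file to HTML (the file name is an example)
rmarkdown::render("report.Rmd", output_format = "html_document")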

5. View and Share Your Report:

Once the knitting process is complete, you can view the generated report. It will include your text, code
results, and plots, making it easy to communicate your data analysis in a comprehensive document. You
can save the HTML or other output formats and share them as needed.

R Markdown is a versatile tool for creating dynamic and reproducible reports, and you can customize
your documents to include various elements like tables, LaTeX equations, citations, and more. For
advanced formatting and customization, refer to the R Markdown documentation and cheat sheets.
Q12. Understand the concept of machine learning and its
types.

Machine learning is a subfield of artificial intelligence (AI) that focuses on the development of algorithms
and statistical models that enable computer systems to learn from and make predictions or decisions based
on data. The primary goal of machine learning is to develop models that can identify patterns, extract
insights, and make predictions or decisions without being explicitly programmed.

Here's an overview of the key concepts and types of machine learning:

Key Concepts in Machine Learning:

Data: Machine learning relies on data, which serves as the primary source of information for training,
testing, and validating models. Data can come in various forms, including structured (e.g., tables),
unstructured (e.g., text or images), and semi-structured (e.g., JSON or XML).

Features: Features are characteristics or attributes extracted from the data that the model uses for
learning. Features play a crucial role in model performance.

Labels: In supervised learning, models are trained with labeled data, where the correct output or category
is known. The model learns to map inputs (features) to corresponding outputs (labels).

Training: The training phase involves feeding the model with a dataset and adjusting its internal
parameters to minimize the difference between its predictions and the actual labels.

Testing and Validation: After training, the model is tested or validated using a separate dataset to
evaluate its performance and generalization to new, unseen data.

Predictions and Decisions: Once trained, a machine learning model can make predictions on new,
unlabeled data or make decisions based on the learned patterns.

Types of Machine Learning:

Machine learning can be broadly categorized into three main types, based on the learning approach and
the availability of labeled data:

Supervised Learning:

In supervised learning, models are trained on labeled data, where both input features and their
corresponding output labels are known.
The goal is to learn a mapping function from input to output, enabling the model to predict labels for new,
unseen data.
Common algorithms include linear regression, logistic regression, decision trees, and neural networks.

Unsupervised Learning:

Unsupervised learning is used when the data lacks labeled output, and the goal is to discover patterns,
structures, or groupings within the data.
Clustering and dimensionality reduction are common tasks in unsupervised learning.
Algorithms include k-means clustering, hierarchical clustering, and principal component analysis (PCA).

Reinforcement Learning:

Reinforcement learning is focused on training agents to make sequences of decisions to maximize
cumulative rewards in an environment.
Agents learn through trial and error, receiving feedback in the form of rewards or penalties.
It's commonly used in applications like autonomous robotics and game playing.
Popular reinforcement learning algorithms include Q-learning and deep reinforcement learning using
neural networks.

Additionally, there are other subtypes and approaches within machine learning, including semi-supervised
learning (combining labeled and unlabeled data), self-supervised learning (learning from data without
explicit labels), and transfer learning (applying knowledge from one task to another).

Machine learning plays a critical role in various applications, including image and speech recognition,
natural language processing, recommendation systems, autonomous vehicles, and many others. It has
become an essential tool for extracting knowledge and making predictions from vast and complex
datasets.
Q13. Apply supervised learning algorithm: Linear and
Logistic Regression.

1. Linear Regression:

# Create a simple dataset


data <- data.frame(
X = c(1, 2, 3, 4, 5),
Y = c(2, 4, 5, 4, 5)
)

# Train a linear regression model


model <- lm(Y ~ X, data = data)

# Make predictions
new_data <- data.frame(X = c(6, 7, 8))
predictions <- predict(model, newdata = new_data)

# Visualize the data and the regression line


plot(data$X, data$Y, main = "Linear Regression", xlab = "X", ylab = "Y")
abline(model, col = "red")

Output:
2. Logistic Regression:

# Create a binary classification dataset


data <- data.frame(
Exam_Score = c(78, 92, 60, 84, 74, 55, 70, 68, 90, 85),
Passed = c(1, 1, 0, 1, 1, 0, 1, 0, 1, 1)
)

# Train a logistic regression model


model <- glm(Passed ~ Exam_Score, data = data, family = binomial(link = "logit"))

# Make predictions
new_data <- data.frame(Exam_Score = c(65, 95, 75))
predictions <- predict(model, newdata = new_data, type = "response")

# Visualize the data and the logistic regression curve


plot(data$Exam_Score, data$Passed, main = "Logistic Regression", xlab = "Exam Score", ylab =
"Probability")
ord <- order(data$Exam_Score)
lines(data$Exam_Score[ord], predict(model, type = "response")[ord], col = "red")

Output:
Q14. Apply supervised learning algorithm: Decision Trees
and Random Forest.

1. Decision Trees:

Decision Trees are used for classification and regression tasks. They create a tree-like model of decisions
and their possible consequences. In R, you can use the rpart package to build decision trees.

Classification Example:

# Load the rpart package


library(rpart)

# Create a classification dataset


data <- data.frame(
Age = c(25, 30, 35, 40, 45, 50, 55, 60),
Outcome = c("No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes")
)

# Train a decision tree model for classification


# minsplit is lowered so this tiny example dataset can actually produce splits
model <- rpart(Outcome ~ Age, data = data, method = "class", control = rpart.control(minsplit = 2))

# Make predictions
new_data <- data.frame(Age = c(38, 52))
predictions <- predict(model, new_data, type = "class")

# Visualize the decision tree


library(rpart.plot)
prp(model)

Regression Example:

# Create a regression dataset


data <- data.frame(
X = c(1, 2, 3, 4, 5),
Y = c(2, 4, 5, 4, 5)
)

# Train a decision tree model for regression


model <- rpart(Y ~ X, data = data, control = rpart.control(minsplit = 2))

# Make predictions
new_data <- data.frame(X = c(6, 7, 8))
predictions <- predict(model, new_data)

# Visualize the decision tree


prp(model)

2. Random Forest:

Random Forest is an ensemble learning method that combines multiple decision trees to improve
accuracy and reduce overfitting. In R, you can use the randomForest package.

Classification Example:

# Load the randomForest package


library(randomForest)

# Create a classification dataset


data <- data.frame(
Feature1 = c(1, 2, 3, 4, 5),
Feature2 = c(2, 4, 5, 4, 5),
Class = factor(c("A", "B", "A", "B", "A"))
)

# Train a Random Forest model for classification


model <- randomForest(Class ~ Feature1 + Feature2, data = data)

# Make predictions
new_data <- data.frame(Feature1 = c(6, 7), Feature2 = c(5, 4))
predictions <- predict(model, new_data)

# Evaluate model performance on the training data
# (predictions on new_data cannot be tabulated against data$Class, since the lengths differ)
train_predictions <- predict(model, data)
confusion_matrix <- table(train_predictions, data$Class)
print(confusion_matrix)

Regression Example:

# Create a regression dataset


data <- data.frame(
X = c(1, 2, 3, 4, 5),
Y = c(2, 4, 5, 4, 5)
)

# Train a Random Forest model for regression


model <- randomForest(Y ~ X, data = data)
# Make predictions
new_data <- data.frame(X = c(6, 7, 8))
predictions <- predict(model, new_data)

# Visualize variable importance


importance(model)
Q15. Apply unsupervised learning algorithms: K-Means
Clustering and Hierarchical Clustering.

1. K-Means Clustering:

K-Means is a partitioning method that divides a dataset into K clusters based on similarity. It aims to
minimize the sum of squared distances within each cluster.

# Create a dataset
data <- data.frame(
X = c(1, 2, 2, 3, 5, 6, 7, 8, 8, 9),
Y = c(2, 3, 4, 3, 6, 5, 6, 7, 8, 8)
)

# Train a K-Means clustering model


kmeans_model <- kmeans(data, centers = 3)

# View cluster assignments


cluster_assignments <- kmeans_model$cluster
print(cluster_assignments)

# Visualize the clusters


library(ggplot2)
ggplot(data, aes(X, Y, color = as.factor(cluster_assignments))) +
geom_point() +
geom_point(data = as.data.frame(kmeans_model$centers), aes(X, Y), color = "black", size = 4)

2. Hierarchical Clustering:

Hierarchical Clustering creates a tree-like structure (dendrogram) that represents the relationships
between data points. You can then cut the dendrogram at a certain level to form clusters.

# Create a dataset
data <- data.frame(
X = c(1, 2, 2, 3, 5, 6, 7, 8, 8, 9),
Y = c(2, 3, 4, 3, 6, 5, 6, 7, 8, 8)
)

# Perform Hierarchical Clustering


dist_matrix <- dist(data)
hc_model <- hclust(dist_matrix, method = "ward.D2")
# Cut the dendrogram to form clusters
num_clusters <- 3
cluster_cut <- cutree(hc_model, k = num_clusters)

# Visualize the clusters


library(dendextend)
dendro <- as.dendrogram(hc_model)
dendro <- color_branches(dendro, k = num_clusters)
plot(dendro)

K-Means and Hierarchical Clustering are powerful techniques for discovering natural groupings within
data. The choice between them depends on the nature of the data and the desired number of clusters.
Experimenting with different clustering techniques and evaluating their results is common practice in
unsupervised learning.
Q16. Perform Principal Component Analysis.

Principal Component Analysis (PCA) is a dimensionality reduction technique used to reduce the
complexity of high-dimensional data while preserving important information. It does this by transforming
the original variables into a new set of uncorrelated variables called principal components. Here's how to
perform PCA in R:

1. Load Data:

Load your dataset into R. For this example, let's assume you have a dataset named my_data with features
in columns.
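
For illustration, the steps below assume my_data contains only numeric columns; one convenient stand-in (an assumption for this sketch) is the numeric part of R's built-in iris dataset:

# Example stand-in for my_data: the four numeric columns of iris
my_data <- iris[, 1:4]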

2. Standardize the Data:

PCA is sensitive to the scale of the data, so it's a good practice to standardize it to have zero mean and
unit variance. You can use the scale() function for this.

# Standardize the data


scaled_data <- scale(my_data)

3. Perform PCA:

Use the prcomp() function to perform PCA on the standardized data. You can specify the number of
principal components you want to keep.

# Perform PCA and keep all principal components


pca_result <- prcomp(scaled_data)

# To specify the number of components to keep, you can use:


# pca_result <- prcomp(scaled_data, retx = TRUE, rank. = k)

4. Explore Results:

You can access various attributes of the PCA result to explore the analysis, including:

pca_result$center: The means of the variables.


pca_result$scale: The standard deviations of the variables.
pca_result$sdev: The standard deviations of the principal components.
pca_result$rotation: The loadings (coefficients) of the variables on the principal components.
pca_result$x: The transformed data in the principal component space.

5. Visualize the Results:


Visualize the explained variance by each principal component. You can create a scree plot to understand
how many components are needed to capture most of the variance.

# Create a scree plot


screeplot(pca_result)

6. Interpret the Principal Components:

You can interpret the principal components based on the loadings of the original variables on each
component. Positive or negative loadings indicate the direction and strength of the variables' influence on
the principal components.

7. Choose the Number of Components:

Based on the scree plot and the amount of variance explained, decide how many principal components to
retain for your analysis.
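
One common way to support this decision is to compute the proportion of variance explained by each component from pca_result$sdev:

# Proportion of variance explained by each principal component
explained_var <- pca_result$sdev^2 / sum(pca_result$sdev^2)

# Cumulative proportion, useful for choosing how many components to keep
cumulative_var <- cumsum(explained_var)
print(round(cumulative_var, 3))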

8. Transform Data with Selected Components:

Use the predict() function to transform your data into the space of the selected principal components.

# Keep, for example, the first two principal components


selected_components <- 2
reduced_data <- predict(pca_result, newdata = scaled_data)[, 1:selected_components]
Q17. Perform Time Series Analysis in R.

Time series analysis is a crucial technique for analyzing and modeling data that varies over time, such as
stock prices, temperature records, or sales data. In R, you can perform time series analysis using various
packages, but one of the most commonly used packages is stats for basic time series analysis and the
forecast package for more advanced forecasting tasks. Here's a step-by-step guide on performing time
series analysis in R:

1. Load the Required Packages:

# Load the necessary packages


library(stats)
library(forecast)

2. Create or Load Time Series Data:

You can create a time series object in R using the ts() function or load time series data from a file. Ensure
that your data has a timestamp or time index.

# Create a time series object (e.g., 24 months of monthly data covering 2020-2021;
# decompose() below needs at least two full seasonal cycles)
ts_data <- ts(c(10, 15, 20, 25, 30, 35, 40, 35, 30, 25, 20, 15,
                12, 17, 22, 27, 32, 37, 42, 37, 32, 27, 22, 17),
              start = c(2020, 1), frequency = 12)

# Load time series data from a file (e.g., CSV)


# ts_data <- read.csv("your_time_series_data.csv")

3. Visualize the Time Series:

To understand your data better, it's essential to plot the time series.

# Plot the time series


plot(ts_data, main = "Time Series Data", xlab = "Year", ylab = "Value")

4. Decompose the Time Series:

Decomposing a time series helps to separate it into its constituent components, such as trend, seasonality,
and noise.

# Decompose the time series


decomposed <- decompose(ts_data)
plot(decomposed)

5. Perform Basic Time Series Analysis:


Use functions like acf() (autocorrelation function) and pacf() (partial autocorrelation function) to
understand the autocorrelation in your data.

# Autocorrelation and partial autocorrelation plots


acf(ts_data)
pacf(ts_data)

6. Build Time Series Models:

You can use various models like ARIMA (AutoRegressive Integrated Moving Average) or Exponential
Smoothing for forecasting time series data.

# Fit an ARIMA model


arima_model <- auto.arima(ts_data)

7. Make Forecasts:

Use your time series model to make future forecasts.

# Make forecasts
forecast_values <- forecast(arima_model, h = 12) # Forecast for the next 12 time periods
plot(forecast_values, main = "Time Series Forecast")

8. Evaluate the Forecast:

You can evaluate the accuracy of your forecasts using metrics like Mean Absolute Error (MAE) or Mean
Squared Error (MSE).

# Evaluate the forecast


accuracy(forecast_values)

9. Visualize the Forecast:

Plot the original time series data along with the forecasted values.

# Plot the original time series and forecast


plot(ts_data, main = "Time Series Data and Forecast", xlab = "Year", ylab = "Value")
lines(forecast_values$mean, col = "blue")
legend("topleft", legend = "Forecast", col = "blue")
Q18. Manipulate text data in R for Sentiment Analysis.

Text data manipulation is an essential step in preparing data for sentiment analysis in R. In sentiment
analysis, you typically need to clean and preprocess the text data to make it suitable for analysis. Here are
the key steps involved in manipulating text data for sentiment analysis in R:

1. Load Required Libraries:

First, you need to load the necessary libraries for text data manipulation and sentiment analysis.
Commonly used packages include tm (Text Mining), stringr, and tidytext.

library(tm)
library(stringr)
library(tidytext)

2. Load and Prepare Text Data:

Load your text data, which could be in a CSV file, a data frame, or a text corpus. Ensure that your data
contains a column with the text you want to analyze.

# Load your text data (replace 'your_data.csv' with your data source)
text_data <- read.csv("your_data.csv")

# Create a text corpus


corpus <- Corpus(VectorSource(text_data$your_text_column))

3. Data Cleaning:

Text data is often messy, so you need to clean it by removing special characters, numbers, and other
unwanted elements. You can also convert the text to lowercase.

# Clean the text data


corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, stripWhitespace)

4. Tokenization:

Tokenization is the process of splitting text into individual words or tokens, making it suitable for
analysis.
The tm package has no wordTokenize transformation; tokenization is handled in the next step by
tidytext::unnest_tokens(), which splits each document into one word per row.

5. Sentiment Analysis:

You can use sentiment lexicons or pre-trained models to perform sentiment analysis. For example, the
tidytext package provides a sentiment lexicon, and you can use it to determine the sentiment of each word
in the text.

# Perform sentiment analysis using the tidytext package


library(tidytext)
library(dplyr)
library(tidyr)

# Convert the corpus to a data frame so tidytext can work with it
text_df <- data.frame(
  document = seq_along(corpus),
  text = sapply(corpus, as.character),
  stringsAsFactors = FALSE
)

# Tokenize the text and join each word with the NRC sentiment lexicon
# (get_sentiments("nrc") uses the textdata package and downloads the lexicon on first use)
text_sentiment <- text_df %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("nrc"), by = "word")

# Summarize sentiment by text element (e.g., document)
sentiment_summary <- text_sentiment %>%
  group_by(document, sentiment) %>%
  summarise(sentiment_count = n(), .groups = "drop") %>%
  pivot_wider(names_from = sentiment, values_from = sentiment_count, values_fill = 0)

6. Analyze Sentiment:

You can now analyze the sentiment of the text data by aggregating and summarizing the sentiment scores.

# Analyze sentiment
head(sentiment_summary)

This will give you a summary of sentiment scores for each document or text element.

7. Interpret Results:

Based on the sentiment scores, you can interpret whether the text is generally positive, negative, or
neutral.
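
As a simple sketch of this interpretation, assuming sentiment_summary contains "positive" and "negative" columns from the NRC lexicon, a net score per document can be computed:

# Net sentiment per document: positive word count minus negative word count
sentiment_summary$net_score <- sentiment_summary$positive - sentiment_summary$negative
sentiment_summary$overall <- ifelse(sentiment_summary$net_score > 0, "positive",
                                    ifelse(sentiment_summary$net_score < 0, "negative", "neutral"))
head(sentiment_summary[, c("document", "net_score", "overall")])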

These are the fundamental steps for manipulating text data for sentiment analysis in R. Depending on the
complexity of your analysis, you may need to explore additional text preprocessing and sentiment
analysis techniques, such as custom lexicons or machine learning models for sentiment classification.
Q19. Implement topic modelling in R.

Topic modeling is a technique used to discover topics or themes within a collection of documents. In R,
the tm and topicmodels packages are commonly used for topic modeling. Here's a step-by-step guide on
how to implement topic modeling in R:

1. Load the Required Libraries:

First, load the necessary libraries for text preprocessing and topic modeling.

library(tm)
library(topicmodels)

2. Prepare and Preprocess Text Data:

Load your text data and preprocess it, similar to the steps for sentiment analysis. Cleaning, tokenization,
and creating a Document-Term Matrix (DTM) are crucial.

# Load your text data (replace 'your_data.csv' with your data source)
text_data <- read.csv("your_data.csv")

# Create a text corpus


corpus <- Corpus(VectorSource(text_data$your_text_column))

# Clean the text data


corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, stripWhitespace)

# Tokenization is handled internally when the Document-Term Matrix is built below,
# so no separate tokenization step is needed (tm has no wordTokenize transformation)

# Create a Document-Term Matrix (DTM)


dtm <- DocumentTermMatrix(corpus)

3. Build the Topic Model:

Now, you can build a topic model using the LDA() function from the topicmodels package. Specify the
number of topics (k) you want to discover.

# Build the topic model


k <- 5 # Number of topics
lda_model <- LDA(dtm, k = k)

4. Explore Topics:

You can explore the topics and associated words using the terms() function. This will give you a list of
words for each topic.

# Explore topics
terms(lda_model, 5) # Show the top 5 words for each topic

5. Assign Topics to Documents:

You can assign topics to documents in your dataset using the topics() function from the topicmodels package.

# Assign topics to documents


topic_assignments <- as.data.frame(topics(lda_model))
text_data_with_topics <- cbind(text_data, topic_assignments)

6. Interpret Topics:

Inspect the top words in each topic to interpret what each topic represents. This will help you label the
topics based on the words associated with them.
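
To support this interpretation, you can also inspect the per-document topic probabilities returned by posterior():

# Per-document topic probabilities (one row per document, one column per topic)
topic_probs <- posterior(lda_model)$topics
round(head(topic_probs), 3)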

7. Visualize Topics:

You can visualize the topics and their relationships using various visualization techniques, including word
clouds, bar plots, or network graphs.

# Visualize topics using word clouds


library(wordcloud)
wordcloud(terms(lda_model, 10))

Topic modeling is a valuable technique for discovering latent themes or topics in text data. The choice of
the number of topics (k) is a crucial decision and might require experimentation. Additionally, topic
modeling can be further enhanced with more advanced techniques, such as using other topic modeling
algorithms or performing sentiment analysis within each topic to gain deeper insights.
Q20. Review and final project using a combination of the
techniques learned.

A final project that combines various data analysis and machine learning techniques can be an excellent
way to apply what you've learned in R. Such a project can be both challenging and rewarding. Here's a
sample project idea that combines multiple techniques:

Project Idea: Predicting Customer Churn in a Telecom Company

In this project, you'll work on a simulated dataset from a telecom company to predict customer churn.
You'll combine data preprocessing, exploratory data analysis, feature engineering, machine learning, and
evaluation techniques. Here's an outline of the project:

1. Data Collection:

Start by obtaining the dataset. You can simulate customer data with features such as customer
demographics, usage patterns, contract details, and customer churn status (whether they churned or not).

2. Data Preprocessing:

Clean the data by handling missing values and outliers.


Perform feature scaling if necessary.
Encode categorical variables.
Split the data into training and testing sets.

3. Exploratory Data Analysis (EDA):

Conduct EDA to understand the relationships between different features and the target variable (churn).
Visualize the data using various plots and charts to gain insights.

4. Feature Engineering:

Create new features or modify existing ones that may be useful for predicting customer churn.
Extract relevant information from features like contract length and usage patterns.

5. Machine Learning:

Select machine learning algorithms suitable for the classification task (e.g., logistic regression, decision
trees, random forests, or gradient boosting).
Train and evaluate multiple models using cross-validation.
Tune hyperparameters to optimize model performance.
Consider ensembling techniques if necessary; a minimal modeling sketch is shown below.
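
A minimal sketch of the modeling step, using a small simulated churn data frame and logistic regression (all column names and values here are illustrative assumptions, not a real telecom dataset):

# Simulated churn data (illustrative only)
set.seed(42)
churn_data <- data.frame(
  tenure_months  = sample(1:72, 200, replace = TRUE),
  monthly_charge = round(runif(200, 20, 120), 2),
  contract       = factor(sample(c("Month-to-month", "One year", "Two year"), 200, replace = TRUE)),
  churned        = rbinom(200, 1, 0.3)
)

# Split into training and testing sets
train_idx <- sample(seq_len(nrow(churn_data)), size = 0.7 * nrow(churn_data))
train_data <- churn_data[train_idx, ]
test_data  <- churn_data[-train_idx, ]

# Train a logistic regression model and predict churn probabilities on the test set
churn_model <- glm(churned ~ tenure_months + monthly_charge + contract,
                   data = train_data, family = binomial)
test_probs <- predict(churn_model, newdata = test_data, type = "response")

# Simple evaluation: confusion matrix at a 0.5 probability threshold
predicted_class <- ifelse(test_probs > 0.5, 1, 0)
table(Predicted = predicted_class, Actual = test_data$churned)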
6. Model Evaluation:

Evaluate model performance using metrics such as accuracy, precision, recall, F1-score, and ROC AUC.
Create a confusion matrix and visualize it.
Consider plotting the ROC curve and Precision-Recall curve.

7. Interpretation:

Interpret the model results to understand which features are most influential in predicting churn.
Identify actionable insights that the telecom company can use to reduce customer churn.

8. Report and Presentation:

Create a report or presentation summarizing the project, including data preprocessing, EDA, modeling,
and results.
Clearly explain the methodology and key findings.
Present the predictive model's performance and its implications for the telecom company.

9. Future Recommendations:

Provide recommendations for the company based on the analysis.


Suggest strategies to reduce customer churn.

10. Code and Documentation:

Ensure that your code is well-documented, organized, and readable.


Share your code and documentation in a format that can be easily understood and used by others.
