
R Programming For Data Science

1. Consider a dataset customer_data with columns customer_id, purchase_date,
product, and amount, and perform the following tasks:
i. Parse the purchase_date column into Date format.
ii. Create a new column year_month combining year and month from purchase_date.
iii. Identify and remove duplicate transactions (same customer_id, purchase_date, product).
iv. Calculate the total amount spent by each customer in each year_month.

1. Parse the purchase_date column into Date format:

# as.Date() assumes dates like "YYYY-MM-DD"; pass a format string otherwise
customer_data$purchase_date <- as.Date(customer_data$purchase_date)

2. Create a new column year_month combining year and month from purchase_date:

customer_data$year_month <- format(customer_data$purchase_date, "%Y-%m")

3. Identify and remove duplicate transactions (same customer_id, purchase_date, product):

customer_data <- customer_data[!duplicated(customer_data[,
  c("customer_id", "purchase_date", "product")]), ]

4. Calculate the total amount spent by each customer in each year_month:

library(dplyr)

total_amount <- customer_data %>%
  group_by(customer_id, year_month) %>%
  summarise(total_spent = sum(amount))

These steps will help you clean the data and perform the necessary aggregations.

2. Consider the sales dataset (assuming it has columns date, product, and
revenue), perform the following tasks:
i. Convert the date column to a Date type and extract year and month.
ii. Calculate total revenue per product per month.
iii. Visualize the trend of total revenue for each product over time.

1. Convert the date column to Date type and extract year and month:

sales$date <- as.Date(sales$date)
sales$year_month <- format(sales$date, "%Y-%m")


2. Calculate total revenue per product per month:

library(dplyr)

monthly_revenue <- sales %>%
  group_by(product, year_month) %>%
  summarise(total_revenue = sum(revenue))

3. Visualize the trend of total revenue for each product over time:

library(ggplot2)

ggplot(monthly_revenue, aes(x = year_month, y = total_revenue,
                            color = product, group = product)) +
  geom_line() +
  labs(title = "Total Revenue Trend by Product",
       x = "Year-Month",
       y = "Total Revenue") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

These steps will allow you to process the sales data, calculate the monthly revenue for each product,
and visualize the revenue trends over time.

3. Implement a program to perform the following tasks.

a. You have two datasets: one with customer information and another with their purchase
history. How would you use dplyr to perform an inner join and create a new dataset that
combines information from both datasets based on a common key (e.g., customer ID)?

b. In a data frame containing information about employees, you want to filter and retain
only those employees who belong to a certain department. How would you use R to filter rows
based on a specific condition?

c. You've created a basic plot, but you want to customize the axis labels and add a title to
make the plot more informative. How would you use ggplot2 functions to customize the axis
labels and title of your plot?

d. You have a dataset in a CSV file with missing values and want to read it into R. How
would you use R to read the CSV file, handle missing values, and perform some basic
manipulations such as filtering and summarizing the data?

Here's how to implement each task in R:

a. Use dplyr to perform an inner join and create a new dataset


Assuming the datasets are customer_info and purchase_history with a common key
customer_id:

library(dplyr)

# Perform an inner join on the common customer_id key
combined_data <- inner_join(customer_info, purchase_history,
                            by = "customer_id")

b. Filter and retain only those employees who belong to a certain department

Assuming the dataset is employee_data and the department you are interested in is "Sales":

# Filter employees belonging to the Sales department
sales_employees <- employee_data %>%
  filter(department == "Sales")

c. Customize the axis labels and add a title using ggplot2

Assuming you have a basic plot p:

library(ggplot2)

# Create a basic plot
p <- ggplot(data, aes(x = x_variable, y = y_variable)) +
  geom_point()

# Customize the axis labels and add a title
p <- p +
  labs(title = "My Customized Plot Title",
       x = "X Axis Label",
       y = "Y Axis Label")

d. Read a CSV file with missing values and perform basic manipulations

Assuming the CSV file is named "data.csv":

# Read the CSV file, treating empty strings and "NA" as missing
data <- read.csv("data.csv", na.strings = c("", "NA"))

# Handle missing values (e.g., remove rows with any missing values)
data_clean <- na.omit(data)

# Perform basic manipulations: filtering and summarizing
# Example: Filter rows where the column 'age' is greater than 30
filtered_data <- data_clean %>%
  filter(age > 30)

# Example: Summarize data by calculating the mean age
summary_data <- data_clean %>%
  summarise(mean_age = mean(age, na.rm = TRUE))

These steps will help you perform the necessary data manipulations and visualizations using R.

4. Implement the multiple regression model for the given dataset:


a. Install and load necessary packages (caret, e1071)
b. Load mtcars dataset
c. Explore the dataset
d. Split the data into training (70%) and testing (30%) sets
e. Build a regression model to predict miles per gallon (mpg) based on multiple features such
as weight (wt), horsepower (hp) and the number of cylinders (cyl).
f. Make predictions on the test set
g. Evaluate the model.
h. Print the evaluation metrics

a. Install and load necessary packages (caret, e1071)

install.packages("caret")
install.packages("e1071")

library(caret)
library(e1071)

b. Load mtcars dataset

data(mtcars)

c. Explore the dataset

# Display the structure of the dataset
str(mtcars)

# Summary of the dataset
summary(mtcars)

d. Split the data into training (70%) and testing (30%) sets

set.seed(123)  # Setting seed for reproducibility

split_index <- createDataPartition(mtcars$mpg, p = 0.7, list = FALSE)
train_data <- mtcars[split_index, ]
test_data <- mtcars[-split_index, ]

e. Build a regression model to predict miles per gallon (mpg) based on weight
(wt), horsepower (hp), and the number of cylinders (cyl)

# Build the model
model <- lm(mpg ~ wt + hp + cyl, data = train_data)

f. Make predictions on the test set

predictions <- predict(model, newdata = test_data)

g. Evaluate the model

# Calculate evaluation metrics (caret's RMSE() and R2() helpers)
results <- data.frame(
  RMSE = RMSE(predictions, test_data$mpg),
  R2 = R2(predictions, test_data$mpg)
)

h. Print the evaluation metrics

print(results)

By following these steps, you can build, train, and evaluate a multiple regression model to predict
mpg based on wt, hp, and cyl in the mtcars dataset.

Important Questions (10 marks)


1. What is data analysis, and why is it important? Discuss the key steps
involved in the data analysis process.

A. What is Data Analysis and Why is it Important?


Data analysis is the process of examining, cleaning, transforming, and modeling data to extract
useful information, draw conclusions, and support decision-making.

Importance:

1. Informed Decisions: Helps make better decisions with evidence.
2. Trends and Patterns: Reveals important trends and relationships.
3. Efficiency: Optimizes operations and reduces costs.
4. Prediction: Forecasts future trends and behaviors.
5. Quality Improvement: Identifies and corrects data errors.
6. Competitive Edge: Provides insights that offer a competitive advantage.

Key Steps in the Data Analysis Process

1. Define Objectives: Know what questions you need to answer.
2. Collect Data: Gather relevant data from various sources.
3. Clean Data: Correct errors and handle missing values.
4. Explore Data: Use summaries and visualizations to understand the data.
5. Transform Data: Prepare data for analysis by normalizing or aggregating it.
6. Analyze Data: Apply statistical or machine learning models.
7. Interpret Results: Draw conclusions and make decisions based on the analysis.
2. Describe how to create a basic scatter plot using ggplot2. What are the
key components of the plot, and how do you customize it?
Creating a Basic Scatter Plot using ggplot2

Install and Load ggplot2 Package:


install.packages("ggplot2")

library(ggplot2)

Create a Basic Scatter Plot:

data <- data.frame(x = rnorm(100), y = rnorm(100))

ggplot(data, aes(x = x, y = y)) +
  geom_point()

Key Components of the Plot

Data Frame: Contains the data you want to plot.
ggplot() Function: Sets up the plot with your data and aesthetic mappings.
geom_point(): Adds the scatter plot points.

Customizing the Plot

Add Titles and Axis Labels:

ggplot(data, aes(x = x, y = y)) +
  geom_point() +
  labs(title = "Scatter Plot", x = "X Axis", y = "Y Axis")

Change Point Color and Size:

ggplot(data, aes(x = x, y = y)) +
  geom_point(color = "blue", size = 3)

Change Theme:

ggplot(data, aes(x = x, y = y)) +
  geom_point() +
  theme_minimal()

These steps will help you create and customize a scatter plot using ggplot2 in R.

3. How can you identify anomalies in numerical data? Discuss methods
such as visual inspection and statistical techniques.

A. Identifying Anomalies in Numerical Data

1. Visual Inspection:
◦ Scatter Plots: Plot data points to see if any stand out from the general pattern.
◦ Box Plots: Show data distribution and highlight outliers as points outside the
whiskers.
2. Statistical Techniques:

◦ Z-Score: Calculate how many standard deviations a data point is from the mean.
Points with high Z-scores (e.g., > 3 or < -3) are potential anomalies.
◦ IQR (Interquartile Range): Calculate the range between the 1st and 3rd quartiles.
Points outside 1.5 times the IQR above the 3rd quartile or below the 1st quartile are
potential anomalies.
◦ Histogram: Plot the frequency of data values to spot unusual peaks or gaps.
These methods help in detecting unusual or unexpected values in numerical data.
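
As a rough illustration of the Z-score and IQR checks above, here is a base-R sketch (the vector x and the cutoffs 3 and 1.5 are illustrative conventions, not fixed rules):

# Sample data with one obvious outlier
x <- c(rnorm(100), 25)

# Z-score: flag points more than 3 standard deviations from the mean
z <- (x - mean(x)) / sd(x)
x[abs(z) > 3]

# IQR: flag points beyond 1.5 * IQR outside the quartiles
q <- quantile(x, c(0.25, 0.75))
iqr <- q[2] - q[1]
x[x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr]

# Visual inspection: a box plot marks the same IQR-based outliers
boxplot(x)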

4. Explain the concept of cross-validation in regression analysis. How does
it help in assessing the performance of a regression model?

A. Concept of Cross-Validation in Regression Analysis

Cross-validation is a technique used to evaluate how well a regression model performs by testing it
on different subsets of the data.

How It Works:

1. Split Data: Divide the data into several parts (folds). For example, in 5-fold cross-
validation, the data is split into 5 parts.
2. Train and Test: Train the model on some folds and test it on the remaining fold. Repeat this
process for each fold.
3. Average Results: Calculate the performance (like accuracy or error) for each fold and then
average the results.
How It Helps:

1. Better Performance Estimate: It provides a more accurate assessment of the model's
performance by testing it on different data subsets.
2. Reduces Overfitting: Helps in understanding how the model generalizes to new data,
reducing the risk of overfitting to the training data.
3. Improves Model Selection: Helps in choosing the best model by comparing performance
across different folds.
Cross-validation helps ensure that the regression model performs well and generalizes effectively to
unseen data.
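
A minimal sketch of 5-fold cross-validation with the caret package, reusing the mtcars regression from the programs above as an example (train() refits the model on each fold and averages the fold errors):

library(caret)

# Define 5-fold cross-validation
ctrl <- trainControl(method = "cv", number = 5)

# Train a linear regression model; each fold is held out once for testing
set.seed(123)
cv_model <- train(mpg ~ wt + hp + cyl, data = mtcars,
                  method = "lm", trControl = ctrl)

# RMSE and R-squared averaged across the folds
print(cv_model$results)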

5. Describe the K-Nearest Neighbors (K-NN) algorithm. How does it
classify new data points based on the K-NN method?

A. K-Nearest Neighbors (K-NN) Algorithm

K-Nearest Neighbors (K-NN) is a simple algorithm used for classification and regression.

How It Works:
1. Choose K: Select the number of nearest neighbors (K) to consider.
2. Find Neighbors: For a new data point, find the K closest data points from the training set
based on distance (e.g., Euclidean distance).
3. Classify or Predict:
◦ Classification: Count the most common class among the K nearest neighbors and
assign that class to the new data point.
◦ Regression: Average the values of the K nearest neighbors and use this average as
the prediction for the new data point.
Example:

If K=3 and you want to classify a new point, you:

• Find the 3 nearest points to the new point.
• If 2 of the nearest points belong to Class A and 1 belongs to Class B, the new point is
classified as Class A.

K-NN is easy to understand and implement but can be computationally intensive for large datasets.
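
A short sketch using knn() from the class package on the built-in iris data (the 70/30 split and k = 3 are illustrative choices):

library(class)

set.seed(123)
train_idx <- sample(nrow(iris), 0.7 * nrow(iris))

# Classify each test flower by majority vote among its 3 nearest neighbors
pred <- knn(train = iris[train_idx, 1:4],
            test = iris[-train_idx, 1:4],
            cl = iris$Species[train_idx],
            k = 3)

# Compare predictions with the actual species
table(pred, iris$Species[-train_idx])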

6. Explain the basic syntax and structure of R. How do you define
variables, create vectors, and perform basic operations in R?

A. Basic Syntax and Structure of R


R is a programming language used for data analysis. Here’s a simple overview:

Defining Variables

To define a variable in R, use the <- operator:

x <- 5 # Assigns the value 5 to the variable x

Creating Vectors

Vectors are a basic data structure in R. Use the c() function to create them:

numbers <- c(1, 2, 3, 4, 5)  # Creates a vector with values 1 to 5

Performing Basic Operations

1. Arithmetic Operations:

sum <- 3 + 2 # Addition

product <- 4 * 5 # Multiplication

2. Vector Operations:
result <- numbers * 2  # Multiplies each element in the vector by 2

3. Access Elements:

first_element <- numbers[1]  # Gets the first element of the vector

R uses a combination of functions and operators to perform operations and handle data.

7. What initial steps should you take to explore a new dataset? Discuss
techniques for understanding the structure, summary statistics, and data
types.

A. Initial Steps to Explore a New Dataset


1. Check the Structure:
◦ Use str(data) to see the data types and structure of each column.
2. View the Data:
◦ Use head(data) to look at the first few rows of the dataset.
3. Summary Statistics:
◦ Use summary(data) to get basic statistics like mean, median, and range for
each column.
4. Check Data Types:
◦ Use sapply(data, class) to see the data types of each column (e.g.,
numeric, factor, character).
These steps help you understand the dataset’s layout, the types of data it contains, and basic
statistics about its contents.
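
Put together on a built-in dataset (mtcars here, purely as an example):

data(mtcars)

str(mtcars)            # Structure and data types of each column
head(mtcars)           # First six rows
summary(mtcars)        # Min, quartiles, mean, and max per column
sapply(mtcars, class)  # Class of each column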

8. What are the key assumptions of linear regression? Explain how you can
check for these assumptions using diagnostic plots and statistical tests.

A. Key Assumptions of Linear Regression


1. Linearity: The relationship between the predictors and the response variable is linear.
◦ Check: Use a scatter plot of residuals vs. fitted values to see if there is any clear
pattern.
2. Independence: Observations are independent of each other.
◦ Check: Use the Durbin-Watson test to test for autocorrelation in residuals.
3. Homoscedasticity: Residuals (errors) have constant variance.
◦ Check: Use a plot of residuals vs. fitted values to ensure residuals are evenly
spread.
4. Normality of Residuals: Residuals are normally distributed.
◦ Check: Use a Q-Q plot (quantile-quantile plot) to see if residuals follow a straight
line.
◦ Check: Use the Shapiro-Wilk test to formally test for normality.
5. No Multicollinearity: Predictors are not too highly correlated with each other.
◦ Check: Use the Variance Inflation Factor (VIF) to check for multicollinearity.

These checks help ensure that your linear regression model is valid and reliable.
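
A sketch of these diagnostic checks in R, assuming an lm() fit like the one below (durbinWatsonTest() and vif() come from the car package):

library(car)  # for durbinWatsonTest() and vif()

model <- lm(mpg ~ wt + hp + cyl, data = mtcars)

# Linearity and homoscedasticity: residuals vs. fitted values
plot(model, which = 1)

# Independence: Durbin-Watson test for autocorrelation in residuals
durbinWatsonTest(model)

# Normality of residuals: Q-Q plot and Shapiro-Wilk test
plot(model, which = 2)
shapiro.test(residuals(model))

# Multicollinearity: variance inflation factors (values above about 5-10 are a warning sign)
vif(model)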

9. What is regression analysis, and what is its purpose? Explain the
difference between simple and multiple regression.

A. Regression Analysis
Regression analysis is a statistical method used to understand the relationship between a dependent
variable (what you're trying to predict) and one or more independent variables (predictors).

Purpose:

• To predict values of the dependent variable.
• To identify and quantify relationships between variables.
• To make informed decisions based on data.
Difference Between Simple and Multiple Regression

• Simple Regression:

◦ Description: Examines the relationship between one dependent variable and one
independent variable.
◦ Example: Predicting house prices based on square footage.
• Multiple Regression:

◦ Description: Examines the relationship between one dependent variable and two or
more independent variables.
◦ Example: Predicting house prices based on square footage, number of bedrooms,
and location.
Simple regression uses one predictor, while multiple regression uses several predictors.
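
In R, the difference is only the number of terms in the model formula; here mtcars stands in for the housing example, since it ships with R:

# Simple regression: one predictor
simple_model <- lm(mpg ~ wt, data = mtcars)

# Multiple regression: several predictors
multiple_model <- lm(mpg ~ wt + hp + cyl, data = mtcars)

summary(simple_model)
summary(multiple_model)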

10. Explain how logistic regression works for binary classification. How are
the predicted probabilities interpreted?

A. Logistic Regression for Binary Classification

Logistic regression is used to predict the probability of a binary outcome (e.g., yes/no, 0/1).

How It Works:

1. Model Formula: It estimates the relationship between the dependent variable and one or more
independent variables using the logistic function.

2. Logistic Function: The logistic function converts the linear combination of predictors into a
probability between 0 and 1: p = 1 / (1 + exp(-(b0 + b1*x1 + ... + bk*xk))).
Interpreting Predicted Probabilities:

- Probability (p): The output of logistic regression is the probability that the outcome is 1. For
example, a probability of 0.8 means there is an 80% chance of the outcome being 1.

- Classification: To classify, you can set a threshold (e.g., 0.5). If p is greater than 0.5, the
prediction is 1; otherwise, it is 0.

Logistic regression provides probabilities that help in understanding the likelihood of
different outcomes.
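
A brief sketch with glm() on mtcars, predicting the binary am column (transmission type) purely as an example:

# Fit a logistic regression: am (0 = automatic, 1 = manual) vs. weight and horsepower
logit_model <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# Predicted probabilities that am = 1
probs <- predict(logit_model, type = "response")

# Classify with a 0.5 threshold
predicted_class <- ifelse(probs > 0.5, 1, 0)

table(predicted_class, mtcars$am)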
