R Programming
These steps will help you clean the data and perform the necessary aggregations.
2. Consider the sales dataset (assuming it has columns date, product, and
revenue) and perform the following tasks:
i. Convert the date column to a Date type and extract year and month.
ii. Calculate total revenue per product per month.
iii. Visualize the trend of total revenue for each product over time.
1. Convert the date column to Date type and extract year and month:
3. Visualize the trend of total revenue for each product over time:
library(ggplot2)
These steps will allow you to process the sales data, calculate the monthly revenue for each product,
and visualize the revenue trends over time.
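The three steps above can be sketched as follows. This is a minimal example under stated assumptions: the `sales` data frame is made up for illustration, the dates are assumed to be in YYYY-MM-DD format, and base R's `aggregate()` is used for the monthly totals (a dplyr `group_by()`/`summarise()` pipeline would work equally well). The plot is only built if ggplot2 is installed.

```r
# Hypothetical sales data; only the column names come from the question.
sales <- data.frame(
  date    = c("2023-01-05", "2023-01-20", "2023-02-11",
              "2023-01-09", "2023-02-15", "2023-02-27"),
  product = c("A", "A", "A", "B", "B", "B"),
  revenue = c(100, 150, 200, 80, 120, 60)
)

# 1. Convert the date column to Date type and extract year and month
sales$date  <- as.Date(sales$date, format = "%Y-%m-%d")
sales$year  <- as.integer(format(sales$date, "%Y"))
sales$month <- as.integer(format(sales$date, "%m"))

# 2. Calculate total revenue per product per month
monthly <- aggregate(revenue ~ product + year + month, data = sales, FUN = sum)
print(monthly)

# 3. Visualize the trend of total revenue for each product over time
# Build a first-of-month date so the x axis is ordered correctly.
monthly$period <- as.Date(sprintf("%d-%02d-01",
                                  as.integer(monthly$year),
                                  as.integer(monthly$month)))
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(monthly,
                       ggplot2::aes(x = period, y = revenue, colour = product)) +
    ggplot2::geom_line() +
    ggplot2::geom_point() +
    ggplot2::labs(title = "Monthly Revenue by Product",
                  x = "Month", y = "Total Revenue")
  if (interactive()) print(p)  # draw the chart in an interactive session
}
```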
b. In a data frame containing information about employees, you want to filter and retain
only those employees who belong to a certain department. How would you use R to filter rows
based on a specific condition?
c. You've created a basic plot, but you want to customize the axis labels and add a title to
make the plot more informative. How would you use ggplot2 functions to customize the axis
labels and title of your plot?
d. You have a dataset in a CSV file with missing values and want to read it into R. How
would you use R to read the CSV file, handle missing values, and perform some basic
manipulations such as filtering and summarizing the data?
library(dplyr)
b. Filter and retain only those employees who belong to a certain department
Assuming the dataset is employee_data and the department you are interested in is "Sales":
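A minimal sketch of the filtering step, assuming `employee_data` has a `department` column. The rows below are invented for illustration. The base R form needs no packages; the dplyr form is shown as an alternative and runs only if dplyr is installed.

```r
# Hypothetical employee data; the column names are assumptions.
employee_data <- data.frame(
  name       = c("Asha", "Ben", "Chen", "Dina"),
  department = c("Sales", "HR", "Sales", "IT"),
  salary     = c(52000, 48000, 61000, 57000)
)

# Base R: keep only the rows where the condition is TRUE
sales_employees <- employee_data[employee_data$department == "Sales", ]
print(sales_employees)

# dplyr alternative: filter() keeps rows that match the condition
if (requireNamespace("dplyr", quietly = TRUE)) {
  sales_dplyr <- dplyr::filter(employee_data, department == "Sales")
  print(sales_dplyr)
}
```

Several conditions can be combined with `&` (and) or `|` (or), for example `department == "Sales" & salary > 55000`.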
library(ggplot2)
# Create a basic plot
p <- ggplot(data, aes(x = x_variable, y = y_variable)) +
  geom_point()
# Customize the axis labels and add a title
p <- p +
  labs(title = "My Customized Plot Title",
       x = "X Axis Label",
       y = "Y Axis Label")
These steps will help you perform the necessary data manipulations and visualizations using R.
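For part d (reading a CSV file with missing values), one possible approach is sketched below. The file contents and column names are hypothetical; the example writes a small temporary CSV first so that it can run on its own. Dropping incomplete rows and filling with the column mean are two common options, not the only ones.

```r
# Write a small hypothetical CSV so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "name,department,salary",
  "Asha,Sales,52000",
  "Ben,HR,NA",
  "Chen,Sales,",
  "Dina,IT,57000"
), csv_path)

# Read the file, treating empty fields and the text "NA" as missing
raw <- read.csv(csv_path, na.strings = c("", "NA"), stringsAsFactors = FALSE)

# Count missing values per column
print(colSums(is.na(raw)))

# Option 1: drop rows that contain any missing value
complete <- na.omit(raw)

# Option 2: replace missing salaries with the mean of the observed ones
imputed <- raw
imputed$salary[is.na(imputed$salary)] <- mean(raw$salary, na.rm = TRUE)

# Basic manipulation: filter one department, then summarize by department
sales_only  <- imputed[imputed$department == "Sales", ]
avg_by_dept <- aggregate(salary ~ department, data = imputed, FUN = mean)
print(avg_by_dept)
```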
install.packages("caret")
install.packages("e1071")
library(caret)
library(e1071)
data(mtcars)
d. Split the data into training (70%) and testing (30%) sets
e. Build a regression model to predict miles per gallon (mpg) based on weight
(wt), horsepower (hp), and the number of cylinders (cyl)
# Build the model
model <- lm(mpg ~ wt + hp + cyl, data = train_data)
f. Make predictions on the test set
print(results)
By following these steps, you can build, train, and evaluate a multiple regression model to predict
mpg based on wt, hp, and cyl in the mtcars dataset.
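The split, fit, and predict steps can be put together as in the sketch below. It uses base R's `sample()` for the 70/30 split instead of a caret helper, so it runs without extra packages; the seed value, the RMSE metric, and the layout of `results` are assumptions made for illustration.

```r
data(mtcars)

# Split the data into training (70%) and testing (30%) sets
set.seed(123)  # arbitrary seed so the split is reproducible
n          <- nrow(mtcars)
train_idx  <- sample(seq_len(n), size = floor(0.7 * n))
train_data <- mtcars[train_idx, ]
test_data  <- mtcars[-train_idx, ]

# Build the model
model <- lm(mpg ~ wt + hp + cyl, data = train_data)

# Make predictions on the test set
predictions <- predict(model, newdata = test_data)

# Evaluate: root mean squared error on the held-out rows
rmse <- sqrt(mean((test_data$mpg - predictions)^2))

# Compare actual and predicted values side by side
results <- data.frame(actual = test_data$mpg, predicted = predictions)
print(results)
print(rmse)
```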
Importance:
library(ggplot2)
ggplot(data, aes(x = x, y = y)) +
  geom_point() +
  labs(title = "Scatter Plot", x = "X Axis", y = "Y Axis")
Change Point Color and Size:
ggplot(data, aes(x = x, y = y)) +
  geom_point(color = "blue", size = 3) +  # example values; any color and size work
  theme_minimal()
These steps will help you create and customize a scatter plot using ggplot2 in R.
1. Visual Inspection:
◦ Scatter Plots: Plot data points to see if any stand out from the general pattern.
◦ Box Plots: Show data distribution and highlight outliers as points outside the
whiskers.
2. Statistical Techniques:
◦ Z-Score: Calculate how many standard deviations a data point is from the mean.
Points with high Z-scores (e.g., > 3 or < -3) are potential anomalies.
◦ IQR (Interquartile Range): Calculate the range between the 1st and 3rd quartiles.
Points outside 1.5 times the IQR above the 3rd quartile or below the 1st quartile are
potential anomalies.
◦ Histogram: Plot the frequency of data values to spot unusual peaks or gaps.
These methods help in detecting unusual or unexpected values in numerical data.
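The Z-score and IQR rules above can be sketched in base R as follows. The `values` vector is made up for illustration, with one deliberately extreme point; the thresholds (3 for Z-scores, 1.5 times the IQR) are the ones mentioned in the text.

```r
# Hypothetical measurements with one obvious anomaly (95)
values <- c(10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
            12, 10, 11, 13, 12, 11, 12, 10, 13, 11, 95)

# Z-score: number of standard deviations from the mean
z <- (values - mean(values)) / sd(values)
z_outliers <- values[abs(z) > 3]

# IQR rule: flag points beyond 1.5 * IQR from the quartiles
q1    <- quantile(values, 0.25)
q3    <- quantile(values, 0.75)
iqr   <- q3 - q1
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr
iqr_outliers <- values[values < lower | values > upper]

print(z_outliers)
print(iqr_outliers)
```

Note that a single extreme value inflates both the mean and the standard deviation, so in small samples the Z-score rule can miss outliers that the IQR rule catches.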
Cross-validation is a technique used to evaluate how well a regression model performs by testing it
on different subsets of the data.
How It Works:
1. Split Data: Divide the data into several parts (folds). For example, in 5-fold cross-
validation, the data is split into 5 parts.
2. Train and Test: Train the model on some folds and test it on the remaining fold. Repeat this
process for each fold.
3. Average Results: Calculate a performance metric (such as RMSE or MAE) for each fold and then
average the results.
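The three steps above can be written out by hand in base R, as in the sketch below. The dataset (`mtcars`), the model formula, the seed, and the choice of RMSE as the metric are assumptions for illustration; packages such as caret automate the same loop.

```r
data(mtcars)
set.seed(42)  # arbitrary seed so the fold assignment is reproducible

# 1. Split Data: assign each row to one of k folds at random
k     <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

# 2. Train and Test: hold out each fold once
rmse_per_fold <- numeric(k)
for (i in 1:k) {
  train <- mtcars[folds != i, ]
  test  <- mtcars[folds == i, ]
  fit   <- lm(mpg ~ wt + hp, data = train)
  pred  <- predict(fit, newdata = test)
  rmse_per_fold[i] <- sqrt(mean((test$mpg - pred)^2))
}

# 3. Average Results: one overall estimate of out-of-sample error
cv_rmse <- mean(rmse_per_fold)
print(rmse_per_fold)
print(cv_rmse)
```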
How It Helps:
K-Nearest Neighbors (K-NN) is a simple algorithm used for classification and regression.
How It Works:
1. Choose K: Select the number of nearest neighbors (K) to consider.
2. Find Neighbors: For a new data point, find the K closest data points from the training set
based on distance (e.g., Euclidean distance).
3. Classify or Predict:
◦ Classification: Count the most common class among the K nearest neighbors and
assign that class to the new data point.
◦ Regression: Average the values of the K nearest neighbors and use this average as
the prediction for the new data point.
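The steps above can be sketched as a small base R function. `knn_predict` is a hypothetical helper written for this illustration, not a library function; in practice a package such as class provides a tested implementation. The sketch uses Euclidean distance and does no feature scaling.

```r
# Hypothetical helper: predict for one new point using its k nearest neighbors.
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  # Euclidean distance from the new point to every training row
  diffs   <- sweep(as.matrix(train_x), 2, as.numeric(new_x))
  dists   <- sqrt(rowSums(diffs^2))
  # Labels or values of the k closest training rows
  nearest    <- order(dists)[seq_len(k)]
  neighbours <- train_y[nearest]
  if (is.numeric(neighbours)) {
    mean(neighbours)                      # regression: average the values
  } else {
    names(which.max(table(neighbours)))   # classification: majority vote
  }
}

# Classification on the built-in iris data
new_flower <- c(5.0, 3.4, 1.5, 0.2)
predicted_species <- knn_predict(iris[, 1:4], iris$Species, new_flower, k = 3)
print(predicted_species)

# Regression on a tiny made-up dataset
toy_x <- data.frame(x = c(1, 2, 3, 10, 11))
toy_y <- c(1, 2, 3, 10, 11)
predicted_value <- knn_predict(toy_x, toy_y, new_x = 2, k = 3)
print(predicted_value)
```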
Example:
Defining Variables
Creating Vectors
Vectors are a basic data structure in R. Use the c() function to create them:
1. Arithmetic Operations:
2. Vector Operations:
result <- numbers * 2  # Multiplies each element in the vector by 2
3. Access Elements:
first_element <- numbers[1]  # Gets the first element of the vector
R uses a combination of functions and operators to perform operations and handle data.
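A short sketch that ties these pieces together is given below. The variable names and values (`x`, `language`, `numbers`, and so on) are chosen for illustration only.

```r
# Defining variables with the assignment operator <-
x        <- 10
language <- "R"
is_ready <- TRUE

# Creating a vector with c()
numbers <- c(1, 2, 3, 4, 5)

# Arithmetic operations on single values
total    <- x + 5
quotient <- x / 4

# Vector operations are applied element by element
result <- numbers * 2

# Accessing elements (R indexing starts at 1)
first_element <- numbers[1]
last_element  <- numbers[length(numbers)]

print(result)
print(first_element)
```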
7. What initial steps should you take to explore a new dataset? Discuss
techniques for understanding the structure, summary statistics, and data
types.
1. Check the Structure:
◦ Use str(data) to see the data types and structure of each column.
2. View the Data:
◦ Use head(data) to look at the first few rows.
3. Summarize the Data:
◦ Use summary(data) to get basic statistics like mean, median, and range for
each column.
4. Check Data Types:
◦ Use sapply(data, class) to see the data types of each column (e.g.,
numeric, factor, character).
These steps help you understand the dataset’s layout, the types of data it contains, and basic
statistics about its contents.
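As an illustration, the same exploration calls can be run on the built-in `mtcars` dataset, which stands in here for a new dataset. The missing-value count at the end is an extra check that is often useful at this stage.

```r
data(mtcars)

# Structure: column names, types, and a preview of values
str(mtcars)

# Size and first rows
print(dim(mtcars))
print(head(mtcars))

# Summary statistics for each column
print(summary(mtcars))

# Data type of each column
col_types <- sapply(mtcars, class)
print(col_types)

# Number of missing values per column
missing_counts <- colSums(is.na(mtcars))
print(missing_counts)
```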
8. What are the key assumptions of linear regression? Explain how you can
check for these assumptions using diagnostic plots and statistical tests.
4. Normality of Residuals: Residuals are approximately normally distributed.
◦ Check: Use a Q-Q plot (quantile-quantile plot) to see if the residuals fall along a straight
line.
◦ Check: Use the Shapiro-Wilk test to formally test for normality.
5. No Multicollinearity: Predictors are not too highly correlated with each other.
◦ Check: Use Variance Inflation Factor (VIF) to check for multicollinearity.
These checks help ensure that your linear regression model is valid and reliable.
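The normality and multicollinearity checks mentioned above can be sketched in base R as follows. The model on `mtcars` is an assumption for illustration. VIF is computed by hand here as 1 / (1 - R²), where R² comes from regressing each predictor on the others; the car package offers a `vif()` function that does the same job.

```r
data(mtcars)
model <- lm(mpg ~ wt + hp + cyl, data = mtcars)
res   <- residuals(model)

# Q-Q plot of residuals (drawn only in an interactive session)
if (interactive()) {
  qqnorm(res)
  qqline(res)
}

# Shapiro-Wilk test: a small p-value suggests the residuals are not normal
sw <- shapiro.test(res)
print(sw)

# Variance Inflation Factor for each predictor, computed manually
predictors <- c("wt", "hp", "cyl")
vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = mtcars))$r.squared
  1 / (1 - r2)
})
print(round(vif, 2))
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic multicollinearity, though the cutoff is a convention rather than a strict test.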
A. Regression Analysis
Regression analysis is a statistical method used to understand the relationship between a dependent
variable (what you're trying to predict) and one or more independent variables (predictors).
Purpose:
• Simple Regression:
◦ Description: Examines the relationship between one dependent variable and one
independent variable.
◦ Example: Predicting house prices based on square footage.
• Multiple Regression:
◦ Description: Examines the relationship between one dependent variable and two or
more independent variables.
◦ Example: Predicting house prices based on square footage, number of bedrooms,
and location.
Simple regression uses one predictor, while multiple regression uses several predictors.
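The difference can be seen in a short sketch with `lm()`. The built-in `mtcars` data is used here in place of the house price example, since no housing dataset is given; the choice of predictors is an assumption for illustration.

```r
data(mtcars)

# Simple regression: one predictor
simple_model <- lm(mpg ~ wt, data = mtcars)

# Multiple regression: several predictors
multiple_model <- lm(mpg ~ wt + hp + cyl, data = mtcars)

# Coefficients: one slope versus one slope per predictor
print(coef(simple_model))
print(coef(multiple_model))

# R-squared: share of variance in mpg explained by each model
r2_simple   <- summary(simple_model)$r.squared
r2_multiple <- summary(multiple_model)$r.squared
print(c(simple = r2_simple, multiple = r2_multiple))
```

R-squared never decreases when predictors are added, so adjusted R-squared or out-of-sample error is a fairer basis for comparing the two models.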
10. Explain how logistic regression works for binary classification. How are
the predicted probabilities interpreted?
A. Logistic Regression for Binary Classification
Logistic regression is used to predict the probability of a binary outcome (e.g., yes/no, 0/1).
How It Works:
1. Model Formula: It estimates the relationship between the dependent variable and one or more
independent variables using the logistic function.
2. Logistic Function: The logistic function converts the linear combination of predictors into a
probability between 0 and 1.
Interpreting Predicted Probabilities:
- Probability (p): The output of logistic regression is the probability that the outcome is 1. For
example, a probability of 0.8 means there is an 80% chance of the outcome being 1.
- Classification: To classify, you can set a threshold (e.g., 0.5). If p is greater than 0.5, the
prediction is 1; otherwise, it is 0.
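A minimal sketch of these ideas with `glm()` is shown below. The outcome (`am`, transmission type in `mtcars`, coded 0/1) and the single predictor (`wt`) are assumptions chosen for illustration. The sketch also applies the logistic function by hand to show that it reproduces the predicted probabilities.

```r
data(mtcars)

# Fit a logistic regression: probability that am is 1, given weight
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Predicted probabilities (between 0 and 1)
p <- predict(fit, type = "response")

# The same probabilities from the linear predictor and the logistic function
z        <- predict(fit, type = "link")
p_manual <- 1 / (1 + exp(-z))

# Classify with a 0.5 threshold
predicted_class <- ifelse(p > 0.5, 1, 0)

# Share of cars classified correctly on the training data
accuracy <- mean(predicted_class == mtcars$am)
print(head(round(p, 3)))
print(accuracy)
```

The 0.5 threshold is a default; a different cutoff may suit cases where false positives and false negatives carry different costs.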
Programming Classes