PushpendraLabFile
(A Govt. Aided UGC Autonomous & NAAC Accredited Institute Affiliated to RGPV, Bhopal)
Lab Report
on
R Programming
150701
Submitted By:
Mohd Asif Farhan
Satyam Khare
Pushpendra Rajput
0901CS201070
0901CS201092
0901CS201112
Faculty Mentor:
Prof. Khushboo Agrawal, Assistant Professor
Submitted to:
JULY-DEC 2023
INDEX
14. Apply supervised learning algorithm: Decision Trees and Random Forest.
20. Review and final project using a combination of the techniques learned.
Q1. Install R and RStudio, familiarize with the interface.
Installing R:
● Download R for your operating system from CRAN (https://cran.r-project.org) and run the installer, accepting the default options.
Installing RStudio:
● With R installed, download the free RStudio Desktop IDE from the Posit website (https://posit.co) and install it.
The RStudio Interface:
● Script Editor: This is where you can write and edit R scripts or code. You can create a new script
by clicking File > New File > R Script.
● Console: The console is where you can interact with R by running individual commands or
scripts. You can type R code directly into the console and press Enter to execute it.
● Environment: This pane displays information about the current R environment, including loaded
datasets and variables. You can view your data, objects, and their properties here.
● Plots: This pane is for displaying plots and charts generated from your R code. You can use it to
visualize data.
Working with R:
● Start by entering simple R commands in the console to get a feel for how it works. For example,
you can try basic arithmetic operations like 2 + 2 or create a vector using my_vector <- c(1, 2, 3).
● In the script editor, write some R code. For instance, you can create a script that prints "Hello, R!"
to the console.
# This is a comment
print("Hello, R!")
● To run the script, you can either select the code and click the "Run" button or use the shortcut
(e.g., Ctrl+Enter or Command+Enter).
Accessing Help:
● R and RStudio have extensive help resources. You can access documentation and packages
through the "Help" menu.
Q2. Practice basic operations, variable creation, and types of
operators.
a <- 5
b <- 3
sum_ab <- a + b
diff_ab <- a - b
product_ab <- a * b
quotient_ab <- a / b
print(sum_ab)
print(diff_ab)
print(product_ab)
print(quotient_ab)
Output:
[1] 8
[1] 2
[1] 15
[1] 1.666667
my_integer <- 42
my_character <- "Hello, R!"
my_logical <- TRUE
my_numeric_vector <- c(1.2, 2.3, 3.4, 4.5)
print(my_integer)
print(my_character)
print(my_logical)
print(my_numeric_vector)
Output:
[1] 42
[1] "Hello, R!"
[1] TRUE
[1] 1.2 2.3 3.4 4.5
Comparison and Logical Operators:
x <- 10
y <- 20
is_equal <- x == y
is_not_equal <- x != y
is_greater_than <- x > y
is_less_than <- x < y
logical_and <- is_equal && is_greater_than
logical_or <- is_not_equal || is_less_than
print(is_equal)
print(is_not_equal)
print(is_greater_than)
print(is_less_than)
print(logical_and)
print(logical_or)
Output:
[1] FALSE
[1] TRUE
[1] FALSE
[1] TRUE
[1] FALSE
[1] TRUE
Conditional Statements:
x <- 10
y <- 20
if (x < y) {
cat("x is less than y\n")
} else if (x > y) {
cat("x is greater than y\n")
} else {
cat("x is equal to y\n")
}
Output:
x is less than y
Functions:
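A minimal function sketch consistent with the output shown below; the name add_numbers and its arguments are assumptions.
# Define a simple function that adds two numbers
add_numbers <- function(a, b) {
  return(a + b)
}
# Call the function and print the result
result <- add_numbers(5, 10)
print(result)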
Output:
[1] 15
Q3. Create and manipulate different data structures.
Vectors:
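A minimal sketch of the vector creation and operations, consistent with the output shown below; my_vector is an assumed name.
# Create a numeric vector
my_vector <- c(1, 2, 3, 4, 5)
# Access the second element
second_element <- my_vector[2]
# Add 10 to every element
vector_sum <- my_vector + 10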
print(second_element)
print(vector_sum)
Output:
[1] 2
[1] 11 12 13 14 15
Lists:
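A minimal sketch of the list creation, consistent with the access steps and output shown below.
# Create a named list
my_list <- list(name = "John", age = 30, city = "New York")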
# Access elements
name_value <- my_list$name
# Modify an element
my_list$age <- 31
print(name_value)
print(my_list)
Output:
[1] "John"
$name
[1] "John"
$age
[1] 31
$city
[1] "New York"
Matrices:
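A minimal sketch of the matrix creation, consistent with the product shown in the output below.
# Create a 2 x 3 matrix filled row by row with the numbers 1 to 6
my_matrix <- matrix(1:6, nrow = 2, byrow = TRUE)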
# Matrix operations
matrix_sum <- my_matrix + 5
matrix_product <- my_matrix %*% t(my_matrix)
print(matrix_sum)
print(matrix_product)
Output:
[,1] [,2]
[1,] 14 32
[2,] 32 77
Data Frames:
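A minimal sketch of the data frame creation; the column values are illustrative assumptions consistent with the access and filter steps below.
# Create a data frame of people
my_df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 35),
  City = c("New York", "Chicago", "Boston")
)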
# Access columns
name_column <- my_df$Name
# Filter rows
chicago_residents <- my_df[my_df$City == "Chicago", ]
print(name_column)
print(chicago_residents)
Output:
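Arrays:
A minimal sketch of the array creation, consistent with the access code and output shown below; my_array is an assumed name.
# Create a 2 x 3 x 4 array filled with the numbers 1 to 24
my_array <- array(1:24, dim = c(2, 3, 4))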
# Access an element
first_element <- my_array[1, 1, 1]
print(first_element)
Output:
[1] 1
Q4. Install and load R packages.
Installing R Packages:
● Open RStudio.
● To install a package, you can use the install.packages() function. For example, to install the
ggplot2 package for data visualization:
● install.packages("ggplot2")
● R will prompt you to choose a CRAN mirror; select one that is geographically close to your
location.
● R will download and install the package and its dependencies. Once the installation is complete,
you can use the package in your R session.
Loading R Packages:
● To use a package in your R script or session, you need to load it using the library() or require()
function. For example, to load the ggplot2 package:
● library(ggplot2)
● You can also use require() instead of library(), which returns a logical value indicating whether
the package is available. If it's not, require() will not stop the execution of your script.
● Once a package is loaded, you can use its functions and capabilities in your R code.
● To check which packages are currently loaded in your R session, you can use the search()
function:
● search()
● This will display a list of loaded packages and their order of precedence.
● To unload a package and remove it from your current R session, you can use the detach()
function. For example, to detach the ggplot2 package:
● detach("package:ggplot2", unload = TRUE)
● Unloading a package can be useful if you want to free up memory or avoid conflicts between
package functions.
Q5. Perform data manipulation using dplyr.
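A minimal sketch of the most common dplyr verbs on a small example data frame; the data and column names are illustrative assumptions.
library(dplyr)
# Example data
df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(28, 22, 35),
  Score = c(90, 75, 88)
)
# filter(): keep rows where Age is greater than 25
filtered_df <- filter(df, Age > 25)
# mutate(): add a derived column
mutated_df <- mutate(df, Passed = Score >= 80)
# arrange(): sort rows by Age
arranged_df <- arrange(df, Age)
# summarise(): compute the average score
summary_df <- summarise(df, mean_score = mean(Score))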
# Separate the single column into Name, Age, and City columns
separated_df <- separate(df, col = Name, into = c("Name", "Age", "City"), sep = ", ")
Output:
Q8. Perform descriptive statistics and hypothesis testing.
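A minimal sketch of the data frame used in this section; the values are assumptions chosen to be consistent with the summary output shown below.
# Heights (cm) and weights (kg) of five people
df <- data.frame(
  Height = c(160, 168, 170, 175, 185),
  Weight = c(60, 70, 72, 75, 80)
)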
# Mean
mean_height <- mean(df$Height)
mean_weight <- mean(df$Weight)
# Median
median_height <- median(df$Height)
median_weight <- median(df$Weight)
# Variance
var_height <- var(df$Height)
var_weight <- var(df$Weight)
# Standard Deviation
sd_height <- sd(df$Height)
sd_weight <- sd(df$Weight)
# Range
range_height <- range(df$Height)
range_weight <- range(df$Weight)
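The table under Output: is the kind of result produced by calling summary() on the data frame:
# Overall summary of the data frame
summary(df)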
Output:
Height Weight
Min. :160.0 Min. :60.0
1st Qu.:168.0 1st Qu.:70.0
Median :170.0 Median :72.0
Mean :171.6 Mean :71.4
3rd Qu.:175.0 3rd Qu.:75.0
Max. :185.0 Max. :80.0
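For the hypothesis-testing part of the question, a minimal sketch of a one-sample t-test on the same Height column; the null value of 170 cm is an assumption.
# Test whether the mean height differs from 170 cm
t_test_result <- t.test(df$Height, mu = 170)
print(t_test_result)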
Output:
Export:
df <- data.frame(
Category = c("A", "B", "A", "C", "B"),
Value = c(25, 50, 30, 40, 55)
)
ID Value1 Value2
1 3 30 100
2 4 40 200
3 5 50 300
ID Value1 Value2
1 1 10 NA
2 2 20 NA
3 3 30 100
4 4 40 200
5 5 50 300
# A tibble: 10 x 3
ID variable value
<int> <chr> <dbl>
1 1 Var1 10
2 1 Var2 15
3 2 Var1 20
4 2 Var2 25
5 3 Var1 30
6 3 Var2 35
7 4 Var1 40
8 4 Var2 45
9 5 Var1 50
10 5 Var2 55
# A tibble: 5 x 3
ID Var1 Var2
<int> <dbl> <dbl>
1 1 10 15
2 2 20 25
3 3 30 35
4 4 40 45
5 5 50 55
Category Value
1 A 55
2 B 105
3 C 40
# A tibble: 3 x 2
Category Sum
<chr> <dbl>
1 A            55
2 B           105
3 C             40
Name Age
1 Alice 28
3 Bob 22
5 David 29
Name Age
1 Alice 28
2 Bob 22
3 David 29
ID Value NewVariable
1 1 10 20
2 2 20 40
3 3 30 60
4 4 40 80
5 5 50 100
You need to have the rmarkdown package installed and loaded to create R Markdown documents.
# Install and load the rmarkdown package
install.packages("rmarkdown")
library(rmarkdown)
In RStudio, you can create a new R Markdown document by going to File > New File > R Markdown....
This will open a dialog where you can configure your R Markdown document.
In the R Markdown document, you can include a mix of text, code chunks, and Markdown formatting.
Here's an example of a simple R Markdown document:
---
title: "Sample R Markdown Report"
output:
  html_document:
    toc: true
---

# Introduction

This is a sample R Markdown report. We'll include some code and plots in this report.

# Data Loading

```{r}
# Load data
data <- read.csv("data.csv")
head(data)
```

# Data Summary

```{r}
# Summary statistics
summary(data)
```

# Data Visualization

```{r}
# Create a scatter plot
library(ggplot2)
ggplot(data, aes(x = X, y = Y)) +
  geom_point()
```

# Conclusion
In this example:
- The `title` field in the YAML header sets the title of your report.
- Under the "Data Loading" section, there's an R code chunk that loads data from a CSV file.
- In the "Data Summary" section, another R code chunk provides summary statistics of the data.
- The "Data Visualization" section contains R code for creating a scatter plot using the `ggplot2` package.
- The report includes text sections along with code chunks that can be executed to generate results and
visualizations.
To generate the report, click the "Knit" button in RStudio, or call rmarkdown::render() on your R
Markdown file. This will run the code chunks and produce an HTML (or other chosen format) report.
Once the knitting process is complete, you can view the generated report. It will include your text, code
results, and plots, making it easy to communicate your data analysis in a comprehensive document. You
can save the HTML or other output formats and share them as needed.
R Markdown is a versatile tool for creating dynamic and reproducible reports, and you can customize
your documents to include various elements like tables, LaTeX equations, citations, and more. For
advanced formatting and customization, refer to the R Markdown documentation and cheat sheets.
Q12. Understand the concept of machine learning and its
types.
Machine learning is a subfield of artificial intelligence (AI) that focuses on the development of algorithms
and statistical models that enable computer systems to learn from and make predictions or decisions based
on data. The primary goal of machine learning is to develop models that can identify patterns, extract
insights, and make predictions or decisions without being explicitly programmed.
Data: Machine learning relies on data, which serves as the primary source of information for training,
testing, and validating models. Data can come in various forms, including structured (e.g., tables),
unstructured (e.g., text or images), and semi-structured (e.g., JSON or XML).
Features: Features are characteristics or attributes extracted from the data that the model uses for
learning. Features play a crucial role in model performance.
Labels: In supervised learning, models are trained with labeled data, where the correct output or category
is known. The model learns to map inputs (features) to corresponding outputs (labels).
Training: The training phase involves feeding the model with a dataset and adjusting its internal
parameters to minimize the difference between its predictions and the actual labels.
Testing and Validation: After training, the model is tested or validated using a separate dataset to
evaluate its performance and generalization to new, unseen data.
Predictions and Decisions: Once trained, a machine learning model can make predictions on new,
unlabeled data or make decisions based on the learned patterns.
Machine learning can be broadly categorized into three main types, based on the learning approach and
the availability of labeled data:
Supervised Learning:
In supervised learning, models are trained on labeled data, where both input features and their
corresponding output labels are known.
The goal is to learn a mapping function from input to output, enabling the model to predict labels for new,
unseen data.
Common algorithms include linear regression, logistic regression, decision trees, and neural networks.
Unsupervised Learning:
Unsupervised learning is used when the data lacks labeled output, and the goal is to discover patterns,
structures, or groupings within the data.
Clustering and dimensionality reduction are common tasks in unsupervised learning.
Algorithms include k-means clustering, hierarchical clustering, and principal component analysis (PCA).
Reinforcement Learning:
In reinforcement learning, an agent learns by interacting with an environment: it takes actions, receives rewards or penalties, and gradually adjusts its strategy (policy) to maximize cumulative reward.
Typical applications include game playing, robotics, and other sequential decision-making problems.
Machine learning plays a critical role in various applications, including image and speech recognition,
natural language processing, recommendation systems, autonomous vehicles, and many others. It has
become an essential tool for extracting knowledge and making predictions from vast and complex
datasets.
Q13. Apply supervised learning algorithm: Linear and
Logistic Regression.
1. Linear Regression:
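A minimal sketch of the model fit; the training data and variable names are assumptions consistent with the prediction step below.
# Training data: Y grows linearly with X
data <- data.frame(X = c(1, 2, 3, 4, 5), Y = c(2, 4, 6, 8, 10))
# Fit a simple linear regression model
model <- lm(Y ~ X, data = data)
summary(model)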
# Make predictions
new_data <- data.frame(X = c(6, 7, 8))
predictions <- predict(model, newdata = new_data)
Output:
2. Logistic Regression:
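A minimal sketch of the model fit; the training data are illustrative assumptions consistent with the prediction step below.
# Training data: pass/fail outcome as a function of exam score
data <- data.frame(
  Exam_Score = c(50, 60, 70, 80, 90, 100),
  Passed = c(0, 0, 1, 0, 1, 1)
)
# Fit a logistic regression model
model <- glm(Passed ~ Exam_Score, data = data, family = binomial)
summary(model)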
# Make predictions
new_data <- data.frame(Exam_Score = c(65, 95, 75))
predictions <- predict(model, newdata = new_data, type = "response")
Output:
Q14. Apply supervised learning algorithm: Decision Trees
and Random Forest.
1. Decision Trees:
Decision Trees are used for classification and regression tasks. They create a tree-like model of decisions
and their possible consequences. In R, you can use the rpart package to build decision trees.
Classification Example:
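A minimal sketch of the tree fit; the training data are illustrative assumptions consistent with the prediction step below.
library(rpart)
# Training data: purchase decision by age
data <- data.frame(
  Age = c(22, 25, 30, 35, 40, 45, 50, 55),
  Buys = factor(c("No", "No", "No", "Yes", "Yes", "Yes", "Yes", "No"))
)
# Fit a classification tree (minsplit lowered because the dataset is tiny)
model <- rpart(Buys ~ Age, data = data, method = "class",
               control = rpart.control(minsplit = 2))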
# Make predictions
new_data <- data.frame(Age = c(38, 52))
predictions <- predict(model, new_data, type = "class")
Regression Example:
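A minimal sketch of the regression tree fit; the data are illustrative assumptions consistent with the prediction step below.
library(rpart)
# Training data: numeric response Y against X
data <- data.frame(X = c(1, 2, 3, 4, 5), Y = c(2.1, 4.2, 6.1, 8.3, 9.9))
# Fit a regression tree
model <- rpart(Y ~ X, data = data, method = "anova",
               control = rpart.control(minsplit = 2))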
# Make predictions
new_data <- data.frame(X = c(6, 7, 8))
predictions <- predict(model, new_data)
2. Random Forest:
Random Forest is an ensemble learning method that combines multiple decision trees to improve
accuracy and reduce overfitting. In R, you can use the randomForest package.
Classification Example:
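A minimal sketch of the random forest fit; the data are illustrative assumptions consistent with the prediction step below.
library(randomForest)
# Training data: two numeric features and a class label
data <- data.frame(
  Feature1 = c(1, 2, 3, 4, 5, 6, 7, 8),
  Feature2 = c(8, 7, 6, 5, 4, 3, 2, 1),
  Class = factor(c("A", "A", "A", "A", "B", "B", "B", "B"))
)
# Fit a random forest classifier
model <- randomForest(Class ~ Feature1 + Feature2, data = data, ntree = 100)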
# Make predictions
new_data <- data.frame(Feature1 = c(6, 7), Feature2 = c(5, 4))
predictions <- predict(model, new_data)
Regression Example:
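A minimal regression sketch with randomForest; the data and names are illustrative assumptions.
library(randomForest)
# Training data: numeric response Y against X
data <- data.frame(
  X = c(1, 2, 3, 4, 5, 6, 7, 8),
  Y = c(2.2, 4.1, 6.3, 8.2, 10.1, 11.9, 14.2, 16.1)
)
# Fit a random forest regression model
model <- randomForest(Y ~ X, data = data, ntree = 100)
# Make predictions
new_data <- data.frame(X = c(9, 10))
predictions <- predict(model, new_data)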
Q15. Apply unsupervised learning algorithm: K-Means and Hierarchical Clustering.
1. K-Means Clustering:
K-Means is a partitioning method that divides a dataset into K clusters based on similarity. It aims to
minimize the sum of squared distances within each cluster.
# Create a dataset
data <- data.frame(
X = c(1, 2, 2, 3, 5, 6, 7, 8, 8, 9),
Y = c(2, 3, 4, 3, 6, 5, 6, 7, 8, 8)
)
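A minimal sketch of running K-Means on the dataset above; the choice of two clusters is an assumption.
# Run K-Means with 2 clusters (set a seed for reproducible assignments)
set.seed(123)
kmeans_result <- kmeans(data, centers = 2)
# Inspect cluster assignments and centers
print(kmeans_result$cluster)
print(kmeans_result$centers)
# Plot the points coloured by cluster
plot(data, col = kmeans_result$cluster, pch = 19, main = "K-Means Clustering (k = 2)")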
2. Hierarchical Clustering:
Hierarchical Clustering creates a tree-like structure (dendrogram) that represents the relationships
between data points. You can then cut the dendrogram at a certain level to form clusters.
# Create a dataset
data <- data.frame(
X = c(1, 2, 2, 3, 5, 6, 7, 8, 8, 9),
Y = c(2, 3, 4, 3, 6, 5, 6, 7, 8, 8)
)
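A minimal sketch of hierarchical clustering on the dataset above; complete linkage and two clusters are assumptions.
# Compute pairwise distances and build the dendrogram
dist_matrix <- dist(data)
hc_result <- hclust(dist_matrix, method = "complete")
# Plot the dendrogram
plot(hc_result, main = "Hierarchical Clustering Dendrogram")
# Cut the tree into 2 clusters
clusters <- cutree(hc_result, k = 2)
print(clusters)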
K-Means and Hierarchical Clustering are powerful techniques for discovering natural groupings within
data. The choice between them depends on the nature of the data and the desired number of clusters.
Experimenting with different clustering techniques and evaluating their results is common practice in
unsupervised learning.
Q16. Perform Principal Component Analysis.
Principal Component Analysis (PCA) is a dimensionality reduction technique used to reduce the
complexity of high-dimensional data while preserving important information. It does this by transforming
the original variables into a new set of uncorrelated variables called principal components. Here's how to
perform PCA in R:
1. Load Data:
Load your dataset into R. For this example, let's assume you have a dataset named my_data with features
in columns.
2. Standardize the Data:
PCA is sensitive to the scale of the data, so it's a good practice to standardize it to have zero mean and
unit variance. You can use the scale() function for this.
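A minimal standardization sketch; the numeric columns of the built-in iris dataset stand in for my_data.
# Example data standing in for my_data
my_data <- iris[, 1:4]
# Standardize every column to zero mean and unit variance
scaled_data <- scale(my_data)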
3. Perform PCA:
Use the prcomp() function to perform PCA on the standardized data. You can specify the number of
principal components you want to keep.
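A minimal PCA sketch on the standardized data from the previous step.
# Perform PCA
pca_result <- prcomp(scaled_data)
# Proportion of variance explained by each component
summary(pca_result)
# Scree plot of the component variances
screeplot(pca_result, type = "lines")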
4. Explore Results:
You can access various attributes of the PCA result to explore the analysis, including the standard deviations of the components (sdev), the variable loadings (rotation), and the transformed scores (x).
You can interpret the principal components based on the loadings of the original variables on each
component. Positive or negative loadings indicate the direction and strength of the variables' influence on
the principal components.
Based on the scree plot and the amount of variance explained, decide how many principal components to
retain for your analysis.
Use the predict() function to transform your data into the space of the selected principal components.
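A minimal sketch of projecting data onto the principal components, continuing the example above.
# Transform the standardized data into principal-component space
pc_scores <- predict(pca_result, newdata = scaled_data)
head(pc_scores)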
Q17. Perform time series analysis.
Time series analysis is a crucial technique for analyzing and modeling data that varies over time, such as
stock prices, temperature records, or sales data. In R, you can perform time series analysis using various
packages, but one of the most commonly used packages is stats for basic time series analysis and the
forecast package for more advanced forecasting tasks. Here's a step-by-step guide on performing time
series analysis in R:
You can create a time series object in R using the ts() function or load time series data from a file. Ensure
that your data has a timestamp or time index.
# Create a time series object (e.g., monthly data from 2020 to 2021)
ts_data <- ts(c(10, 15, 20, 25, 30, 35), start = c(2020, 1), frequency = 12)
To understand your data better, it's essential to plot the time series.
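A minimal plotting sketch for the series created above.
# Plot the series to inspect trend and seasonality
plot(ts_data, main = "Monthly Time Series", ylab = "Value")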
Decomposing a time series helps to separate it into its constituent components, such as trend, seasonality,
and noise.
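A minimal decomposition sketch; the built-in AirPassengers dataset is used because decomposition needs at least two full seasonal cycles.
# Split a monthly series into trend, seasonal, and random components
decomposed <- decompose(AirPassengers)
plot(decomposed)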
You can use various models like ARIMA (AutoRegressive Integrated Moving Average) or Exponential
Smoothing for forecasting time series data.
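A minimal model-fitting sketch using the forecast package; AirPassengers again stands in for a longer series, and arima_model is the object used in the forecasting step below.
library(forecast)
# Automatically select and fit an ARIMA model
arima_model <- auto.arima(AirPassengers)
summary(arima_model)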
7. Make Forecasts:
# Make forecasts
forecast_values <- forecast(arima_model, h = 12) # Forecast for the next 12 time periods
plot(forecast_values, main = "Time Series Forecast")
You can evaluate the accuracy of your forecasts using metrics like Mean Absolute Error (MAE) or Mean
Squared Error (MSE).
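A minimal evaluation sketch using the forecast package's accuracy() helper, which reports MAE, RMSE, and related measures on the training data.
# Training-set error measures for the fitted model
accuracy(arima_model)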
Plot the original time series data along with the forecasted values.
Q18. Manipulate text data for sentiment analysis.
Text data manipulation is an essential step in preparing data for sentiment analysis in R. In sentiment
analysis, you typically need to clean and preprocess the text data to make it suitable for analysis. Here are
the key steps involved in manipulating text data for sentiment analysis in R:
First, you need to load the necessary libraries for text data manipulation and sentiment analysis.
Commonly used packages include tm (Text Mining), stringr, and tidytext.
library(tm)
library(stringr)
library(tidytext)
library(dplyr)  # provides the pipe (%>%) and joins used below
Load your text data, which could be in a CSV file, a data frame, or a text corpus. Ensure that your data
contains a column with the text you want to analyze.
# Load your text data (replace 'your_data.csv' with your data source)
text_data <- read.csv("your_data.csv")
3. Data Cleaning:
Text data is often messy, so you need to clean it by removing special characters, numbers, and other
unwanted elements. You can also convert the text to lowercase.
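A minimal cleaning sketch with the tm package, assuming the column holding the raw text is named text (as in the later code).
# Build a corpus from the text column and clean it
corpus <- VCorpus(VectorSource(text_data$text))
corpus <- tm_map(corpus, content_transformer(tolower))        # lower case
corpus <- tm_map(corpus, removePunctuation)                   # strip punctuation
corpus <- tm_map(corpus, removeNumbers)                       # strip numbers
corpus <- tm_map(corpus, removeWords, stopwords("english"))   # drop stop words
corpus <- tm_map(corpus, stripWhitespace)                     # collapse extra spaces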
4. Tokenization:
Tokenization is the process of splitting text into individual words or tokens, making it suitable for
analysis.
# Tokenize the text into individual words (tidytext operates on data frames)
text_tokens <- text_data %>%
  unnest_tokens(word, text)
5. Sentiment Analysis:
You can use sentiment lexicons or pre-trained models to perform sentiment analysis. For example, the
tidytext package provides a sentiment lexicon, and you can use it to determine the sentiment of each word
in the text.
# Transform the text data into one row per word and join a sentiment lexicon
text_sentiment <- text_data %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("nrc"))
6. Analyze Sentiment:
You can now analyze the sentiment of the text data by aggregating and summarizing the sentiment scores.
# Count how many words fall into each sentiment category
sentiment_summary <- text_sentiment %>%
  count(sentiment, sort = TRUE)
head(sentiment_summary)
This will give you a summary of sentiment scores for each document or text element.
7. Interpret Results:
Based on the sentiment scores, you can interpret whether the text is generally positive, negative, or
neutral.
These are the fundamental steps for manipulating text data for sentiment analysis in R. Depending on the
complexity of your analysis, you may need to explore additional text preprocessing and sentiment
analysis techniques, such as custom lexicons or machine learning models for sentiment classification.
Q19. Implement topic modelling in R.
Topic modeling is a technique used to discover topics or themes within a collection of documents. In R,
the tm and topicmodels packages are commonly used for topic modeling. Here's a step-by-step guide on
how to implement topic modeling in R:
First, load the necessary libraries for text preprocessing and topic modeling.
library(tm)
library(topicmodels)
Load your text data and preprocess it, similar to the steps for sentiment analysis. Cleaning, tokenization,
and creating a Document-Term Matrix (DTM) are crucial.
# Load your text data (replace 'your_data.csv' with your data source)
text_data <- read.csv("your_data.csv")
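A minimal preprocessing sketch that builds the Document-Term Matrix; the name of the text column is an assumption.
# Clean the text and build a Document-Term Matrix
corpus <- VCorpus(VectorSource(text_data$text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
dtm <- DocumentTermMatrix(corpus)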
Now, you can build a topic model using the LDA() function from the topicmodels package. Specify the
number of topics (k) you want to discover.
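A minimal model-fitting sketch; the choice of three topics and the seed are assumptions, and lda_model is the object used in the steps below.
# Fit an LDA topic model with k = 3 topics
lda_model <- LDA(dtm, k = 3, control = list(seed = 123))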
4. Explore Topics:
You can explore the topics and associated words using the terms() function. This will give you a list of
words for each topic.
# Explore topics
terms(lda_model, 5) # Show the top 5 words for each topic
5. Assign Topics to Documents:
You can assign the most likely topic to each document using the topics() function from the topicmodels package.
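A minimal sketch, continuing the example above.
# Most likely topic for each document
document_topics <- topics(lda_model)
head(document_topics)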
6. Interpret Topics:
Inspect the top words in each topic to interpret what each topic represents. This will help you label the
topics based on the words associated with them.
7. Visualize Topics:
You can visualize the topics and their relationships using various visualization techniques, including word
clouds, bar plots, or network graphs.
Topic modeling is a valuable technique for discovering latent themes or topics in text data. The choice of
the number of topics (k) is a crucial decision and might require experimentation. Additionally, topic
modeling can be further enhanced with more advanced techniques, such as using other topic modeling
algorithms or performing sentiment analysis within each topic to gain deeper insights.
Q20. Review and final project using a combination of the
techniques learned.
A final project that combines various data analysis and machine learning techniques can be an excellent
way to apply what you've learned in R. Such a project can be both challenging and rewarding. Here's a
sample project idea that combines multiple techniques:
In this project, you'll work on a simulated dataset from a telecom company to predict customer churn.
You'll combine data preprocessing, exploratory data analysis, feature engineering, machine learning, and
evaluation techniques. Here's an outline of the project:
1. Data Collection:
Start by obtaining the dataset. You can simulate customer data with features such as customer
demographics, usage patterns, contract details, and customer churn status (whether they churned or not).
2. Data Preprocessing:
Clean the data: handle missing values, encode categorical variables, and remove or correct inconsistent records.
3. Exploratory Data Analysis (EDA):
Conduct EDA to understand the relationships between different features and the target variable (churn).
Visualize the data using various plots and charts to gain insights.
4. Feature Engineering:
Create new features or modify existing ones that may be useful for predicting customer churn.
Extract relevant information from features like contract length and usage patterns.
5. Machine Learning:
Select machine learning algorithms suitable for the classification task (e.g., logistic regression, decision
trees, random forests, or gradient boosting).
Train and evaluate multiple models using cross-validation.
Tune hyperparameters to optimize model performance.
Consider ensembling techniques if necessary.
6. Model Evaluation:
Evaluate model performance using metrics such as accuracy, precision, recall, F1-score, and ROC AUC.
Create a confusion matrix and visualize it.
Consider plotting the ROC curve and Precision-Recall curve.
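As an illustrative sketch of steps 5 and 6, the following fits a random forest on simulated churn data and evaluates it with a confusion matrix; every variable name and value here is an assumption, not the actual project dataset.
library(randomForest)
# Simulate a small churn dataset
set.seed(42)
n <- 500
churn_data <- data.frame(
  tenure_months   = sample(1:72, n, replace = TRUE),
  monthly_charges = round(runif(n, 20, 120), 2),
  contract_type   = factor(sample(c("Monthly", "OneYear", "TwoYear"), n, replace = TRUE))
)
# Churn is more likely for short tenures and monthly contracts
prob <- plogis(-1 + 0.03 * (36 - churn_data$tenure_months) +
                 ifelse(churn_data$contract_type == "Monthly", 1, -0.5))
churn_data$Churn <- factor(ifelse(runif(n) < prob, "Yes", "No"))
# Train/test split
train_idx <- sample(seq_len(n), size = 0.7 * n)
train <- churn_data[train_idx, ]
test  <- churn_data[-train_idx, ]
# Fit a random forest classifier
model <- randomForest(Churn ~ ., data = train, ntree = 200)
# Evaluate with a confusion matrix and overall accuracy
pred <- predict(model, test)
confusion <- table(Predicted = pred, Actual = test$Churn)
print(confusion)
print(sum(diag(confusion)) / sum(confusion))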
7. Interpretation:
Interpret the model results to understand which features are most influential in predicting churn.
Identify actionable insights that the telecom company can use to reduce customer churn.
8. Reporting:
Create a report or presentation summarizing the project, including data preprocessing, EDA, modeling,
and results.
Clearly explain the methodology and key findings.
Present the predictive model's performance and its implications for the telecom company.
9. Future Recommendations: