
COMP1810 - Data and Web Analytics


001306081 COMP1810 - Data and Web Analytics

University of Greenwich ID Number: 001306081

FPT Student ID Number: GCD220190

Module Code: COMP1810

Module Assessment Title: Coursework

Lecturer Name: Tran Trong Minh

Submission Date: 13.08.2024

Number of words: 3782 (excluding code fields)


Table of Contents
1. Introduction
2. e-shop clothing 2008
   2.1. Import file
   2.2. Clean and normalise the data frame
   2.3. Number of orders
   2.4. Correlation Matrix
   2.5. Sales by month
      2.5.1. Total Sales by Day in April
      2.5.2. Sales of each category by month
      2.5.3. Mean Price by Category
      2.5.4. Number of times a category appears on a page
      2.5.5. Total Sales by Colour
      2.5.6. Total Sales by Price
   2.6. Revenue by month
      2.6.1. Revenue of each category by month
   2.7. Conclusion
   2.8. Recommended solutions
      2.8.1. Optimise product placement
      2.8.2. Promote best-selling products
      2.8.3. Enhance website navigation
3. input.csv
   3.1. Import file
   3.2. Clean and normalise the data
   3.3. Mean salary
   3.4. Median salary
   3.5. Mode of salary
   3.6. Variance of salary
   3.7. Standard deviation of salary
   3.8. Descriptive Analysis
4. star wars
   4.1. Access file
   4.2. Clean and normalise
   4.3. Actors without black eyes and taller than 150 cm
      4.3.1. Normalise the “height” column
      4.3.2. Extract actors taller than 150 cm and without black eyes
      4.3.3. Distribution of actors’ height
      4.3.4. Distribution of actors’ eye colour
      4.3.5. Distribution of tall actors without black eyes
   4.4. BMI
      4.4.1. Normalise the “mass” column
      4.4.2. BMI of the actors
      4.4.3. Height vs BMI
      4.4.4. Mass vs BMI
5. References


1. Introduction
In this coursework, I import, clean, normalise, and analyse the given datasets to draw insights from the scenario and provide solutions for an e-shop clothing store, enabling it to optimise its website and increase revenue and traffic.

For the "input.csv" file, I clean and normalise the data and carry out a descriptive analysis of the dataset.

Lastly, from the “starwars” dataset of the ‘dplyr’ library, I clean and normalise the data, extract actors that meet specific requirements, and then calculate and analyse their BMI.

2. e-shop clothing 2008


2.1. Import file
The data I have been provided has the “.xlsx” extension, and base R cannot read Excel files without an additional package, so I import the “readxl” library, which supports reading Excel files.

# install the package if needed
# install.packages("readxl")

# import library for reading Excel files
library(readxl)

# read the Excel file
df <- read_excel('e-shop clothing 2008.xlsx')


Figure 1. Raw df when imported

2.2. Clean and normalise the data frame


As can be seen above, the data frame is not clean and is hard to read and analyse, so I go through several steps to clean the data.

# install packages if needed
# install.packages('dplyr')
# install.packages('tidyr')

# import necessary libraries for cleaning data
library(dplyr)
library(tidyr)

# split and rename columns
df <- df %>%
  separate(col = colnames(df)[1],
           into = c('year', 'month', 'day', 'click_stream', 'country',
                    'session_id', 'main_category', 'clothing_model', 'colour',
                    'location', 'model_photography', 'price_usd', 'price_2', 'page'),
           sep = ";", fill = "right")

Figure 2. Split df into according columns


First, I import necessary libraries for data manipulation and cleaning, which are ‘dplyr’
and ‘tidyr’. Next, I pass the df through a series of transformations. The %>% operator
(pipe operator) allows chaining operations in a readable way.

col = colnames(df)[1] specifies the column to be split; colnames(df)[1] refers to the first column of the data frame.

into = c(...) defines the names of the new columns that will be created from the
split.

sep = ";" specifies the delimiter used to split the values in the original column.

fill = "right" indicates how to handle missing values: if a row has fewer values than the number of new columns, the remaining columns are padded on the right with NA.
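As a small, hypothetical illustration of this fill behaviour (toy strings in base R rather than tidyr, so the names and values here are made up): a row with fewer fields than target columns ends up padded with NA on the right.

```r
# toy example: the second "row" is missing its last field
raw <- c("2008;4;1", "2008;4")
parts <- strsplit(raw, ";")

# pad each split row on the right with NA up to 3 columns,
# mimicking separate(..., fill = "right")
padded <- t(sapply(parts, function(p) { length(p) <- 3; p }))
colnames(padded) <- c("year", "month", "day")
```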

The data values are in the correct columns now. I renamed the columns based on the description in the “shop clothing infor 2008.txt” file. Previously, there was a column named “order (sells)”; however, it is described as the sequence of clicks during one session, so I renamed it "click_stream". The other columns were renamed for easier understanding (for example, “main_category” instead of “page 1 (main category)”).

Next, I convert the data types to integers for easy calculation and analysis.

# convert data to numeric
df <- as.data.frame(lapply(df, as.integer))

Figure 3. Warning when converting data type

There is a warning that NA values were introduced. The values of the “clothing_model” column cannot be converted to integers since they contain letters, so they are set to NA.


Figure 4. Values of “clothing_model” set to NA

Looking through the data frame and the description file, we can conclude that all the data is from the year 2008. As for the model code, we have no further information about what it means or how it could help our analysis, so I decided to drop the “year” and “clothing_model” columns.

# drop column "year" and "clothing_model"


df <- df[, !(colnames(df) == "year") ]
df <- df[, !(colnames(df) == "clothing_model") ]

Figure 5. Data frame after dropping “year” and “clothing_model” columns

2.3. Number of orders


Another important piece of information is the number of orders, which is not included in the data frame. There is no clear definition of how an order is counted, so I assume that the last click of each session is when the customer places the order. Under this assumption, the row with the highest click-stream value in each session is counted as an order.

# add new column 'product_sold'
# if click_stream is the max of the session, product_sold = 1, else 0
df$click_stream <- as.numeric(df$click_stream)

df <- df %>%
  group_by(session_id) %>%
  mutate(product_sold = ifelse(click_stream == max(click_stream), 1, 0)) %>%
  ungroup()

Figure 6. Added column “product_sold”
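As a sanity check of this "last click = order" logic on a hypothetical mini session log (a base-R equivalent of the dplyr code, with made-up data), only the row carrying each session's maximum click_stream gets flagged:

```r
# hypothetical click log: session 1 has three clicks, session 2 has two
toy <- data.frame(session_id   = c(1, 1, 1, 2, 2),
                  click_stream = c(1, 2, 3, 1, 2))

# per-session maximum, then flag the final click of each session
max_click <- ave(toy$click_stream, toy$session_id, FUN = max)
toy$product_sold <- ifelse(toy$click_stream == max_click, 1, 0)
```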

2.4. Correlation Matrix


I chose to plot a correlation matrix to see whether the “product_sold” column has a strong correlation with any other column.

# draw correlation matrix
library(reshape2)
library(ggplot2)

# calculate the correlation matrix
cor_matrix <- cor(df)

# convert the correlation matrix to long format
cor_long <- reshape2::melt(cor_matrix)

# plot the correlation matrix using ggplot2
ggplot(cor_long, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = 0, limits = c(-1, 1),
                       guide = guide_colorbar(title = "Correlation")) +
  labs(title = "Correlation Matrix", x = "", y = "") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  geom_text(aes(label = round(value, 2)), colour = "black", size = 3)

Figure 7. Correlation Matrix of the columns in the data frame

Apart from the correlation between the “month” and “session_id” columns (which is 0.97), there are no significant correlations between the columns, including the “product_sold” column. As for the “month” and “session_id” correlation, there is no clear explanation of how the two relate or what analysis can be drawn from it.
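For reference, cor() computes Pearson correlations pairwise over the numeric columns; a minimal sketch on made-up vectors (not the shop data) shows the matrix it returns:

```r
# two perfectly linearly related vectors and one noisy vector
x <- c(1, 2, 3, 4, 5)
y <- 2 * x + 1            # exact linear function of x
z <- c(3, 1, 4, 1, 5)     # no strong relationship with x

m <- cor(cbind(x, y, z))  # 3 x 3 symmetric correlation matrix
```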

2.5. Sales by month

To find insights in the data frame, it is essential to calculate and visualise sales by month. The total sales for a month are simply the sum of the orders in that month.


# calculate total sales by month
sales_by_month <- df %>%
  group_by(month) %>%
  summarise(total_sales = sum(product_sold, na.rm = TRUE))

# plot bar chart
library(ggplot2)
ggplot(sales_by_month,
       aes(x = month, y = total_sales, fill = month)) +
  geom_bar(stat = "identity") +
  labs(title = "Total Sales by Months",
       x = "Month", y = "Total Sales") +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

Figure 8. Total Sales by Months

As can be seen in the figure above, the sales in April are the highest. To better understand what could lead to this result, I performed some further analysis and visualisation.


2.5.1. Total Sales by Day in April


After visualising the total sales by month, it is noticeable that the sales in April are much higher than in other months. Therefore, visualising the sales by day in April might yield some insight.

# calculate daily sales in April


april_daily_sales <- df %>%
filter(month == 4) %>%
group_by(day) %>%
summarise(total_sales = sum(product_sold, na.rm = TRUE))

# plot bar chart


ggplot(april_daily_sales,
aes(x = day,
y = total_sales,
fill = day)) +
geom_bar(stat = "identity") +
labs(title = "Daily Sales in April",
x = "Day",
y = "Total Sales")

Figure 9. Daily sales in April


The sales in the first week of April are much higher than on the other days of the month. I could not find any information related to why these sales are so high; I would assume it is because of marketing campaigns at that time.

2.5.2. Sales of each category by month


The distribution of sales might be because one category is more popular than another. In the data frame, the category values are numbers from 1 to 4; for easier understanding and analysis, I changed the numbers to the corresponding values stated in the information file.

# change value of main_category


# 1 = trousers; 2 = skirts; 3 = blouses; 4 = sale
df <- df %>%
mutate(main_category = recode(main_category,
'1' = 'trousers',
'2' = 'skirts',
'3' = 'blouses',
'4' = 'sale'))

Then I calculated the total sales by category by grouping the data frame by “main_category” and summing "product_sold".

# total sales by category
sales_by_category <- df %>%
  group_by(main_category) %>%
  summarise(total_sales = sum(product_sold, na.rm = TRUE))

# plot bar chart


ggplot(sales_by_category,
aes(x = main_category,
y = total_sales,
fill = main_category)) +
geom_bar(stat = "identity") +
labs(title = "Total Sales by Categories",
x = "Category",
y = "Total Sales") +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE))


Figure 10. Total Sales by Categories

Next, I calculated the sales of each category by month by grouping the data frame by “month” and “main_category” and summing "product_sold", then plotting the bar chart.

# calculate sales of each category for each month.


category_sales_by_month <- df %>%
group_by(main_category, month) %>%
summarise(category_sales = sum(product_sold, na.rm = TRUE)) %>%
ungroup()

# plot bar chart
ggplot(category_sales_by_month,
       aes(x = month, y = category_sales, fill = main_category)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Total Sales of Each Category By Month",
       x = "Month", y = "Total Sales") +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

Figure 11. Total Sales of Each Category by Months

According to figures 10 and 11, “trousers” have the highest sales, while the “sale” category is in third place. The reason might lie in the prices of the products in each category, so I will calculate and analyse whether price affects sales.

2.5.3. Mean Price by Category


To calculate the mean price of each category, I group the data frame by category and then compute the mean price using the mean() function. .groups = 'drop' drops all grouping after summarising.

Then, I use ggplot to plot a bar chart with the category on the x-axis and the mean price on the y-axis.

# calculate mean price of each category
mean_price_by_category <- df %>%
  group_by(main_category) %>%
  summarise(mean_price = mean(price_usd), .groups = 'drop')

# plot bar chart


ggplot(mean_price_by_category,
aes(x = main_category,
y = mean_price,
fill = main_category)) +
geom_bar(stat = 'identity') +
labs(title = "Mean Price of Each Category",
x = "Category",
y = "Mean Price")

Figure 12. Mean Price by Category

Looking at the bar chart, we see that the mean prices of trousers and skirts are the highest, which contradicts my expectation that customers purchase them the most because of low prices. To further investigate why “trousers” and "skirts" are purchased the most, I analysed which pages they appear on most often.


2.5.4. Number of times a category appears on a page


I counted the number of times a category appears on a page by grouping the data
frame by “main_category” and “page,” then counting the number of entries.

# calculate the number of times a category appears on a page


category_page_count <- df %>%
group_by(main_category, page) %>%
summarise(count = n(), .groups = 'drop')

# plot a stacked bar chart


ggplot(category_page_count,
aes(x = factor(page),
y = count,
fill = main_category)) +
geom_bar(stat = 'identity') +
labs(title = "Category Count by Page Number",
x = "Page Number",
y = "Count")

Figure 13. Category Count by Page Number


I use a stacked bar chart to better see the distribution of each category on a page. From figure 13, we can see that most of the products appear on the first page; however, nearly half of them are from the “trousers” category, followed by the “skirts” category. With this bar chart, it is understandable why “trousers” have the most sales: they appear mostly on the first page and are spread over the first three pages, which makes them easier for customers to locate.

2.5.5. Total Sales by Colour


To further analyse customers’ behaviour, I summarised and visualised some attributes of the products. This can also help find trends among the customers.

First, I changed the colour code numbers to the actual colour names to make the data easier to understand.

# change numeric values to colour name


# 1 = beige; 2 = black; 3 = blue; 4 = brown; 5 = burgundy; 6 = gray; 7 =
green; 8 = navy; 9 = multi-colour; 10 = olive; 11 = pink; 12 = red; 13 =
violet; 14 = white
df <- df %>%
mutate(colour = recode(colour,
'1' = 'beige', '2' = 'black',
'3' = 'blue', '4' = 'brown',
'5' = 'burgundy', '6' = 'gray',
'7' = 'green', '8' = 'navy',
'9' = 'multi-colour', '10' = 'olive',
'11' = 'pink', '12' = 'red',
'13' = 'violet', '14' = 'white'))

# sales by colour
sales_by_colour <- df %>%
  group_by(colour) %>%
  summarise(colour_sales = sum(product_sold, na.rm = TRUE)) %>%
  ungroup() %>%
  arrange(desc(colour_sales))

# plot chart
ggplot(sales_by_colour,
       aes(x = reorder(colour, -colour_sales),
           y = colour_sales, fill = colour)) +
  geom_bar(stat = 'identity') +
  labs(title = "Total Sales by Colour",
       x = "Colour", y = "Total Sales")

Figure 14. Total Sales by Colour

As can be seen in the chart above, the top 3 most popular colours are black, blue, and
gray, which are neutral colours and can mix and match well with each other.

2.5.6. Total Sales by Price


Next, I calculated the total sales by price to see how people spend their money on clothing. I would expect customers to spend on mid-range-priced clothing rather than cheap clothes, since low-priced clothes might not be of good quality.

# calculate total sales by price
sales_by_price <- df %>%
  group_by(price_usd) %>%
  summarise(price_sales = sum(product_sold, na.rm = TRUE)) %>%
  ungroup()

# plot chart
ggplot(sales_by_price, aes(x = price_usd, y = price_sales)) +
  geom_line(color = "black", linewidth = 1) +
  geom_smooth(method = "loess", color = "red", se = FALSE) +
  geom_point() +
  scale_x_continuous(breaks = sales_by_price$price_usd,
                     labels = sales_by_price$price_usd) +
  labs(title = "Total Sales by Price",
       x = "Price", y = "Total Sales")

Figure 15. Total Sales by Price

As can be seen, the most popular price range is from $28 to $48. I think for most
people, this is a reasonable price range for clothes.


2.6. Revenue by month


The sales in April were the highest, so I assume that the revenue in April is also higher than in other months, and that the “Total Revenue by Month” bar chart will not differ much from the “Total Sales by Months” chart.

The total revenue by month can be found by grouping the data frame by month and summing price multiplied by product_sold.

# calculate revenue by month
rev_by_month <- df %>%
  group_by(month) %>%
  summarise(total_revenue = sum(price_usd * product_sold), .groups = 'drop')

# plot bar chart


ggplot(rev_by_month,
aes(x = month,
y = total_revenue,
fill = month)) +
geom_bar(stat = 'identity') +
labs(title = "Total Revenue by Month",
x = "Month",
y = "Revenue") +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

Figure 16. Total Revenue by Month


2.6.1. Revenue of each category by month

# total revenue by category
rev_by_category <- df %>%
  group_by(main_category) %>%
  summarise(total_revenue = sum(price_usd * product_sold, na.rm = TRUE))
print(rev_by_category)

# plot bar chart


ggplot(rev_by_category,
aes(x = main_category,
y = total_revenue,
fill = main_category)) +
geom_bar(stat = "identity") +
labs(title = "Total Revenue by Categories",
x = "Category",
y = "Total Revenue") +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

Figure 17. Revenue by Categories


Looking at figure 17, the revenue of “trousers” and “skirts” is almost the same. This might be because the price of “skirts” is higher than that of “trousers” while a similar number of products were sold.

Then, I calculated and plotted the chart of revenue for each category by month.

# calculate revenue for each category of each month


category_rev_by_month <- df %>%
group_by(main_category, month) %>%
summarise(category_rev = sum(price_usd * product_sold, na.rm = TRUE),
.groups = 'drop')

# plot bar chart


ggplot(category_rev_by_month,
aes(x = month,
y = category_rev,
fill = main_category)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Total Revenue of Each Category By Month",
x = "Month",
y = "Total Revenue") +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

Figure 18. Total Revenue of Each Category by Month


As can be seen in figure 18, in April the revenue of “skirts” is not as high as that of “trousers,” but it is not far behind. In the other months, the revenue of “skirts” is about the same as or higher than that of "trousers." Besides, even though the sales of the “sale” category are decent, its low prices mean this category does not bring much revenue to the clothing store.

2.7. Conclusion
After the analysis, I can conclude that the shop earns a good amount of revenue from the “trousers” and “skirts” categories. They have the highest prices and appear most often on the first and second pages of the website, which reach more customers. On the other hand, despite decent sales, the “sale” category did not contribute significantly to the e-shop's revenue due to its low prices.

Additionally, it is clear that neutral colours are the most favoured among customers,
whereas brighter colours are less appealing.

2.8. Recommended solutions

To further boost the revenue and traffic of the e-shop, here are some recommendations that I think would help the e-shop owner optimise the website.

2.8.1. Optimise product placement

Since products on the first page receive the most views and the most sales, the shop could place the items it wants to push, such as the under-performing “sale” category, more prominently on the first pages, while keeping the best-selling “trousers” and “skirts” easy to find. Spreading each category across the first few pages, as is already the case for “trousers,” would help customers locate products more easily.

2.8.2. Promote best-selling products


With the popularity of “trousers” and "skirts," these categories might bring more
revenue with targeted promotions or discounts. Especially during slow months like
August, with the right promotions of these products, the e-shop could maintain sales
momentum and drive further growth. Marketing campaigns that especially highlight
these items might attract more customers, leading to increased sales and revenue.


Besides, clothes in neutral colours sold the most, so the store might focus more on products in these colours. It might also benefit from promoting products in the store's mid-range price band of $28 to $48.

2.8.3. Enhance website navigation


Improving website navigation can make popular products more accessible and increase exposure for less popular items. Testing different layouts and tracking user interactions can provide insights into how customers browse and engage with the site. This data can then be used to optimise the site's design, ensuring that all categories have a fair chance of being noticed and purchased.

3. input.csv
3.1. Import file
First, the data we have been provided is not in the “.csv” format stated in the coursework, but in “.xlsx” instead, so I cannot use read.csv() to read it. To read the data file, we must first import the library that supports reading Excel files, ‘readxl’.

# install the package if not already installed
# install.packages('readxl')

# import package
library(readxl)

Next, we will use the read_excel() function to import the ‘input.xlsx’ file.

input_df <- read_excel('input.xlsx')


View(input_df)


Figure 19. Raw input.xlsx dataset

3.2. Clean and normalise the data


As shown in the figure above, the format is not correct: all the values of the dataset are in the “id” column. We can separate the values into the correct columns with the separate() function from the ‘tidyr’ library, and use the pipe operator (%>%) from ‘dplyr’ to chain the changes to the dataset.

# install the libraries if not installed
# install.packages('dplyr')
# install.packages('tidyr')

# import the libraries (tidyr provides separate())
library(dplyr)
library(tidyr)

# normalise the data
input_df <- read_excel('input.xlsx', col_names = TRUE, na = "NA")
input_df <- input_df %>%
  separate(id, into = c("id", "name", "salary", "start_date", "dept"),
           sep = "\\s+", fill = "right", extra = "merge")


Figure 20. input_df after splitting the columns

However, as shown in the figure above, the ID appears to be duplicated, which leaves the values under the wrong headers. To clean this, I use an approach similar to how I separated the values into columns, with extra steps.

First, I delete the values from the “name” column by setting them to NA. Then, I shift the values of the columns one column to the left (for example, the values of the “salary” column into the “name” column). Finally, I set the “dept” column to NA, split the data in the “start_date” column, and fill the second part into the “dept” column.

# delete values of "name" column


input_df$name <- NA
# shift data to the column on the left
input_df <- input_df %>%
mutate(
name = salary,
salary = start_date,
start_date = dept
)


# set value of column "dept" to NA
input_df$dept <- NA

# split the start_date column and fill in the dept column
input_df <- input_df %>%
  separate(start_date, into = c("start_date", "dept"),
           sep = "\\s+", fill = "right", extra = "merge")

Figure 21. Final clean input_df

3.3. Mean salary


The mean is the sum of the observations divided by the number of observations, i.e., the average.


Before finding the mean of the salary column, we have to convert its datatype to numeric (double) with the as.numeric() function, and then use the mean() function. I also assume that the currency is pounds (£).

# convert "salary" column format to numeric


input_df$salary <- as.numeric(input_df$salary)

# mean of the "salary" column


salary_mean <- mean(input_df$salary)

[1] 656.8813
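The mean() call is equivalent to dividing the sum by the count; a quick base-R check on made-up weekly salaries (hypothetical values, not the coursework figures):

```r
# hypothetical weekly salaries
salaries <- c(515.50, 628.05, 720.00)

manual_mean  <- sum(salaries) / length(salaries)  # sum over count
builtin_mean <- mean(salaries)
```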

To better understand the distribution of the salary between the departments, I also
calculate the mean salary by department and visualise it.

# aggregate salary mean by department


mean_by_dept <- input_df %>%
group_by(dept) %>%
summarise(salary = mean(salary, na.rm = TRUE))

# create the bar plot
ggplot(mean_by_dept, aes(x = dept, y = salary, fill = dept)) +
  geom_bar(stat = "identity") +
  labs(title = "Mean Salary by Department",
       x = "Department", y = "Mean Salary")


Figure 22. Mean Salary by Department

Looking at the figure above, the mean salary of the “Finance” department is the highest, significantly higher than that of the “Operations” department.

If we sort the salary column from lowest to highest, we can see that the salaries of employees from the “IT” department are among the lowest, even though IT has the most employee entries in this dataset.

Figure 23. input_df sorted by salary from lowest to highest


Since I have no information about the location of this company or the positions of the employees, I assume the company is located in the UK and the employees are ordinary office workers. After researching typical salaries for each department in the UK, here are some conclusions I can draw.

Firstly, these must be weekly salaries: compared with the monthly figures I found, these values would be far too low, while as daily salaries they would be too high. The only plausible interpretation is a weekly salary.

The salaries of the finance department employees are reasonable based on the information I found [1].

The salary of the HR worker in this dataset is a little lower than the figure I found online, which is £808 [2].

Since there is no clear information about the positions in the “IT” department, I can only assume they are either software developers or web developers. Either way, the employees in this dataset receive a salary lower than the average of both software developers (£1,146) [3] and web developers (£866) [4].

The salaries of the “Operations” department employees are nearly the same as the average weekly salary of an Operations Support Officer [5].

3.4. Median salary


By definition, the median is the middle value of a sorted dataset. If the number of
elements in the dataset is odd, the median is the centre element; if the number is
even, the median equals the average of the two central elements.
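The odd and even cases can be illustrated on two small example vectors:

```r
# odd number of values: the median is the middle element
median(c(100, 300, 500))        # 300

# even number of values: the median is the average of the two central elements
median(c(100, 300, 500, 700))   # (300 + 500) / 2 = 400
```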


To find the median of the salary column, I use the median() function. Normally, the
values would need to be sorted first to locate the middle element, but the median()
function in R handles the ordering internally and returns the median directly.

# finding median of the "salary" column


salary_median <- median(input_df$salary)

[1] 628.05

3.6. Mode of salary


The mode is the value with the highest frequency in a given dataset. In this dataset,
however, every value in the “salary” column occurs exactly once, as there are no
duplicate values. So, the “salary” column has no mode.
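For a column that did contain repeated values, the mode would have to be computed by hand, since R has no built-in statistical mode function (the base mode() function returns an object’s storage type instead). A small sketch:

```r
# R's mode() returns the storage type, so compute the statistical mode manually
stat_mode <- function(x) {
  freq <- table(x)
  as.numeric(names(freq)[freq == max(freq)])
}

stat_mode(c(600, 620, 620, 700))   # 620

# for the "salary" column every value appears once, so every value ties for
# "most frequent" and no meaningful mode exists
```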

3.7. Variance of salary


The variance is a numerical measure of how the data values are dispersed around the
mean. It is based on the squared differences between each value and the mean; R’s
var() function returns the sample variance, which divides their sum by n − 1.

In R, we can find the variance with the built-in var() function.

# variance of the "salary" column


salary_variance <- var(input_df$salary)

[1] 10621.25
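This result can be verified against the sample variance formula directly:

```r
# verify var() against the sample variance formula: sum((x - mean)^2) / (n - 1)
x <- input_df$salary
manual_variance <- sum((x - mean(x))^2) / (length(x) - 1)
all.equal(manual_variance, var(x))   # TRUE
```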

3.8. Standard deviation of salary


The standard deviation is the measure of the dispersion of the values. It can be defined
as the square root of variance.


The standard deviation can be computed using the sd() function in R.

# standard deviation of the "salary" column


salary_sd <- sd(input_df$salary)

[1] 103.0595
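Since the standard deviation is the square root of the variance, the two results can be cross-checked against each other:

```r
# sd() is the square root of var(): sqrt(10621.25) is approximately 103.0595
all.equal(sd(input_df$salary), sqrt(var(input_df$salary)))   # TRUE
```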

3.9. Descriptive Analysis


From the metrics that have been calculated, here are some descriptive analyses that
can be made:

1. The mean salary in the dataset is £656.88 per week. Among the
departments, the “Finance” department has the highest average salary, while “IT”
department salaries are generally on the lower side.
2. The median salary is £628.05, indicating that half of the employees earn less than
this amount and half earn more.
3. The variance is 10,621.25 and the standard deviation is £103.06. This shows a
moderate spread of salaries around the mean, suggesting some variability but
not extreme differences.
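Most of these statistics can also be cross-checked in a single call with summary(), which reports the minimum, quartiles, median, mean, and maximum of a numeric column:

```r
# one-call cross-check of the descriptive statistics
summary(input_df$salary)
```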
4. starwars
4.2. Access file
To access the “starwars” dataset, it is essential to import the ‘dplyr’ library. You
must install the library first if you have not done so already.

# install the library if needed


install.packages('dplyr')
# import the library
library(dplyr)

# view the dataset


View(starwars)


Figure 21. starwars dataset

The ‘starwars’ dataset contains 87 rows and 14 columns.

4.3. Clean and normalise


When looking at the first few rows of the dataset, it is noticeable that there are missing
values in the dataset. So, I will check the total missing values of each column first.

# check for NA values in each column


colSums(is.na(starwars))

Figure 22. Missing values in ‘starwars’ dataset

As can be seen in the figure above, there are quite a lot of missing values in this
dataset, and that count does not yet include incorrectly entered values. Looking
through the whole dataset, I noticed that there are also “none” and “unknown” values.


Figure 23. “none” values
Figure 24. “unknown” values

So, to make normalising data easier, I changed all these values to NA.

# replace "none" and "unknown" values to NA


starwars_df <- starwars %>%
mutate(across(where(is.character), ~na_if(.x, "none"))) %>%
mutate(across(where(is.character), ~na_if(.x, "unknown")))

# check for NA values after replacement


colSums(is.na(starwars_df))

Figure 25. Number of missing values after replacement to NA

4.4. Actors without black eyes and taller than 150 cm


Even though the task was to extract actors who are taller than 150 m, in reality there
is no actor taller than 150 m. So, for the tasks related to the “starwars” data frame, I
will use centimetres instead.

To extract the data about actors whose eyes are not black and who are taller than 150
cm, “height” and “eye_color” are the only columns I need to use. So, to save time on
cleaning, I will clean only these two columns rather than the whole dataset.


4.4.1. Normalise the “height” column


I extracted the rows with NA values in the “height” column using the filter()
function; there are only 6 missing values. I decided to fill in the missing values with
the heights of the corresponding Star Wars characters found on SWRPGGM and
Wookieepedia.

# show NAs in the height column

filter(starwars_df, is.na(height))

# fill the missing values in height with information found on
# https://swrpggm.com/ and https://starwars.fandom.com/
starwars_df <- starwars_df %>%
mutate(height = replace(height, name == "Finn", 178)) %>%
mutate(height = replace(height, name == "Rey", 170)) %>%
mutate(height = replace(height, name == "Poe Dameron", 172)) %>%
mutate(height = replace(height, name == "BB8", 67)) %>%
mutate(height = replace(height, name == "Captain Phasma", 200))

Figure 26. Missing values in “height” column

4.4.2. Extract actors taller than 150 cm and without black eyes
To extract actors taller than 150 cm and without black eyes, firstly, I extract the actors
that are taller than 150 cm as a new data frame named “starwars_150” using the
filter() function, and then extract actors whose eye colour is not black from that
data frame.


# extract actors taller than 150 cm

starwars_150 <- filter(starwars_df, height > 150)

Figure 27. Actors who are taller than 150 cm

As a result, there are 73 actors taller than 150 cm, with the shortest among them
being 157 cm.

# no black eye_color
non_black_150 <- filter(starwars_150, eye_color != "black")

Figure 28. Actors whose eyes are not black and who are taller than 150 cm

The “non_black_150” data frame stores information about actors who are taller than
150 cm and do not have black eyes. 63 out of the 87 actors meet the
criteria.
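As a side note, the same result can be obtained in a single step, since filter() combines several conditions with a logical AND; a sketch assuming the cleaned starwars_df from above:

```r
# equivalent one-step filter: comma-separated conditions are ANDed together
non_black_150 <- filter(starwars_df, height > 150, eye_color != "black")
nrow(non_black_150)   # 63
```

The two-step version was used above so that the intermediate result could be inspected.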

4.4.3. Distribution of actors’ height

# distribution of actors taller than 150 cm vs all actors


ggplot() +
geom_histogram(data = starwars_df, aes(x = height), fill = 'blue',
alpha = 0.5, bins = 30) +
geom_histogram(data = starwars_150, aes(x = height), fill = 'red',
alpha = 0.5, bins = 30) +
labs(title = 'Height Distribution', x = 'Height', y = 'Count')


Figure 29. Height distribution of actors taller than 150 cm vs all actors

Looking at the distribution histogram above, it is noticeable that most actors are taller
than 150 cm. The percentage of tall actors among all actors is also calculated.

# Ensure correct calculation


total_actors <- nrow(starwars_df)
tall_actors <- nrow(starwars_150)

# Display the number of actors


print(paste("Total actors:", total_actors))
print(paste("Tall actors:", tall_actors))

# Correct percentage calculation


percentage_tall_actors <- (tall_actors / total_actors) * 100
print(paste("Percentage of actors taller than 150 cm:",
round(percentage_tall_actors, 2), "%"))

[1] "Total actors: 87"


[1] "Tall actors: 73"
[1] "Percentage of actors taller than 150 cm: 83.91 %"


So, out of 87 actors, nearly 84% are taller than 150 cm.

4.4.4. Distribution of actors’ eye colour


First, I create another data frame that stores the actors whose eyes are not
black.

starwars_nonblack <- filter(starwars_df, eye_color != "black")

Next, I plot a bar chart to see the distribution of eye colours and to compare all
actors with the actors who do not have black eyes.

ggplot() +
geom_bar(data = starwars_df, aes(x = eye_color), fill = 'blue', alpha =
0.5, position = 'dodge') +
geom_bar(data = starwars_nonblack, aes(x = eye_color), fill = 'red',
alpha = 0.5, position = 'dodge') +
labs(title = 'Eye Color Distribution', x = 'Eye Color', y = 'Count')

Figure 30. Eye Color Distribution


# Ensure correct calculation


total_actors <- nrow(starwars_df)
non_black_actors <- nrow(starwars_nonblack)

# Display number of actors


print(paste("Total actors:", total_actors))
print(paste("Actors without black eyes:", non_black_actors))

# Correct percentage calculation


percentage_non_black_actors <- (non_black_actors / total_actors) * 100
print(paste("Percentage of actors without black eyes:",
round(percentage_non_black_actors, 2), "%"))

[1] "Total actors: 87"


[1] "Actors without black eyes: 74"
[1] "Percentage of actors without black eyes: 85.06 %"

As illustrated in figure 30, even with the NA values included, most of the actors do not
have black eyes: about 85% of them.

4.4.5. Distribution of tall actors without black eyes

# distribution of actors taller than 150 cm without black eyes vs all actors

ggplot() +
geom_histogram(data = starwars_df, aes(x = height), fill = 'blue',
alpha = 0.5, bins = 30) +
geom_histogram(data = non_black_150, aes(x = height), fill = 'red',
alpha = 0.5, bins = 30) +
labs(title = 'Tall actors without black eyes Distribution', x =
'Height', y = 'Count')


Figure 31. Tall Actors without Black Eyes Distribution

# Ensure correct calculation


total_actors <- nrow(starwars_df)
non_black_tall_actors <- nrow(non_black_150)

# Display number of actors


print(paste("Total actors:", total_actors))
print(paste("Tall actors:", non_black_tall_actors))

# Correct percentage calculation


percentage_non_black_tall_actors <- (non_black_tall_actors /
total_actors) * 100
print(paste("Percentage of actors taller than 150 cm and without black
eyes:", round(percentage_non_black_tall_actors, 2), "%"))

[1] "Total actors: 87"


[1] "Tall actors: 63"
[1] "Percentage of actors taller than 150 cm and without black eyes:
72.41 %"


Although most of the actors are taller than 150 cm, only about 72% of the actors
are both taller than 150 cm and without black eyes.

4.5. BMI
Body Mass Index (BMI) is a numerical value derived from a person’s mass (weight) and
height. It is used to identify whether a person is underweight, normal weight,
overweight, or obese.

The formula to calculate BMI is:

BMI = mass (kg) / height (m)^2

Since the “height” column is recorded in centimetres, the height is divided by 100 to
convert it to metres before squaring.

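As a worked example (the values 180 cm and 80 kg below are chosen purely for illustration):

```r
# BMI for a hypothetical actor of 180 cm and 80 kg
round(80 / ((180 / 100) ^ 2), 2)   # 24.69
```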
4.5.1. Normalise the “mass” column


Before calculating the BMI of the actors, we need to clean and fill the missing values
in the “mass” column. In total, there are 27 actors with missing values in the “mass”
column.

# Show NA values in the "mass" column


View(filter(starwars_df, is.na(mass)))


Figure 32. Actors with missing values in the “mass” column

To fill in the missing values, I once again looked up the weights of the actors. As a
result, I could only find the weights of 6 actors, so I dropped the remaining NA values.

# fill in values of the "mass" column


starwars_df <- starwars_df %>%
mutate(mass = replace(mass, name == "Saesee Tiin", 78)) %>%
mutate(mass = replace(mass, name == "Finn", 73)) %>%
mutate(mass = replace(mass, name == "Rey", 54)) %>%
mutate(mass = replace(mass, name == "Poe Dameron", 80)) %>%
mutate(mass = replace(mass, name == "BB8", 18)) %>%
mutate(mass = replace(mass, name == "Captain Phasma", 76))

After dropping the remaining NA values, 65 of the original 87 actors are left. Now
that we have both the mass and the height, we can calculate the BMI of the actors.

4.5.2. BMI of the actors


To make further analysis easier, I will store the “height” and “mass” columns in
another data frame named “starwars_bmi”.


# store data about height and mass in "starwars_bmi" df


starwars_bmi <- starwars_df %>%
select(height, mass)

# drop_na() comes from the tidyr package
library(tidyr)

# drop NA values in the height column

starwars_bmi <- starwars_bmi %>%
drop_na(height)

# drop NA values in the mass column

starwars_bmi <- starwars_bmi %>%
drop_na(mass)

Then calculate the BMI column.

# add bmi column


starwars_bmi <- starwars_bmi %>%
mutate(bmi = round(mass / ((height / 100) ^ 2), 2))

mutate() function is used to create a new column in the data frame.

round(..., 2) function rounds the BMI values to two decimal places.

4.5.3. Height vs BMI


To have a look at the distribution of the actors’ BMI, I use a scatter plot to visualise the
“BMI” column against the “height” column.

library(ggplot2)

# the "bmi" column lives in starwars_bmi, not starwars_df
ggplot(starwars_bmi,
aes(x = height,
y = bmi)) +
geom_point() +
labs(title = "Height vs BMI",
x = "Height",
y = "BMI")


Figure 33. Scatter plot of BMI against height

Looking at the plot above, there is an actor with a strangely high BMI of more than
400. When I looked through the data frame again, the actor with such a high BMI
turned out to be “Jabba Desilijic Tiure”, whose height is just 175 cm but whose mass is
1,358 kg. This actor is not a human but a member of the Hutt species, so it is
understandable that he differs from the other actors.

4.5.4. Mass vs BMI

# again, the "bmi" column lives in starwars_bmi
ggplot(starwars_bmi,
aes(x = mass,
y = bmi)) +
geom_point() +
labs(title = "Mass vs BMI",
x = "Mass",
y = "BMI")


Figure 34. Scatter plot of Mass vs BMI

In figure 34, it is noticeable that the lower the mass, the lower the BMI. This is
explainable: since BMI is calculated as mass / (height/100)^2, it is directly
proportional to mass. To confirm this, I plotted a correlation matrix between the
“height”, “mass” and “bmi” columns.

# melt() comes from the reshape2 package
library(reshape2)

# Convert data to numeric

starwars_bmi <- as.data.frame(lapply(starwars_bmi, as.numeric))

# Calculate the correlation matrix

cor_matrix <- cor(starwars_bmi[, c("mass", "height", "bmi")])

# Convert the correlation matrix to long format

cor_long <- melt(cor_matrix)

# Plot the correlation matrix using ggplot2

ggplot(cor_long, aes(x = Var1, y = Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "blue", mid = "white", high = "red",
midpoint = 0, limits = c(-1, 1),
guide = guide_colorbar(title = "Correlation")) +
labs(title = "Correlation Matrix", x = "", y = "") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
geom_text(aes(label = round(value, 2)), color = "black", size = 3)

Figure 35. Correlation Matrix between mass, height, and bmi

The correlation between “bmi” and “mass” is very high (nearly 1). This means that as
mass increases, BMI increases almost proportionally. On the other hand, the
correlation between “bmi” and “height” is relatively weak, which means an increase in
height does not significantly affect the BMI in this dataset.


5. References
[1] (No date) Finance officer salaries - check average finance administrator salary rate
on Jooble. Available at: https://uk.jooble.org/salary/finance-officer#weekly (Accessed:
11 August 2024).

[2] (No date) HR salaries - check average HR executive salary rate on Jooble. Available
at: https://uk.jooble.org/salary/HR#weekly (Accessed: 11 August 2024).

[3] (No date) Software developer salaries - check average developer salary rate on
Jooble. Available at: https://uk.jooble.org/salary/software-developer#weekly
(Accessed: 11 August 2024).

[4] (No date) Web developer salaries - check average developer salary rate on Jooble.
Available at: https://uk.jooble.org/salary/web-developer#weekly (Accessed: 11 August
2024).

[5] (No date) Operations support officer salaries - check average flight operations
officer salary rate on Jooble. Available at:
https://uk.jooble.org/salary/operations-support-officer#weekly (Accessed: 11 August
2024).

[6] Star Wars D6 the role playing game (2024) SWRPGGM. Available at:
https://swrpggm.com/ (Accessed: 12 August 2024).

[7] Wookieepedia (no date) Fandom. Available at: https://starwars.fandom.com/
(Accessed: 12 August 2024).
