
COMP1810 - Data and Web Analytics


001306081 COMP1810 - Data and Web Analytics

University of Greenwich ID Number: 001306081

FPT Student ID Number: GCD220190

Module Code: COMP1810

Module Assessment Title: Coursework

Lecturer Name: Tran Trong Minh

Submission Date: 13.08.2024

Number of words: 3782 (excluding code fields)


Table of Contents
1. Introduction
2. e-shop clothing 2008
   2.1. Import file
   2.2. Clean and normalise the data frame
   2.3. Number of orders
   2.4. Correlation Matrix
   2.5. Sales by month
      2.5.1. Total Sales by Day in April
      2.5.2. Sales of each category by month
      2.5.3. Mean Price by Category
      2.5.4. Number of times a category appears on a page
      2.5.5. Total Sales by Colour
      2.5.6. Total Sales by Price
   2.6. Revenue by month
      2.6.1. Revenue of each category by month
   2.7. Conclusion
   2.8. Recommended solutions
      2.8.1. Optimise product placement
      2.8.2. Promote best-selling products
      2.8.3. Enhance website navigation
3. input.csv
   3.1. Import file
   3.2. Clean and normalise the data
   3.3. Mean salary
   3.4. Median salary
   3.5. Mode of salary
   3.6. Variance of salary
   3.7. Standard deviation of salary
   3.8. Descriptive Analysis
4. star wars
   4.1. Access file
   4.2. Clean and normalise
   4.3. Actors without black eyes and taller than 150 cm
      4.3.1. Normalise the “height” column
      4.3.2. Extract actors taller than 150 cm and without black eyes
      4.3.3. Distribution of actors’ height
      4.3.4. Distribution of actors’ eye colour
      4.3.5. Distribution of tall actors without black eyes
   4.4. BMI
      4.4.1. Normalise the “mass” column
      4.4.2. BMI of the actors
      4.4.3. Height vs BMI
      4.4.4. Mass vs BMI
5. References


1. Introduction
In this coursework, I import, clean, normalise, and analyse the given datasets to draw insights from the scenario and provide solutions for an e-shop clothing store, enabling it to optimise its website and increase revenue and traffic.

For the "input.csv" file, I clean and normalise the data and carry out a descriptive analysis of the dataset.

Lastly, from the “starwars” dataset of the ‘dplyr’ library, I clean and normalise the data, extract actors that meet specific requirements, and then calculate and analyse their BMI.

2. e-shop clothing 2008


2.1. Import file
The data I have been provided has the “.xlsx” extension, and base R cannot read Excel files without an additional package, so I import the “readxl” library, which supports reading Excel files.

# install the package if needed
# install.packages("readxl")

# import library for reading Excel files
library(readxl)

# read the Excel file
df <- read_excel('e-shop clothing 2008.xlsx')


Figure 1. Raw df when imported

2.2. Clean and normalise the data frame


As can be seen above, the data frame is not clean and is hard to read and analyse, so I go through several steps to clean the data.

# install packages if needed
# install.packages('dplyr')
# install.packages('tidyr')

# import necessary libraries for cleaning data
library(dplyr)
library(tidyr)

# split and rename columns
df <- df %>%
  separate(col = colnames(df)[1],
           into = c('year', 'month', 'day', 'click_stream', 'country',
                    'session_id', 'main_category', 'clothing_model', 'colour',
                    'location', 'model_photography', 'price_usd', 'price_2', 'page'),
           sep = ";", fill = "right")

Figure 2. Split df into according columns


First, I import necessary libraries for data manipulation and cleaning, which are ‘dplyr’
and ‘tidyr’. Next, I pass the df through a series of transformations. The %>% operator
(pipe operator) allows chaining operations in a readable way.

col = colnames(df)[1] specifies the column to be split; colnames(df)[1] refers to the first column of the data frame.

into = c(...) defines the names of the new columns that will be created from the
split.

sep = ";" specifies the delimiter used to split the values in the original column.

fill = "right" indicates how to handle missing values: if a row has fewer values than the number of new columns, the remaining columns are padded on the right with NA.
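As a small, hypothetical illustration of this fill behaviour (toy strings in base R rather than tidyr, so the names and values here are made up): a row with fewer fields than target columns ends up padded with NA on the right.

```r
# toy example: the second "row" is missing its last field
raw <- c("2008;4;1", "2008;4")
parts <- strsplit(raw, ";")

# pad each split row on the right with NA up to 3 columns,
# mimicking separate(..., fill = "right")
padded <- t(sapply(parts, function(p) { length(p) <- 3; p }))
colnames(padded) <- c("year", "month", "day")
```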

The data values are in the correct columns now. I renamed the columns based on the description in the “shop clothing infor 2008.txt” file. Previously, there was a column named “order (sells)”; however, it is described as the sequence of clicks during one session, so I renamed it "click_stream". The other columns were renamed for easier understanding (for example, “main_category” instead of “page 1 (main category)”).

Next, I convert the data types to integers for easy calculation and analysis.

# convert data to numeric
df <- as.data.frame(lapply(df, as.integer))

Figure 3. Warning when converting data type

There is a warning that NA values were introduced. The values of the “clothing_model” column cannot be converted to integers since they contain letters, so they are set to NA.


Figure 4. Values of “clothing_model” set to NA

Looking through the data frame and the description file, we can conclude that all the data is from the year 2008. As for the model code, we have no further information about what it means or how it could help our analysis, so I decided to drop the “year” and “clothing_model” columns.

# drop column "year" and "clothing_model"


df <- df[, !(colnames(df) == "year") ]
df <- df[, !(colnames(df) == "clothing_model") ]

Figure 5. Data frame after dropping “year” and “clothing_model” columns

2.3. Number of orders


Another important piece of information is the number of orders, which is not included in the data frame. There is no clear definition of how an order is counted, so I assume that the last click of each session is when the customer places the order. Under this assumption, the row with the highest click-stream value in each session is counted as an order.

# add new column 'product_sold'
# if click_stream is the max of the session, product_sold = 1, else 0
df$click_stream <- as.numeric(df$click_stream)

df <- df %>%
  group_by(session_id) %>%
  mutate(product_sold = ifelse(click_stream == max(click_stream), 1, 0)) %>%
  ungroup()

Figure 6. Added column “product_sold”
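As a sanity check of this "last click = order" logic on a hypothetical mini session log (a base-R equivalent of the dplyr code, with made-up data), only the row carrying each session's maximum click_stream gets flagged:

```r
# hypothetical click log: session 1 has three clicks, session 2 has two
toy <- data.frame(session_id   = c(1, 1, 1, 2, 2),
                  click_stream = c(1, 2, 3, 1, 2))

# per-session maximum, then flag the final click of each session
max_click <- ave(toy$click_stream, toy$session_id, FUN = max)
toy$product_sold <- ifelse(toy$click_stream == max_click, 1, 0)
```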

2.4. Correlation Matrix


I chose to plot a correlation matrix to see whether the “product_sold” column has a strong correlation with any other column.

# draw correlation matrix
library(reshape2)
library(ggplot2)

# calculate the correlation matrix
cor_matrix <- cor(df)

# convert the correlation matrix to long format
cor_long <- reshape2::melt(cor_matrix)

# plot the correlation matrix using ggplot2
ggplot(cor_long, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = 0, limits = c(-1, 1),
                       guide = guide_colorbar(title = "Correlation")) +
  labs(title = "Correlation Matrix", x = "", y = "") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  geom_text(aes(label = round(value, 2)), colour = "black", size = 3)

Figure 7. Correlation Matrix of the columns in the data frame

Apart from the correlation between the “month” and “session_id” columns (which is 0.97), there are no significant correlations between the columns, including the “product_sold” column. As for the “month” and “session_id” correlation, there is no clear explanation of how the two relate or what analysis can be drawn from it.
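For reference, cor() computes Pearson correlations pairwise over the numeric columns; a minimal sketch on made-up vectors (not the shop data) shows the matrix it returns:

```r
# two perfectly linearly related vectors and one noisy vector
x <- c(1, 2, 3, 4, 5)
y <- 2 * x + 1            # exact linear function of x
z <- c(3, 1, 4, 1, 5)     # no strong relationship with x

m <- cor(cbind(x, y, z))  # 3 x 3 symmetric correlation matrix
```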

2.5. Sales by month

To find insights in the data frame, it is essential to calculate and visualise sales by month. The total sales for a month are simply the sum of the orders in that month.


# calculate total sales by month
sales_by_month <- df %>%
  group_by(month) %>%
  summarise(total_sales = sum(product_sold, na.rm = TRUE))

# plot bar chart
library(ggplot2)
ggplot(sales_by_month,
       aes(x = month, y = total_sales, fill = month)) +
  geom_bar(stat = "identity") +
  labs(title = "Total Sales by Months",
       x = "Month", y = "Total Sales") +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

Figure 8. Total Sales by Months

As can be seen in the figure above, the sales in April are the highest. To better understand what could lead to this result, I performed some further analysis and visualisation.


2.5.1. Total Sales by Day in April


After visualising the total sales by month, it is noticeable that the sales in April are much higher than in other months. Therefore, visualising the sales by day in April might yield some insight.

# calculate daily sales in April


april_daily_sales <- df %>%
filter(month == 4) %>%
group_by(day) %>%
summarise(total_sales = sum(product_sold, na.rm = TRUE))

# plot bar chart


ggplot(april_daily_sales,
aes(x = day,
y = total_sales,
fill = day)) +
geom_bar(stat = "identity") +
labs(title = "Daily Sales in April",
x = "Day",
y = "Total Sales")

Figure 9. Daily sales in April


The sales in the first week of April are much higher than on the other days of the month. I could not find any information related to why these sales are so high; I would assume it is because of marketing campaigns at that time.

2.5.2. Sales of each category by month


The distribution of sales might be because one category is more popular than another. In the data frame, the category values are numbers from 1 to 4; for easier understanding and analysis, I changed the numbers to the corresponding values stated in the information file.

# change value of main_category


# 1 = trousers; 2 = skirts; 3 = blouses; 4 = sale
df <- df %>%
mutate(main_category = recode(main_category,
'1' = 'trousers',
'2' = 'skirts',
'3' = 'blouses',
'4' = 'sale'))

Then I calculated the total sales by category by grouping the data frame by “main_category” and summing "product_sold".

# total sales by category
sales_by_category <- df %>%
  group_by(main_category) %>%
  summarise(total_sales = sum(product_sold, na.rm = TRUE))

# plot bar chart


ggplot(sales_by_category,
aes(x = main_category,
y = total_sales,
fill = main_category)) +
geom_bar(stat = "identity") +
labs(title = "Total Sales by Categories",
x = "Category",
y = "Total Sales") +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE))


Figure 10. Total Sales by Categories

Next, I calculated the sales of each category by month by grouping the data frame by “month” and “main_category” and summing "product_sold", then plotting the bar chart.

# calculate sales of each category for each month.


category_sales_by_month <- df %>%
group_by(main_category, month) %>%
summarise(category_sales = sum(product_sold, na.rm = TRUE)) %>%
ungroup()

# plot bar chart
ggplot(category_sales_by_month,
       aes(x = month, y = category_sales, fill = main_category)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Total Sales of Each Category By Month",
       x = "Month", y = "Total Sales") +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

Figure 11. Total Sales of Each Category by Months

According to figures 10 and 11, “trousers” have the highest sales, while the “sale” category is in third place. The reason might lie in the prices of the products in each category, so I will calculate and analyse whether price affects sales.

2.5.3. Mean Price by Category


To calculate the mean price of each category, I group the data frame by category and then compute the mean price using the mean() function. .groups = 'drop' drops all grouping after summarising.

Then, I use ggplot to plot a bar chart with the category on the x-axis and the mean price on the y-axis.

# calculate mean price of each category
mean_price_by_category <- df %>%
  group_by(main_category) %>%
  summarise(mean_price = mean(price_usd), .groups = 'drop')

# plot bar chart


ggplot(mean_price_by_category,
aes(x = main_category,
y = mean_price,
fill = main_category)) +
geom_bar(stat = 'identity') +
labs(title = "Mean Price of Each Category",
x = "Category",
y = "Mean Price")

Figure 12. Mean Price by Category

Looking at the bar chart, we see that the mean prices of trousers and skirts are the highest, which contradicts my expectation that customers purchase them the most because of low prices. To further investigate why “trousers” and "skirts" are purchased the most, I analysed which pages they appear on most often.


2.5.4. Number of times a category appears on a page


I counted the number of times a category appears on a page by grouping the data
frame by “main_category” and “page,” then counting the number of entries.

# calculate the number of times a category appears on a page


category_page_count <- df %>%
group_by(main_category, page) %>%
summarise(count = n(), .groups = 'drop')

# plot a stacked bar chart


ggplot(category_page_count,
aes(x = factor(page),
y = count,
fill = main_category)) +
geom_bar(stat = 'identity') +
labs(title = "Category Count by Page Number",
x = "Page Number",
y = "Count")

Figure 13. Category Count by Page Number


I use a stacked bar chart to better see the distribution of each category on a page. From figure 13, we can see that most of the products appear on the first page; however, nearly half of them are from the “trousers” category, followed by the “skirts” category. With this bar chart, it is understandable why “trousers” have the most sales: they appear mostly on the first page and are spread over the first three pages, which makes them easier for customers to locate.

2.5.5. Total Sales by Colour


To further analyse customers’ behaviour, I summarised and visualised some attributes of the products. This can also help find trends among the customers.

First, I changed the colour code numbers to the actual colour names to make the data easier to understand.

# change numeric values to colour name


# 1 = beige; 2 = black; 3 = blue; 4 = brown; 5 = burgundy; 6 = gray; 7 =
green; 8 = navy; 9 = multi-colour; 10 = olive; 11 = pink; 12 = red; 13 =
violet; 14 = white
df <- df %>%
mutate(colour = recode(colour,
'1' = 'beige', '2' = 'black',
'3' = 'blue', '4' = 'brown',
'5' = 'burgundy', '6' = 'gray',
'7' = 'green', '8' = 'navy',
'9' = 'multi-colour', '10' = 'olive',
'11' = 'pink', '12' = 'red',
'13' = 'violet', '14' = 'white'))

# sales by colour
sales_by_colour <- df %>%
  group_by(colour) %>%
  summarise(colour_sales = sum(product_sold, na.rm = TRUE)) %>%
  ungroup() %>%
  arrange(desc(colour_sales))

# plot chart
ggplot(sales_by_colour,
       aes(x = reorder(colour, -colour_sales),
           y = colour_sales, fill = colour)) +
  geom_bar(stat = 'identity') +
  labs(title = "Total Sales by Colour",
       x = "Colour", y = "Total Sales")

Figure 14. Total Sales by Colour

As can be seen in the chart above, the top 3 most popular colours are black, blue, and
gray, which are neutral colours and can mix and match well with each other.

2.5.6. Total Sales by Price


Next, I calculated the total sales by price to see how people spend their money on clothing. I would expect customers to spend on mid-range-priced clothing rather than cheap clothes, since low-priced clothes might not be of good quality.

# calculate total sales by price
sales_by_price <- df %>%
  group_by(price_usd) %>%
  summarise(price_sales = sum(product_sold, na.rm = TRUE)) %>%
  ungroup()

# plot chart
ggplot(sales_by_price, aes(x = price_usd, y = price_sales)) +
  geom_line(color = "black", linewidth = 1) +
  geom_smooth(method = "loess", color = "red", se = FALSE) +
  geom_point() +
  scale_x_continuous(breaks = sales_by_price$price_usd,
                     labels = sales_by_price$price_usd) +
  labs(title = "Total Sales by Price",
       x = "Price", y = "Total Sales")

Figure 15. Total Sales by Price

As can be seen, the most popular price range is from $28 to $48. I think for most
people, this is a reasonable price range for clothes.


2.6. Revenue by month


The sales in April were the highest, so I assume that the revenue in April is also higher than in other months, and that the “Total Revenue by Month” bar chart will not differ much from the “Total Sales by Months” chart.

The total revenue by month can be found by grouping the data frame by month and summing price multiplied by product_sold.

# calculate revenue by month
rev_by_month <- df %>%
  group_by(month) %>%
  summarise(total_revenue = sum(price_usd * product_sold), .groups = 'drop')

# plot bar chart


ggplot(rev_by_month,
aes(x = month,
y = total_revenue,
fill = month)) +
geom_bar(stat = 'identity') +
labs(title = "Total Revenue by Month",
x = "Month",
y = "Revenue") +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

Figure 16. Total Revenue by Month


2.6.1. Revenue of each category by month

# total revenue by category
rev_by_category <- df %>%
  group_by(main_category) %>%
  summarise(total_revenue = sum(price_usd * product_sold, na.rm = TRUE))
print(rev_by_category)

# plot bar chart


ggplot(rev_by_category,
aes(x = main_category,
y = total_revenue,
fill = main_category)) +
geom_bar(stat = "identity") +
labs(title = "Total Revenue by Categories",
x = "Category",
y = "Total Revenue") +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

Figure 17. Revenue by Categories


Looking at figure 17, the revenue of “trousers” and “skirts” is almost the same. This might be because the price of “skirts” is higher than that of “trousers” while a similar number of products were sold.

Then, I calculated and plotted the chart of revenue for each category by month.

# calculate revenue for each category of each month


category_rev_by_month <- df %>%
group_by(main_category, month) %>%
summarise(category_rev = sum(price_usd * product_sold, na.rm = TRUE),
.groups = 'drop')

# plot bar chart


ggplot(category_rev_by_month,
aes(x = month,
y = category_rev,
fill = main_category)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Total Revenue of Each Category By Month",
x = "Month",
y = "Total Revenue") +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

Figure 18. Total Revenue of Each Category by Month


As can be seen in figure 18, in April the revenue of “skirts” is not as high as that of “trousers,” but it is not far behind. In the other months, the revenue of “skirts” is about the same as or higher than that of "trousers." Besides, even though the sales of the “sale” category are decent, its low prices mean this category does not bring much revenue to the clothing store.

2.7. Conclusion
After the analysis, I can conclude that the shop earns a good amount of revenue from the “trousers” and “skirts” categories. They have the highest prices and appear most often on the first and second pages of the website, which reach more customers. On the other hand, despite decent sales, the “sale” category did not contribute significantly to the e-shop's revenue due to its low prices.

Additionally, it is clear that neutral colours are the most favoured among customers,
whereas brighter colours are less appealing.

2.8. Recommended solutions

To further boost the revenue and traffic of the e-shop, here are some recommendations that I think would help the e-shop owner optimise the website.

2.8.1. Optimise product placement

Since products on the first page receive the most views and the most sales, the shop could place the items it wants to push, such as the under-performing “sale” category, more prominently on the first pages, while keeping the best-selling “trousers” and “skirts” easy to find. Spreading each category across the first few pages, as is already the case for “trousers,” would help customers locate products more easily.

2.8.2. Promote best-selling products


With the popularity of “trousers” and "skirts," these categories might bring more
revenue with targeted promotions or discounts. Especially during slow months like
August, with the right promotions of these products, the e-shop could maintain sales
momentum and drive further growth. Marketing campaigns that especially highlight
these items might attract more customers, leading to increased sales and revenue.


Besides, clothes in neutral colours sold the most, so the store might focus more on products in these colours. It might also benefit from promoting products in the store's mid-range price band of $28 to $48.

2.8.3. Enhance website navigation


Improving website navigation can make popular products more accessible and increase exposure for less popular items. Testing different layouts and tracking user interactions can provide insights into how customers browse and engage with the site. This data can then be used to optimise the site's design, ensuring that all categories have a fair chance of being noticed and purchased.

3. input.csv
3.1. Import file
First, the data we have been provided is not in the “.csv” format stated in the coursework, but in “.xlsx” instead, so I cannot use read.csv() to read it. To read the data file, we must first import the library that supports reading Excel files, ‘readxl’.

# install the package if not already installed
# install.packages('readxl')

# import package
library(readxl)

Next, we will use the read_excel() function to import the ‘input.xlsx’ file.

input_df <- read_excel('input.xlsx')


View(input_df)


Figure 19. Raw input.xlsx dataset

3.2. Clean and normalise the data


As shown in the figure above, the format is not correct: all the values of the dataset are in the “id” column. We can separate the values into the correct columns with the separate() function from the ‘tidyr’ library, and use the pipe operator (%>%) from ‘dplyr’ to chain the changes to the dataset.

# install the libraries if not installed
# install.packages('dplyr')
# install.packages('tidyr')

# import the libraries (tidyr provides separate())
library(dplyr)
library(tidyr)

# normalise the data
input_df <- read_excel('input.xlsx', col_names = TRUE, na = "NA")
input_df <- input_df %>%
  separate(id, into = c("id", "name", "salary", "start_date", "dept"),
           sep = "\\s+", fill = "right", extra = "merge")


Figure 20. input_df after splitting the columns

However, as shown in the figure above, the ID appears to be duplicated, which leaves the values under the wrong headers. To clean this, I use an approach similar to how I separated the values into columns, with extra steps.

First, I delete the values from the “name” column by setting them to NA. Then, I shift the values of the columns one column to the left (for example, the values of the “salary” column into the “name” column). Finally, I set the “dept” column to NA, split the data in the “start_date” column, and fill the second part into the “dept” column.

# delete values of "name" column


input_df$name <- NA
# shift data to the column on the left
input_df <- input_df %>%
mutate(
name = salary,
salary = start_date,
start_date = dept
)


# set value of column "dept" to NA
input_df$dept <- NA

# split the start_date column and fill in the dept column
input_df <- input_df %>%
  separate(start_date, into = c("start_date", "dept"),
           sep = "\\s+", fill = "right", extra = "merge")

Figure 21. Final clean input_df

3.3. Mean salary


The mean is the sum of the observations divided by the number of observations, i.e., the average.


Before finding the mean of the salary column, we have to convert its datatype to numeric (double) with the as.numeric() function, and then use the mean() function. I also assume that the currency is pounds (£).

# convert "salary" column format to numeric


input_df$salary <- as.numeric(input_df$salary)

# mean of the "salary" column


salary_mean <- mean(input_df$salary)

[1] 656.8813
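The mean() call is equivalent to dividing the sum by the count; a quick base-R check on made-up weekly salaries (hypothetical values, not the coursework figures):

```r
# hypothetical weekly salaries
salaries <- c(515.50, 628.05, 720.00)

manual_mean  <- sum(salaries) / length(salaries)  # sum over count
builtin_mean <- mean(salaries)
```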

To better understand the distribution of the salary between the departments, I also
calculate the mean salary by department and visualise it.

# aggregate salary mean by department


mean_by_dept <- input_df %>%
group_by(dept) %>%
summarise(salary = mean(salary, na.rm = TRUE))

# create the bar plot
ggplot(mean_by_dept, aes(x = dept, y = salary, fill = dept)) +
  geom_bar(stat = "identity") +
  labs(title = "Mean Salary by Department",
       x = "Department", y = "Mean Salary")


Figure 22. Mean Salary by Department

Looking at the figure above, the mean salary of the “Finance” department is the highest, significantly higher than that of the “Operations” department.

If we sort the salary column from lowest to highest, we can see that the salaries of employees from the “IT” department are among the lowest, even though IT has the most employee entries in this dataset.

Figure 23. input_df sorted by salary from lowest to highest


Since I have no information about the location of this company or the positions of the employees, I assume the company is located in the UK and the employees are ordinary office workers. After researching typical salaries for each department in the UK, here are some conclusions I can draw.

Firstly, these must be weekly salaries: compared with the monthly figures I found, these values would be far too low, while as daily salaries they would be too high. The only plausible interpretation is a weekly salary.

The salaries of the finance department employees are reasonable based on the information I found [1].

The salary of the HR worker in this dataset is a little lower than the figure I found online, which is £808 [2].

Since there is no clear information about the positions in the “IT” department, I can only assume they are either software developers or web developers. Either way, the employees in this dataset receive a salary lower than the average of both software developers (£1,146) [3] and web developers (£866) [4].

The salaries of the “Operations” department employees are nearly the same as the average weekly salary of an Operations Support Officer [5].

3.4. Median salary


By definition, the median is the middle value of a sorted dataset. If the number of
elements in the dataset is odd, the median is the centre element; if the number is
even, the median equals the average of the two central elements.
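The odd and even cases can be illustrated on two small example vectors:

```r
# odd number of values: the median is the middle element
median(c(100, 300, 500))        # 300

# even number of values: the median is the average of the two central elements
median(c(100, 300, 500, 700))   # (300 + 500) / 2 = 400
```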


To find the median of the salary column, I use the median() function. Normally, the
values would need to be sorted first to locate the middle element, but the median()
function in R handles the ordering internally and returns the median directly.

# finding median of the "salary" column


salary_median <- median(input_df$salary)

[1] 628.05

3.6. Mode of salary


The mode is the value with the highest frequency in a given dataset. In this dataset,
however, every value in the “salary” column occurs exactly once, as there are no
duplicate values. So, the “salary” column has no mode.
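For a column that did contain repeated values, the mode would have to be computed by hand, since R has no built-in statistical mode function (the base mode() function returns an object’s storage type instead). A small sketch:

```r
# R's mode() returns the storage type, so compute the statistical mode manually
stat_mode <- function(x) {
  freq <- table(x)
  as.numeric(names(freq)[freq == max(freq)])
}

stat_mode(c(600, 620, 620, 700))   # 620

# for the "salary" column every value appears once, so every value ties for
# "most frequent" and no meaningful mode exists
```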

3.7. Variance of salary


The variance is a numerical measure of how the data values are dispersed around the
mean. It is based on the squared differences between each value and the mean; R’s
var() function returns the sample variance, which divides their sum by n − 1.

In R, we can find the variance with the built-in var() function.

# variance of the "salary" column


salary_variance <- var(input_df$salary)

[1] 10621.25
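This result can be verified against the sample variance formula directly:

```r
# verify var() against the sample variance formula: sum((x - mean)^2) / (n - 1)
x <- input_df$salary
manual_variance <- sum((x - mean(x))^2) / (length(x) - 1)
all.equal(manual_variance, var(x))   # TRUE
```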

3.8. Standard deviation of salary


The standard deviation is the measure of the dispersion of the values. It can be defined
as the square root of variance.


The standard deviation can be computed using the sd() function in R.

# standard deviation of the "salary" column


salary_sd <- sd(input_df$salary)

[1] 103.0595
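Since the standard deviation is the square root of the variance, the two results can be cross-checked against each other:

```r
# sd() is the square root of var(): sqrt(10621.25) is approximately 103.0595
all.equal(sd(input_df$salary), sqrt(var(input_df$salary)))   # TRUE
```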

3.9. Descriptive Analysis


From the metrics that have been calculated, here are some descriptive analyses that
can be made:

1. The mean salary in the dataset is £656.88 per week. Among the
departments, the “Finance” department has the highest average salary, while “IT”
department salaries are generally on the lower side.
2. The median salary is £628.05, indicating that half of the employees earn less than
this amount and half earn more.
3. The variance is 10,621.25 and the standard deviation is £103.06. This shows a
moderate spread of salaries around the mean, suggesting some variability but
not extreme differences.
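Most of these statistics can also be cross-checked in a single call with summary(), which reports the minimum, quartiles, median, mean, and maximum of a numeric column:

```r
# one-call cross-check of the descriptive statistics
summary(input_df$salary)
```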
4. starwars
4.2. Access file
To access the “starwars” dataset, it is essential to import the ‘dplyr’ library. You
must install the library first if you have not done so already.

# install the library if needed


install.packages('dplyr')
# import the library
library(dplyr)

# view the dataset


View(starwars)


Figure 21. starwars dataset

The ‘starwars’ dataset contains 87 rows and 14 columns.

4.3. Clean and normalise


When looking at the first few rows of the dataset, it is noticeable that there are missing
values in the dataset. So, I will check the total missing values of each column first.

# check for NA values in each column


colSums(is.na(starwars))

Figure 22. Missing values in ‘starwars’ dataset

As can be seen in the figure above, there are quite a lot of missing values in this
dataset, and that count does not yet include incorrectly entered values. Looking
through the whole dataset, I noticed that there are also “none” and “unknown” values.


Figure 23. “none” values
Figure 24. “unknown” values

So, to make normalising data easier, I changed all these values to NA.

# replace "none" and "unknown" values to NA


starwars_df <- starwars %>%
mutate(across(where(is.character), ~na_if(.x, "none"))) %>%
mutate(across(where(is.character), ~na_if(.x, "unknown")))

# check for NA values after replacement


colSums(is.na(starwars_df))

Figure 25. Number of missing values after replacement to NA

4.4. Actors without black eyes and taller than 150 cm


Even though the task was to extract actors who are taller than 150 m, in reality there
is no actor taller than 150 m. So, for the tasks related to the “starwars” data frame, I
will use centimetres instead.

To extract the data about actors whose eyes are not black and who are taller than 150
cm, “height” and “eye_color” are the only columns I need to use. So, to save time on
cleaning, I will clean only these two columns rather than the whole dataset.


4.4.1. Normalise the “height” column


I extracted the rows with NA values in the “height” column using the filter()
function; there are only 6 missing values. I decided to fill in the missing values with
the heights of the corresponding Star Wars characters found on SWRPGGM and
Wookieepedia.

# show NAs in the height column

filter(starwars_df, is.na(height))

# fill the missing values in height with information found on
# https://swrpggm.com/ and https://starwars.fandom.com/
starwars_df <- starwars_df %>%
mutate(height = replace(height, name == "Finn", 178)) %>%
mutate(height = replace(height, name == "Rey", 170)) %>%
mutate(height = replace(height, name == "Poe Dameron", 172)) %>%
mutate(height = replace(height, name == "BB8", 67)) %>%
mutate(height = replace(height, name == "Captain Phasma", 200))

Figure 26. Missing values in “height” column

4.4.2. Extract actors taller than 150 cm and without black eyes
To extract actors taller than 150 cm and without black eyes, firstly, I extract the actors
that are taller than 150 cm as a new data frame named “starwars_150” using the
filter() function, and then extract actors whose eye colour is not black from that
data frame.


# extract actors taller than 150 cm

starwars_150 <- filter(starwars_df, height > 150)

Figure 27. Actors who are taller than 150 cm

As a result, there are 73 actors taller than 150 cm, with the shortest among them
being 157 cm.

# no black eye_color
non_black_150 <- filter(starwars_150, eye_color != "black")

Figure 28. Actors whose eyes are not black and who are taller than 150 cm

The “non_black_150” data frame stores information about actors who are taller than
150 cm and do not have black eyes. 63 out of the 87 actors meet the
criteria.
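As a side note, the same result can be obtained in a single step, since filter() combines several conditions with a logical AND; a sketch assuming the cleaned starwars_df from above:

```r
# equivalent one-step filter: comma-separated conditions are ANDed together
non_black_150 <- filter(starwars_df, height > 150, eye_color != "black")
nrow(non_black_150)   # 63
```

The two-step version was used above so that the intermediate result could be inspected.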

4.4.3. Distribution of actors’ height

# distribution of actors taller than 150 cm vs all actors


ggplot() +
geom_histogram(data = starwars_df, aes(x = height), fill = 'blue',
alpha = 0.5, bins = 30) +
geom_histogram(data = starwars_150, aes(x = height), fill = 'red',
alpha = 0.5, bins = 30) +
labs(title = 'Height Distribution', x = 'Height', y = 'Count')


Figure 29. Height distribution of actors taller than 150 cm vs all actors

Looking at the distribution histogram above, it is noticeable that most actors are taller
than 150 cm. The percentage of tall actors among all actors is also calculated.

# Ensure correct calculation


total_actors <- nrow(starwars_df)
tall_actors <- nrow(starwars_150)

# Display the number of actors


print(paste("Total actors:", total_actors))
print(paste("Tall actors:", tall_actors))

# Correct percentage calculation


percentage_tall_actors <- (tall_actors / total_actors) * 100
print(paste("Percentage of actors taller than 150 cm:",
round(percentage_tall_actors, 2), "%"))

[1] "Total actors: 87"


[1] "Tall actors: 73"
[1] "Percentage of actors taller than 150 cm: 83.91 %"


So, out of 87 actors, nearly 84% are taller than 150 cm.

4.4.4. Distribution of actors’ eye colour


First, I create another data frame that stores the actors whose eyes are not
black.

starwars_nonblack <- filter(starwars_df, eye_color != "black")

Next, I plot a bar chart to see the distribution of eye colours and to compare all
actors with the actors who do not have black eyes.

ggplot() +
geom_bar(data = starwars_df, aes(x = eye_color), fill = 'blue', alpha =
0.5, position = 'dodge') +
geom_bar(data = starwars_nonblack, aes(x = eye_color), fill = 'red',
alpha = 0.5, position = 'dodge') +
labs(title = 'Eye Color Distribution', x = 'Eye Color', y = 'Count')

Figure 30. Eye Color Distribution


# Ensure correct calculation


total_actors <- nrow(starwars_df)
non_black_actors <- nrow(starwars_nonblack)

# Display number of actors


print(paste("Total actors:", total_actors))
print(paste("Actors without black eyes:", non_black_actors))

# Correct percentage calculation


percentage_non_black_actors <- (non_black_actors / total_actors) * 100
print(paste("Percentage of actors without black eyes:",
round(percentage_non_black_actors, 2), "%"))

[1] "Total actors: 87"


[1] "Actors without black eyes: 74"
[1] "Percentage of actors without black eyes: 85.06 %"

As illustrated in figure 30, even with the NA values included, most of the actors do not
have black eyes: about 85% of them.

4.4.5. Distribution of tall actors without black eyes

# distribution of actors taller than 150 cm without black eyes vs all actors

ggplot() +
geom_histogram(data = starwars_df, aes(x = height), fill = 'blue',
alpha = 0.5, bins = 30) +
geom_histogram(data = non_black_150, aes(x = height), fill = 'red',
alpha = 0.5, bins = 30) +
labs(title = 'Tall actors without black eyes Distribution', x =
'Height', y = 'Count')


Figure 31. Tall Actors without Black Eyes Distribution

# Ensure correct calculation


total_actors <- nrow(starwars_df)
non_black_tall_actors <- nrow(non_black_150)

# Display number of actors


print(paste("Total actors:", total_actors))
print(paste("Tall actors:", non_black_tall_actors))

# Correct percentage calculation


percentage_non_black_tall_actors <- (non_black_tall_actors /
total_actors) * 100
print(paste("Percentage of actors taller than 150 cm and without black
eyes:", round(percentage_non_black_tall_actors, 2), "%"))

[1] "Total actors: 87"


[1] "Tall actors: 63"
[1] "Percentage of actors taller than 150 cm and without black eyes:
72.41 %"


Although most of the actors are taller than 150 cm, only about 72% of the actors
are both taller than 150 cm and without black eyes.

4.5. BMI
Body Mass Index (BMI) is a numerical value derived from a person’s mass (weight) and
height. It is used to identify whether a person is underweight, normal weight,
overweight, or obese.

The formula to calculate BMI is:

BMI = mass (kg) / height (m)^2

Since the “height” column is recorded in centimetres, the height is divided by 100 to
convert it to metres before squaring.

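As a worked example (the values 180 cm and 80 kg below are chosen purely for illustration):

```r
# BMI for a hypothetical actor of 180 cm and 80 kg
round(80 / ((180 / 100) ^ 2), 2)   # 24.69
```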
4.5.1. Normalise the “mass” column


Before calculating the BMI of the actors, we need to clean and fill the missing values
in the “mass” column. In total, there are 27 actors with missing values in the “mass”
column.

# Show NA values in the "mass" column


View(filter(starwars_df, is.na(mass)))


Figure 32. Actors with missing values in the “mass” column

To fill in the missing values, I once again looked up the weights of the actors. As a
result, I could only find the weights of 6 actors, so I dropped the remaining NA values.

# fill in values of the "mass" column


starwars_df <- starwars_df %>%
mutate(mass = replace(mass, name == "Saesee Tiin", 78)) %>%
mutate(mass = replace(mass, name == "Finn", 73)) %>%
mutate(mass = replace(mass, name == "Rey", 54)) %>%
mutate(mass = replace(mass, name == "Poe Dameron", 80)) %>%
mutate(mass = replace(mass, name == "BB8", 18)) %>%
mutate(mass = replace(mass, name == "Captain Phasma", 76))

After dropping the remaining NA values, 65 of the original 87 actors are left. Now
that we have both the mass and the height, we can calculate the BMI of the actors.

4.5.2. BMI of the actors


To make further analysis easier, I will store the “height” and “mass” columns in
another data frame named “starwars_bmi”.


# store data about height and mass in "starwars_bmi" df


starwars_bmi <- starwars_df %>%
select(height, mass)

# drop_na() comes from the tidyr package
library(tidyr)

# drop NA values in the height column

starwars_bmi <- starwars_bmi %>%
drop_na(height)

# drop NA values in the mass column

starwars_bmi <- starwars_bmi %>%
drop_na(mass)

Then calculate the BMI column.

# add bmi column


starwars_bmi <- starwars_bmi %>%
mutate(bmi = round(mass / ((height / 100) ^ 2), 2))

mutate() function is used to create a new column in the data frame.

round(..., 2) function rounds the BMI values to two decimal places.

4.5.3. Height vs BMI


To have a look at the distribution of the actors’ BMI, I use a scatter plot to visualise the
“BMI” column against the “height” column.

library(ggplot2)

# the "bmi" column lives in starwars_bmi, not starwars_df
ggplot(starwars_bmi,
aes(x = height,
y = bmi)) +
geom_point() +
labs(title = "Height vs BMI",
x = "Height",
y = "BMI")


Figure 33. Scatter plot of BMI against height

Looking at the plot above, there is an actor with a strangely high BMI of more than
400. When I looked through the data frame again, the actor with such a high BMI
turned out to be “Jabba Desilijic Tiure”, whose height is just 175 cm but whose mass is
1,358 kg. This actor is not a human but a member of the Hutt species, so it is
understandable that he differs from the other actors.

4.5.4. Mass vs BMI

# again, the "bmi" column lives in starwars_bmi
ggplot(starwars_bmi,
aes(x = mass,
y = bmi)) +
geom_point() +
labs(title = "Mass vs BMI",
x = "Mass",
y = "BMI")


Figure 34. Scatter plot of Mass vs BMI

In figure 34, it is noticeable that the lower the mass, the lower the BMI. This is
explainable: since BMI is calculated as mass / (height/100)^2, it is directly
proportional to mass. To confirm this, I plotted a correlation matrix between the
“height”, “mass” and “bmi” columns.

# melt() comes from the reshape2 package
library(reshape2)

# Convert data to numeric

starwars_bmi <- as.data.frame(lapply(starwars_bmi, as.numeric))

# Calculate the correlation matrix

cor_matrix <- cor(starwars_bmi[, c("mass", "height", "bmi")])

# Convert the correlation matrix to long format

cor_long <- melt(cor_matrix)

# Plot the correlation matrix using ggplot2

ggplot(cor_long, aes(x = Var1, y = Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "blue", mid = "white", high = "red",
midpoint = 0, limits = c(-1, 1),
guide = guide_colorbar(title = "Correlation")) +
labs(title = "Correlation Matrix", x = "", y = "") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
geom_text(aes(label = round(value, 2)), color = "black", size = 3)

Figure 35. Correlation Matrix between mass, height, and bmi

The correlation between “bmi” and “mass” is very high (nearly 1). This means that as
mass increases, BMI increases almost proportionally. On the other hand, the
correlation between “bmi” and “height” is relatively weak, which means an increase in
height does not significantly affect the BMI in this dataset.


5. References
[1] (No date) Finance officer salaries - check average finance administrator salary rate
on Jooble. Available at: https://uk.jooble.org/salary/finance-officer#weekly (Accessed:
11 August 2024).

[2] (No date) HR salaries - check average HR executive salary rate on Jooble. Available
at: https://uk.jooble.org/salary/HR#weekly (Accessed: 11 August 2024).

[3] (No date) Software developer salaries - check average developer salary rate on
Jooble. Available at: https://uk.jooble.org/salary/software-developer#weekly
(Accessed: 11 August 2024).

[4] (No date) Web developer salaries - check average developer salary rate on Jooble.
Available at: https://uk.jooble.org/salary/web-developer#weekly (Accessed: 11 August
2024).

[5] (No date) Operations support officer salaries - check average flight operations
officer salary rate on Jooble. Available at:
https://uk.jooble.org/salary/operations-support-officer#weekly (Accessed: 11 August
2024).

[6] Star Wars D6 the role playing game (2024) SWRPGGM. Available at:
https://swrpggm.com/ (Accessed: 12 August 2024).

[7] Wookieepedia (no date) Fandom. Available at: https://starwars.fandom.com/
(Accessed: 12 August 2024).
