COMP1810 - Data and Web Analytics
GCD220190
001306081 COMP1810 - Data and Web Analytics
Table of Contents
1. Introduction
2. e-shop clothing 2008
1.1. Import file
1.2. Clean and normalise the data frame
1.3. Number of orders
1.4. Correlation Matrix
1.5. Sales by months
1.5.1. Total Sales by Day in April
1.5.2. Sales of each category by month
1.5.3. Mean Price by Category
1.5.4. Number of times a category appears on a page
1.5.5. Total Sales by Colour
1.5.6. Total Sales by Price
1.6. Revenue by month
1.6.1. Revenue of each category by months
1.7. Conclusion
1.8. Recommended solutions
1.8.1. Optimise product placement
1.8.2. Promote best-selling products
1.8.3. Enhance website navigation
3. input.csv
3.2. Import file
3.3. Clean and normalise the data
3.4. Mean salary
3.5. Median salary
3.6. Mode of salary
3.7. Variance of salary
3.8. Standard deviation of salary
3.9. Descriptive Analysis
4. star wars
4.2. Access file
1. Introduction
In this coursework, I import, clean, normalise, and analyse several datasets. For the
e-shop clothing dataset, the aim is to draw insights from the data and recommend ways
for the shop to increase its revenue and traffic.
For the "input.csv" file, I clean and normalise the dataset and carry out a descriptive
analysis.
Lastly, for the “starwars” dataset from the ‘dplyr’ library, I clean and normalise the data,
extract the actors that meet specific requirements, and then calculate and analyse their BMI.
First, I import the libraries needed for data manipulation and cleaning, ‘dplyr’ and
‘tidyr’. Next, I pass df through a series of transformations. The %>% operator
(pipe operator) chains the operations in a readable way. The arguments used to split
the column are:
into = c(...) defines the names of the new columns that are created from the
split.
sep = ";" specifies the delimiter used to split the values in the original column.
fill = "right" controls how rows with too few values are handled. If a row has
fewer values than the number of new columns, the missing columns on the right
are filled with NA.
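As a minimal sketch of the split described above, assuming the raw file was read into a single semicolon-delimited column, the call could look like the following. The data frame `raw_df`, the column name `raw`, and the three target column names are hypothetical stand-ins, not the actual names in the dataset.

```r
library(dplyr)
library(tidyr)

# Hypothetical one-column data frame; the last row is missing its final value.
raw_df <- data.frame(raw = c("2008;4;1", "2008;4;2", "2008;5"),
                     stringsAsFactors = FALSE)

split_df <- raw_df %>%
  separate(raw,
           into = c("year", "month", "day"),  # names of the new columns
           sep = ";",                          # delimiter in the original column
           fill = "right")                     # pad short rows with NA on the right

print(split_df)
```

The new columns are still character vectors at this point, which is why a type conversion step is needed afterwards.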
The values are now in the correct columns. I renamed the columns based on the
description in the “shop clothing infor 2008.txt” file. There was a column named
“order (sells)”; however, the description file defines it as the sequence of clicks during
one session, so I renamed it to "click_stream". The other columns were renamed to make
them easier to understand (for example, “main_category” instead of “page 1
(main category)”).
Next, I convert the columns to integers to make calculation and analysis easier.
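The renaming and type conversion could be sketched as follows. This is only an illustration under assumptions: `df_raw` and its two columns are a hypothetical miniature of the real data frame, and converting every column with `across(everything(), ...)` is one possible approach rather than the exact code used in this report.

```r
library(dplyr)

# Hypothetical data frame with raw headers and character values.
df_raw <- data.frame(`page 1 (main category)` = c("1", "3"),
                     `order (sells)` = c("1", "2"),
                     check.names = FALSE,
                     stringsAsFactors = FALSE)

df_clean <- df_raw %>%
  rename(main_category = `page 1 (main category)`,
         click_stream = `order (sells)`) %>%
  mutate(across(everything(), as.integer))  # convert every column to integer

str(df_clean)
```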
From the data frame and the description file, all records are from the year 2008, so the
“year” column carries no extra information. The description file also gives no further
information about what the model codes mean or how they could help the analysis, so I
decided to drop the “year” and “clothing_model” columns.
# Flag the row with the highest click_stream value in each session (product_sold = 1)
df$click_stream <- as.numeric(df$click_stream)
df <- df %>%
  group_by(session_id) %>%
  mutate(product_sold = ifelse(click_stream == max(click_stream), 1, 0)) %>%
  ungroup()
Except for the correlation between the “month” and “session_id” columns (which is
0.97), there are no significant correlations between the columns, in particular none
involving the “product_sold” column. As for the correlation between “month” and
“session_id”, one possible explanation is that session IDs are assigned sequentially
over time; beyond that, it does not suggest any useful analysis.
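For reference, a correlation matrix of this kind can be computed with base R's `cor()`. The sketch below uses a small hypothetical data frame (`toy`) rather than the real data, so its values will not match the figures quoted above.

```r
# Hypothetical numeric columns standing in for the cleaned data frame.
toy <- data.frame(month = c(4, 4, 5, 6, 7, 8),
                  session_id = c(10, 20, 35, 50, 61, 80),
                  price_usd = c(38, 28, 48, 33, 57, 43))

# Pearson correlation between every pair of numeric columns, rounded for display.
cor_matrix <- round(cor(toy), 2)
print(cor_matrix)
```

In this toy data the session IDs grow as the months advance, which is enough to produce a strong correlation between the two columns.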
As can be seen in the figure above, sales are highest in April. To better understand
what could lead to this result, I carried out some further analysis and visualisation.
Sales in the first week of April are much higher than on the other days of the month. I
could not find any information explaining why; my assumption is that marketing
campaigns were running at that time.
Then I calculated the total sales by category by grouping the data frame by
“main_category” and summing "product_sold".
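The grouping described above can be sketched as follows. The data frame `toy` and its category labels are hypothetical; only the column names `main_category` and `product_sold` come from this report.

```r
library(dplyr)

# Hypothetical cleaned rows; product_sold is 1 for the last click of a session.
toy <- data.frame(main_category = c("trousers", "trousers", "skirts", "sale", "skirts"),
                  product_sold = c(1, 0, 1, 1, 1))

sales_by_category <- toy %>%
  group_by(main_category) %>%
  summarise(total_sales = sum(product_sold, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(total_sales))  # highest-selling category first

print(sales_by_category)
```

Adding `month` to `group_by()` would give the per-month breakdown in the same way.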
Next, I calculated the sales of each category by month by grouping the data frame by
“month” and “main_category” and summing "product_sold", then plotted a bar chart.
y = "Total Sales") +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE))
According to figure 10 and figure 11, “trousers” has the highest sales, while the “sale”
category is in third place. I think the reason might lie in the price of the products in
each category, so I calculate the mean price per category and analyse whether price
affects sales.
I then use ggplot to plot a bar chart with the category on the x-axis and the mean price
on the y-axis.
group_by(main_category) %>%
summarise(mean_price = mean(price_usd), .groups = 'drop')
Looking at the bar chart, the mean prices of “trousers” and “skirts” are the highest,
contrary to my expectation that customers buy them the most because of a low price.
To investigate further why “trousers” and "skirts" are purchased the most, I analysed
which pages they appear on the most.
I use a stacked bar chart to show the distribution of each category on each page. From
figure 12, we can see that most of the products appear on the first page, and nearly half
of those are from the “trousers” category, followed by the “skirts” category. This chart
helps explain why “trousers” has the most sales: its products appear mostly on the first
page and are spread across the first three pages, which makes them easier for customers
to find.
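A stacked bar chart like the one described above could be built along these lines. This is a sketch under assumptions: the `toy` data frame and the column name `page` are hypothetical, and the real chart was produced from the full cleaned data frame.

```r
library(dplyr)
library(ggplot2)

# Hypothetical rows: which page each viewed product appeared on.
toy <- data.frame(page = c(1, 1, 1, 2, 2, 3),
                  main_category = c("trousers", "trousers", "skirts",
                                    "skirts", "blouses", "sale"))

# Count how many rows fall into each page/category combination.
category_by_page <- count(toy, page, main_category, name = "appearances")

# Stacked bar chart: one bar per page, segments coloured by category.
p <- ggplot(category_by_page,
            aes(x = factor(page), y = appearances, fill = main_category)) +
  geom_col() +
  labs(title = "Category appearances by page", x = "Page", y = "Appearances")
```

`geom_col()` stacks the segments by default, so no extra position argument is needed.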
First, I replace the numeric colour codes with the actual colour names to make the data
easier to understand.
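One simple way to do such a recoding is a named lookup vector, sketched below. The three code-to-name pairs are illustrative assumptions; the full mapping should be taken from the dataset's description file.

```r
# Hypothetical lookup; the real code-to-name mapping comes from the description file.
colour_names <- c("1" = "beige", "2" = "black", "3" = "blue")

toy <- data.frame(colour = c(2, 3, 2, 1))

# Index the lookup by the code (as character) to get the colour name.
toy$colour <- unname(colour_names[as.character(toy$colour)])
print(toy$colour)
```

Any code missing from the lookup would become NA, which makes unmapped values easy to spot.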
# sales by colour
sales_by_colour <- df %>%
  group_by(colour) %>%
  # na.rm = TRUE goes inside sum() so that missing values are ignored
  summarise(colour_sales = sum(product_sold, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(colour_sales))
# plot chart
ggplot(sales_by_colour,
       aes(x = reorder(colour, -colour_sales),
           y = colour_sales,
           fill = colour)) +
  geom_bar(stat = 'identity') +
  labs(title = "Total Sales by Colour",
       x = "Colour",
       y = "Total Sales")
As can be seen in the chart above, the top 3 most popular colours are black, blue, and
gray, which are neutral colours and can mix and match well with each other.
# plot chart
ggplot(sales_by_price, aes(x = price_usd, y = price_sales)) +
  geom_line(color = "black", size = 1) +
  geom_smooth(method = "loess", color = "red", se = FALSE) +
  geom_point() +
  scale_x_continuous(breaks = sales_by_price$price_usd,
                     labels = sales_by_price$price_usd) +
  labs(title = "Total Sales by Price",
       x = "Price",
       y = "Total Sales")
As can be seen, the most popular price range is from $28 to $48. I think for most
people, this is a reasonable price range for clothes.
The total revenue by month is found by grouping the data frame by month and summing
the price multiplied by product_sold.
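The calculation described above can be sketched as follows, using a small hypothetical data frame (`toy`) with the same column names as the cleaned data.

```r
library(dplyr)

# Hypothetical rows; a price only counts as revenue when product_sold is 1.
toy <- data.frame(month = c(4, 4, 5, 5),
                  price_usd = c(38, 28, 48, 33),
                  product_sold = c(1, 0, 1, 1))

revenue_by_month <- toy %>%
  group_by(month) %>%
  summarise(revenue = sum(price_usd * product_sold, na.rm = TRUE), .groups = "drop")

print(revenue_by_month)
```

Multiplying by the 0/1 flag before summing means unsold clicks contribute nothing to the total.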
Looking at figure 17, the revenue of “trousers” and “skirts” is almost the same. This is
probably because the price of “skirts” is higher than that of “trousers” while the
number of products sold is similar.
Then, I calculated and plotted the chart of revenue for each category by month.
As can be seen in figure 18, in April the revenue of “skirts” is lower than that of
“trousers”, but not by much. In the other months, the revenue of “skirts” is almost the
same as or higher than that of "trousers". In addition, even though the sales of the
“sale” category are decent, its low prices mean that this category does not bring much
revenue to the clothing store.
1.7. Conclusion
After the analysis, I can conclude that the shop earns substantial revenue from the
“trousers” and “skirts” categories. They have the highest prices and appear the most on
the first and second pages of the website, where they reach more customers. On the
other hand, despite decent sales, the “sale” category did not contribute significantly to
the revenue of the e-shop because of its low prices.
Additionally, neutral colours are clearly the most favoured among customers, whereas
brighter colours are less appealing.
In addition, clothes in neutral colours sold the most, so the store could focus more on
products in these colours. It might also benefit from promoting products in its
mid-range price band, which runs from $28 to $48.
3. input.csv
3.2. Import file
First, the data file provided does not have the “.csv” extension stated in the
coursework but is an “.xlsx” file instead, so I cannot use read.csv() to read it. To read
the file, I first import a library that supports reading Excel files, ‘readxl’.
Next, I use the read_excel() function to import the ‘input.xlsx’ file.
However, as shown in the figure above, the ID seems to be duplicated, so the values no
longer match their headers. To clean this data, I use an approach similar to the one I
used to separate the values into columns, with some extra steps.
First, I clear the values in the “name” column by setting them to NA. Then, I shift the
values of each column into the column on its left (for example, the values of the
“salary” column move to the “name” column). Finally, I set the values of the “dept”
column to NA, split the data in the “start_date” column, and use it to fill the “dept”
column.
Before finding the mean of the salary column, I convert it to numeric (double) with the
as.numeric() function and then use the mean() function. I also assume that the
currency is pounds (£).
[1] 656.8813
To better understand the distribution of the salary between the departments, I also
calculate the mean salary by department and visualise it.
Looking at the figure above, the mean salary of the “Finance” department is the
highest, significantly higher than that of the “Operations” department.
If we sort the salary column from lowest to highest, we can see that the salaries of
employees from the “IT” department are among the lowest, even though IT has the
most employee entries in this dataset.
Since I have no information about the location of this company or the positions of the
employees, I assumed that the company is located in the UK and that the employees are
ordinary officers. After researching typical salaries for each department in the UK, I
drew the following conclusions.
First, the figures are most likely weekly salaries: compared with the monthly salaries I
found, they would be too low, and as daily salaries they would be too high. A weekly
salary is the only plausible interpretation.
The salary of officers in the “Finance” department is reasonable based on the
information I found [1].
The salary of HR workers in this dataset is a little lower than the figure I found online,
which is £808 [2].
Since there is no clear information about the positions of the workers in the “IT”
department, I can only assume that they are either software developers or web
developers. In either case, the employees in this dataset receive less than the average
weekly salary of both a software developer (£1,146) [3] and a web developer (£866) [4].
The salary of officers in the “Operations” department is nearly the same as the
average weekly salary of an Operations Support Officer that I found [5].
To find the median of the salary column, I use the median() function. When finding a
median by hand, the values must be sorted first, but the median() function in R handles
the ordering internally, so no manual sorting is needed.
[1] 628.05
[1] 10621.25
[1] 103.0595
1. The mean salary in the dataset is £656.88 per week. Among the departments, the
“Finance” department has the highest average salary, while “IT” department
salaries are generally on the lower side.
2. The median salary is £628.05, indicating that half of the employees earn less than
this amount and half earn more.
3. The variance is 10621.25 (in squared pounds), and the standard deviation is
£103.06. This shows a moderate spread of salaries around the mean, suggesting
some variability but not extreme differences.
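The statistics summarised above can all be computed in a few lines of R. The sketch below uses a hypothetical salary vector, so its results differ from the figures in this report. Base R has no built-in function for the statistical mode, so `stat_mode()` is a helper defined here for illustration, not necessarily the method used in the original analysis.

```r
# Hypothetical weekly salaries; the real values come from the cleaned "salary" column.
salary <- c(520.5, 610.0, 610.0, 700.25, 843.25)

# Tabulate the values and return the most frequent one.
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

salary_mean <- mean(salary)
salary_median <- median(salary)
salary_mode <- stat_mode(salary)
salary_var <- var(salary)  # sample variance (divides by n - 1)
salary_sd <- sd(salary)    # square root of the sample variance
```

Note that `var()` and `sd()` return sample statistics, and that the standard deviation is always the square root of the variance, which is a quick consistency check on reported values.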
4. star wars
4.2. Access file
To access the “starwars” dataset, it is essential to import the ‘dplyr’ library. You must
install the library first if you have not.
As can be seen in the figure above, there are quite a lot of missing values in this
dataset, in addition to incorrectly entered values. After looking through the whole
dataset, I also noticed “none” and “unknown” values.
So, to make normalising the data easier, I changed all these values to NA.
To extract the data about actors whose eyes are not black and who are taller than
150 cm, “height” and “eye_color” are the only columns I need. So, to save time on
cleaning, I clean only these two columns rather than the whole dataset.
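The replacement of “none” and “unknown” with NA described above could be sketched as follows. This is one possible approach using `across()` on all character columns of the raw ‘dplyr’ dataset; the exact code used in this report is not shown, so treat it as an illustration.

```r
library(dplyr)

# Replace the placeholder strings with NA in every character column.
starwars_df <- starwars %>%
  mutate(across(where(is.character),
                ~ replace(.x, .x %in% c("none", "unknown"), NA)))

# Count the missing values in the two columns needed for the extraction.
colSums(is.na(starwars_df[, c("height", "eye_color")]))
```

Restricting `across()` to `c(eye_color)` instead of all character columns would limit the cleaning to the column actually needed.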
4.4.2. Extract actors taller than 150 cm and without black eyes
To extract actors taller than 150 cm and without black eyes, firstly, I extract the actors
that are taller than 150 cm as a new data frame named “starwars_150” using the
filter() function, and then extract actors whose eye colour is not black from that
data frame.
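The first filtering step could look like the sketch below. It runs on the raw ‘dplyr’ dataset rather than the cleaned data frame from this report, so the row count it produces may differ from the counts reported here.

```r
library(dplyr)

# Keep only actors taller than 150 cm; rows with NA height are dropped by filter().
starwars_150 <- filter(starwars, height > 150)

# Shortest remaining height, as a quick check on the threshold.
min(starwars_150$height)
```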
As a result, there are 73 actors taller than 150 cm, the shortest of them being 157 cm
tall.
# no black eye_color
non_black_150 <- filter(starwars_150, eye_color != "black")
Figure 28. Actors whose eyes are not black and taller than 150
The “non_black_150” data frame stores information about the actors who are taller
than 150 cm and do not have black eyes. 63 of the 87 actors meet both criteria.
Figure 29. Height distribution of actors taller than 150 cm vs all actors
Looking at the histogram above, it is noticeable that most actors are taller than
150 cm. The percentage of tall actors among all actors is also calculated.
So, out of 87 actors, nearly 84% are taller than 150 cm.
Next, I plot a bar chart to show the distribution of eye colours and to compare all
actors with the actors without black eyes.
ggplot() +
  geom_bar(data = starwars_df, aes(x = eye_color),
           fill = 'blue', alpha = 0.5, position = 'dodge') +
  geom_bar(data = starwars_nonblack, aes(x = eye_color),
           fill = 'red', alpha = 0.5, position = 'dodge') +
  labs(title = 'Eye Color Distribution', x = 'Eye Color', y = 'Count')
As illustrated in figure 30, despite NA values, most of the actors do not have black eyes.
Also, 85% of the actors do not have black eyes.
Although most actors are taller than 150 cm, only about 72% of the actors are both
taller than 150 cm and without black eyes.
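The two height-related percentages follow directly from the counts reported in this section (73 and 63 out of 87 actors). As a small worked check, with variable names chosen here for illustration:

```r
total_actors <- 87      # rows in the starwars dataset
taller_than_150 <- 73   # actors taller than 150 cm, as reported in this section
tall_non_black <- 63    # actors taller than 150 cm without black eyes

pct_tall <- round(100 * taller_than_150 / total_actors, 1)
pct_tall_non_black <- round(100 * tall_non_black / total_actors, 1)

print(c(pct_tall, pct_tall_non_black))  # 83.9 and 72.4
```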
4.5. BMI
Body Mass Index (BMI) is a numerical value derived from a person’s mass (weight) and
height. It is used to identify whether a person is underweight, normal weight,
overweight, or obese.
To fill in the missing values, I once again looked up the weights of the actors
concerned. I could only find the weight of 6 actors, so I dropped the remaining rows
with NA values.
The "starwars_df" data frame is now left with 65 of the original 87 actors. With both
mass and height available, we can calculate the BMI of the actors.
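BMI is mass in kilograms divided by the square of height in metres. Since height in this dataset is stored in centimetres, it is divided by 100 first. A minimal sketch on a two-row data frame (`toy`, a stand-in for the cleaned data) is shown below.

```r
library(dplyr)

# Two example rows: height in cm, mass in kg.
toy <- data.frame(name = c("Luke Skywalker", "Jabba Desilijic Tiure"),
                  height = c(172, 175),
                  mass = c(77, 1358))

# BMI = mass (kg) / height (m)^2
toy <- toy %>%
  mutate(bmi = mass / (height / 100)^2)

print(toy)
```

For the second row this gives 1358 / 1.75^2, which is roughly 443.4.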
library(ggplot2)
ggplot(starwars_df,
aes(x = height,
y = bmi)) +
geom_point() +
labs(title = "Height vs BMI",
x = "Height",
y = "BMI")
Looking at the plot above, there is one actor with a BMI of more than 400, which is
unusually high. Looking through the data frame again, this actor is “Jabba Desilijic
Tiure”, whose height is just 175 cm but whose mass is 1358 kg. This actor is not human
but a Hutt, so it is understandable that he differs from the other actors.
ggplot(starwars_df,
aes(x = mass,
y = bmi)) +
geom_point() +
labs(title = "Mass vs BMI",
x = "Mass",
y = "BMI")
In figure 34, it is noticeable that the lower the mass, the lower the BMI. This is
expected: since BMI is calculated as mass / (height/100)^2, it is directly influenced
by mass. To check this, I plotted a correlation matrix between the “height”, “mass”
and “bmi” columns.
The correlation between “bmi” and “mass” is very high (nearly 1), which means that
BMI increases almost linearly as mass increases. On the other hand, the correlation
between “bmi” and “height” is relatively weak, which means that height has much less
influence on the resulting BMI in this dataset.
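A correlation matrix for these three columns can be computed as sketched below. The sketch uses the raw ‘dplyr’ dataset with incomplete rows removed, not the manually completed data frame from this report, so its coefficients may differ somewhat from those in the figure.

```r
library(dplyr)

# Keep rows with both measurements, compute BMI, and keep the numeric columns.
bmi_df <- starwars %>%
  filter(!is.na(height), !is.na(mass)) %>%
  mutate(bmi = mass / (height / 100)^2) %>%
  select(height, mass, bmi)

# Pearson correlation between height, mass and bmi.
cor_matrix <- cor(bmi_df)
print(round(cor_matrix, 2))
```

Because a single extreme value can dominate a Pearson coefficient, it can be informative to rerun the calculation with the highest-mass row excluded and compare the results.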
5. References
[1] (No date) Finance officer salaries - check average finance administrator salary rate on Jooble. Available at: https://uk.jooble.org/salary/finance-officer#weekly (Accessed: 11 August 2024).
[2] (No date) HR salaries - check average HR executive salary rate on Jooble. Available at: https://uk.jooble.org/salary/HR#weekly (Accessed: 11 August 2024).
[3] (No date) Software developer salaries - check average developer salary rate on Jooble. Available at: https://uk.jooble.org/salary/software-developer#weekly (Accessed: 11 August 2024).
[4] (No date) Web developer salaries - check average developer salary rate on Jooble. Available at: https://uk.jooble.org/salary/web-developer#weekly (Accessed: 11 August 2024).
[5] (No date) Operations support officer salaries - check average flight operations officer salary rate on Jooble. Available at: https://uk.jooble.org/salary/operations-support-officer#weekly (Accessed: 11 August 2024).
[6] Star wars D6 the role playing game (2024) SWRPGGM. Available at: https://swrpggm.com/ (Accessed: 12 August 2024).