
Descriptive Analytics Coursework

A statistical analysis of cars listed on Autotrader website. Use of Python, MS Excel, and SPSS to generate project outcomes. Use of MS Word for reporting.

Uploaded by

Hassaan Haider
Copyright
© All Rights Reserved

Visualization and Statistical Analysis of Car Market
A report on Descriptive Analytics

Syed Hassaan Haider (Student) 3/31/23 Descriptive Analytics


Table of Contents

Part 1: Introduction
Part 2: Data Visualization
Part 3: Descriptive Statistics
Part 4: Assumption of Second-hand car prices using Confidence Intervals
Part 5: Comparison of Average prices with the UK market average
Part 6: Car Features Affecting the Car Price
Part 7: Regression Analysis
Part 8: The Use of Coefficients
Part 9: Residual Analysis
Part 10: Appendix

Part 1: Introduction
This report presents a descriptive analysis of a car dataset. The main purpose of the report is
to analyze and provide insightful information about car characteristics, including the price,
mileage, age, CC, transmission, car type, fuel type and horsepower.
The main statistical techniques used in this report are data visualization, descriptive
statistics, correlation analysis, regression analysis, and residual analysis.
Problem Description: In this report we use data on one model of one specific make of car,
the Mercedes-Benz A Class. The data were obtained from listings within a 30-mile radius of
the London postcode SW1 11AA.
To collect the data, we used tools such as ‘Instant Data Scraper’ and ‘Octoparse’, and we
also wrote down, at random, the car prices and the other features listed on the website, so
that we could carry out our statistical analysis of the effect of these features (such as
mileage, fuel type and year) on the price of our chosen model. A population of 275 cars
was collected, and from that 99 cars were randomly selected.
Source of data: Autotrader.co.uk
Limitations/Problems in data: Since not all car listings are publicly available and we also
applied restrictions on the Autotrader.co.uk website, the data might not be fully
representative of the entire car market in our selected area. Also, there might be some
missing or inaccurate information, which could reduce the accuracy of the results obtained
through statistical analysis.
Sampling Method:
Due to our constraints of using 5-year data, a certain mile radius, a specific make and a
specific model of car, the overall data collection from the website was done using the
‘Convenience sampling’ method.
However, the data we used for our analysis were selected using the ‘Random sampling’
method. A number was given to each car and each number was then assigned a random
number using Excel’s random function. The data were then sorted in ascending order of
the random number, which allowed us to choose the first 99 cars.
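The procedure described above (assign each car a random key, sort by the key, keep the first 99) can be sketched in Python. This is only an illustration: the function name `draw_random_sample`, the car IDs 1 to 275 and the seed are assumptions, and the original selection was done in Excel, not with this code.

```python
import random

def draw_random_sample(population_ids, sample_size, seed=None):
    """Attach a random key to every ID, sort by key, keep the first rows."""
    rng = random.Random(seed)  # seed is optional; used here for repeatability
    keyed = [(rng.random(), car_id) for car_id in population_ids]
    keyed.sort()  # ascending order of the random key
    return [car_id for _, car_id in keyed[:sample_size]]

# Hypothetical IDs standing in for the 275 listed cars.
population_ids = list(range(1, 276))
sample_ids = draw_random_sample(population_ids, 99, seed=42)
print(len(sample_ids))
```

Because every car receives an independent random key, each car has the same chance of landing in the first 99 rows, which is what makes this a simple random sample of the collected population.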
Random Sampling: A sampling method in which the data are chosen at random from the population.
This method helps to ensure that each member of the population has an equal chance of
being selected. It reduces sampling bias and improves the generalizability of the results
obtained.

Use of 5-year range:

If the range is too small there may not be enough data to work with, and if the range is too
large, there might be too much variation in the data. Also, the range we chose captures
the cars currently available in the market. To ensure the relevance and proper reflection of
the current auto market, we chose a 5-year range of data to analyze current trends.

Part 2: Data Visualization


Box and Whisker Plot:

Boxplots are one of the most efficient plots for describing data. They show us
the range, median, smallest value, largest value, and outliers in our data. This is the
appropriate level of detail for the target audience, therefore following the IBCS standard of
‘Appropriateness’.
From the above plot we can see that although the medians of both car types are almost
the same (approximately £23000), the most expensive Hatchback costs far more than the
most expensive Saloon. The price of around £50000 is an outlier and should not be given
much weight, as it is not representative of the rest of the data.
The box plot of Hatchback shows no skewness, whereas the Saloon box plot is right-skewed
because the median line (the line in the middle of the box) is closer to Q1 (the line where
the box starts) than to Q3 (the line where the box ends).
One box plot is placed in the middle of the right half of the x axis and the other in
the middle of the left half of the x axis, hence following the Gestalt principle of similarity.
Furthermore, Tufte’s principle of maximising the data-to-ink ratio and the IBCS standard of
‘Simplification’ have been followed, as one of the axis labels and the gridlines have been
removed without impacting the readers’ understanding of the data.

Stacked bar chart:

Chart title: Consumers prefer Automatic Mercedes A class over Manual

Year    Automatic    Manual
2022        5           0
2021       17           1
2020       17           6
2019       25           5
2018       22           1

x axis: Number of Cars (0 to 35)

This chart has been created in Excel and is sorted by year to show the number of cars in our
sample for each of the 5 years, split by transmission type. We can see from the data that
the highest number of cars is from 2019 (30 cars) and the lowest number is from 2022
(5 cars).
Moreover, it is quite evident that Automatic cars are far more common than Manual cars,
which suggests that more people prefer them, perhaps because most people find an
automatic car easier to drive.
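The totals quoted above can be re-derived from the data labels on the stacked bar chart. The sketch below is only a cross-check; it assumes the first (larger) number on each bar is the Automatic count and the second is the Manual count, and the variable names are illustrative.

```python
# Counts read from the chart labels: year -> cars by transmission type.
counts = {
    2018: {"Automatic": 22, "Manual": 1},
    2019: {"Automatic": 25, "Manual": 5},
    2020: {"Automatic": 17, "Manual": 6},
    2021: {"Automatic": 17, "Manual": 1},
    2022: {"Automatic": 5, "Manual": 0},
}

# Total cars per year (height of each stacked bar).
totals_by_year = {year: sum(by_type.values()) for year, by_type in counts.items()}

# Total cars per transmission type across all years.
totals_by_transmission = {"Automatic": 0, "Manual": 0}
for by_type in counts.values():
    for transmission, number in by_type.items():
        totals_by_transmission[transmission] += number

busiest_year = max(totals_by_year, key=totals_by_year.get)
quietest_year = min(totals_by_year, key=totals_by_year.get)
print(totals_by_year, totals_by_transmission, busiest_year, quietest_year)
```

Under this reading the bars add up to 86 Automatic and 13 Manual cars, which matches the sample size of 99.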
It should be highlighted that all the bars have a similar shape for each year,
and that the Automatic and Manual transmission types are colored green and blue
respectively, hence following the Gestalt principle of Similarity. Furthermore, Tufte’s
principle of highlighting the important has been followed, as the chart title is
shown in the same color (green) as the category that carries more weight in the data:
Automatic cars (colored green) outweigh Manual cars
(colored blue). The heading also conveys an important message, following the IBCS
standards of ‘Concept’ and ‘Emphasis’. The numbers are also placed inside the bars to
support Tufte’s principle of encouraging comparison.

Scatterplot:

The plot follows the Gestalt principle of similarity, as a consistent shape and color are used
for each group of data points. For instance, ‘Automatic Hatchback’ cars share a red color and
a circle shape, whereas ‘Manual Hatchback’ cars share a light red color and a square
shape. Different shades of red and of blue distinguish the car types (red for Hatchback),
and the circle and square shapes indicate the Automatic and Manual transmission types
respectively. This encoding also respects the Gestalt principle of common fate.
According to Tufte's principle of maximizing the data-to-ink ratio, we have eliminated
unnecessary chart junk by removing the background grid and borders. Additionally,
consistent scales, labels and visual encoding have been used to facilitate
comparisons between data points, hence following Tufte’s principle of encouraging
comparison.
The scatter plot depicts a negative relationship between the dependent variable (price) and
the independent variable (mileage), which means that as the mileage of the car increases, that
is, the more the car has travelled, the price of the car tends to decrease.
Two trendlines were added instead of four so that the relationship is easier to read.
The downward-sloping trendlines indicate that cars with greater mileage
have suffered more wear and tear and hence have a lower price in the
used car market.

Further interpretation shows that Manual Saloon cars tend to be more expensive overall
than Manual Hatchback cars, while Automatic Saloon cars tend to be cheaper overall
than Automatic Hatchback cars, so the effect of car type on price reverses with
the transmission type.
Independent variable (X): A variable whose value is not changed by the other
variables and which is used to explain changes in the dependent variable.

Dependent variable (Y): A variable whose value changes when the values of the
independent variable change.

Scatterplot: A plot used to observe the relationship between two continuous variables. It
usually uses dots to represent the values.

Trendline: A line on a chart which shows the overall direction of the data values.

Histogram:

To maximize the resolution of the data, following Tufte’s principle, Python’s matplotlib and
seaborn libraries were used to create this histogram. The use of a dark purple color for the
bars aids comprehensibility and memorability.

In the above histogram, we grouped the data by price to check how frequently each price
range occurs within the data. The taller and more numerous bars towards the left
side of the graph (circled in red) show that there are more cars with lower prices than
there are with higher prices.
The red circle groups those data points, following the Gestalt principle of Enclosure. For
example, the greatest number of cars lie between GBP 17000 and GBP 22000. We labeled
the axes clearly, writing down units of measurement where necessary, hence following
IBCS standards. The graph also follows the Gestalt principle of ‘Similarity’, as the bars
representing the Price are all colored in the same purple.

Column Chart:

We created a column chart to check whether the price of the Mercedes A-Class is affected by
the type of the car, either ‘Hatchback’ or ‘Saloon’. Here we can clearly see that the average
prices of the two types are very similar, at around £23800. This means that our data
sample cannot tell us which type of car consumers would prefer.
Although the axes are shown, we have added the price labels ‘£23761.70’ and
‘£23853.16’ to guide the readers’ eyes and have removed chart junk by getting rid of
the top and right spines, hence following Tufte’s principles.

Part 3: Descriptive Statistics
Descriptive statistics allow us to summarize the basic features of the dataset in our
study, showing the measurements of the data sample. For us analysts, it is the first
stepping stone towards the statistical analysis of the dataset.

We can observe from the table above that the minimum price of the Mercedes A Class is
£13,999 and the maximum is £49,990. Such a large range tells us that consumers have a wide
variety of pricing options to choose from.
Standard deviation measures how far values typically lie from the mean. For our dataset, the
standard deviation is £5,328, which shows that prices of our model typically deviate by about
£5,328 from the mean price of £23,834.

25% value: the lower quartile of the data. 25% of the data lies below this value.

Median: the value which splits the data into two equal halves.

75% value: the upper quartile of the data. 75% of the data lies below this value, and 25%
lies above it.

Out of our total of 99 cars, 25% cost less than or equal to £19999, and 75% of the cars
cost less than or equal to £26802, which means that only a quarter of the cars lie between
the 75th percentile and the maximum price. Similarly, the mileage of the cars ranges from 245
to 85,126 miles, with a mean of 27,290.86 miles and a standard deviation of 15,427.87
miles. CC: The engine displacement (CC) ranges from 1.3 to 2.1, with a mean of 1.49 and a
standard deviation of 0.27. This standard deviation is quite low, which suggests that most of
the cars have a similar engine size.
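As a cross-check on the outlier discussed with the box plot, the quartiles reported above can be fed into the common 1.5 × IQR rule. This is a minimal sketch that assumes the box plot used that convention (the report does not state it); `tukey_fences` and `is_outlier` are illustrative names.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the lower and upper outlier fences for the given quartiles."""
    iqr = q3 - q1  # interquartile range
    return q1 - k * iqr, q3 + k * iqr

def is_outlier(value, q1, q3):
    lower, upper = tukey_fences(q1, q3)
    return value < lower or value > upper

# Quartiles, minimum and maximum price as reported in the text (in GBP).
Q1, Q3 = 19999, 26802
MIN_PRICE, MAX_PRICE = 13999, 49990

lower_fence, upper_fence = tukey_fences(Q1, Q3)
print(lower_fence, upper_fence)
print(is_outlier(MAX_PRICE, Q1, Q3), is_outlier(MIN_PRICE, Q1, Q3))
```

With these quartiles the maximum price of £49,990 falls above the upper fence while the minimum of £13,999 stays inside the lower fence, which is consistent with the single high-priced outlier described in Part 2.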

Moving on, we calculated the price statistics for each fuel type, ‘Diesel’ and ‘Petrol’, as
shown above. The mean prices of Diesel and Petrol cars are £22157 and £24599.7 respectively.
This indicates that, on average, petrol cars in our sample are more expensive than diesel
cars. Also, there are more Petrol cars than Diesel cars. Additionally, it must be noted that,
quite surprisingly, both the cheapest and the most expensive cars in our data are
petrol cars.

Part 4: Assumption of Second-hand car prices using Confidence Intervals
Confidence interval (CI): an interval which provides a range of values within which we can be
confident that the true population mean lies.

Our assumption for the mean:

When we take many random samples of the same size from a population with a finite mean and
variance, the distribution of the sample means tends to follow a normal distribution. In our
problem, we used this principle, the Central Limit Theorem (CLT), to treat the sample mean as
an estimate of the population mean, so that we can carry out the confidence interval analysis
next.

In the chart above, we carried out our analysis with a 95% CI. It gave us an interval with a
lower bound of £22772 and an upper bound of £24897 around our sample mean of £23834
(used as the estimate of the population mean, based on the CLT). This suggests that we can
be 95% confident that the mean price of a second-hand Mercedes A Class lies within the
interval. Hence, buyers whose budget falls within this range should be able to afford a
typically priced car.
After that we decided to be even more certain and found that the average price of the
‘Mercedes A Class’ lies within the 99% CI (mentioned in the table below) of £22428 to
£25241. As we can see, the 99% CI is wider, which decreases the precision of the estimate.
Therefore, a consumer wanting to buy the car would get a more precise estimate from the 95%
interval; however, a few consumers might still prefer higher confidence over precision when
deciding whether to buy the car purely on the basis of estimation.
A skewness of 0.243 suggests the distribution is slightly right-skewed, with the majority of
the data clustered around the mean. (For further explanation about skewness, please see
Appendix)
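The two intervals can be reproduced from the summary statistics reported in this report (n = 99, mean £23834.70, standard deviation £5327.608). The sketch below is an illustration, not the SPSS procedure itself: the critical values in `T_CRITICAL` are rounded t-table values for 98 degrees of freedom and are an assumption of this example.

```python
import math

# Summary statistics reported in the text.
n = 99
sample_mean = 23834.70
sample_sd = 5327.608

# Assumption: rounded two-sided Student-t critical values for 98 degrees
# of freedom, looked up from a t-table rather than computed here.
T_CRITICAL = {0.95: 1.9845, 0.99: 2.6269}

def confidence_interval(mean, sd, size, level):
    """Return (lower, upper) bounds of the CI for the mean."""
    standard_error = sd / math.sqrt(size)
    margin = T_CRITICAL[level] * standard_error
    return mean - margin, mean + margin

ci95 = confidence_interval(sample_mean, sample_sd, n, 0.95)
ci99 = confidence_interval(sample_mean, sample_sd, n, 0.99)
print([round(bound) for bound in ci95])
print([round(bound) for bound in ci99])
```

Rounded to the nearest pound, the bounds agree with the reported intervals, and the 99% interval is wider because its critical value is larger while the standard error stays the same.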

Part 5: Comparison of Average prices with the UK market average
We used the website motors.co.uk to find the UK market average price of the
Mercedes-Benz A Class and calculated an average over the past 5 years by summing the yearly
averages and dividing by the number of years (5), to get an average price of £25720.8.
One-sample t-test (Hypothesis testing):
Now we conduct a one-sample t-test in which we compare the
mean of our sample with the mean of the population (in this case, the UK market average
price) to check whether the sample is consistent with the population.

Null hypothesis, represented with H0, states that there is no significant difference between the
sample mean and the population mean.

Alternative hypothesis, represented with H1, states that there is a significant difference between the
sample mean and the population mean.

For our sample:

H0: There is no difference between the average price of the Mercedes-Benz A Class in the
sample and the average price of the Mercedes-Benz A Class in the UK.
H1: There is a difference between the average price of the Mercedes-Benz A Class in our
sample and the average price of the Mercedes-Benz A Class in the UK.
A test was run in SPSS at the 95% confidence level, which gave us the
following outputs:

The results show that the average price of cars in our sample is £23834.70, with a standard
deviation of £5327.608. The t-value of -3.523 lies outside the acceptance region
between the critical values of t, which are approximately -1.96 and +1.96. The p-value
measures the strength of evidence against the null hypothesis, and the null hypothesis is
rejected if the p-value is below 0.05. In our case it is even less than 0.001, so it provides
strong evidence that we must reject the null hypothesis.
Hence, we reject the null hypothesis and conclude that there is a significant difference
between our sample mean and the population mean (UK) for the Mercedes-Benz A Class.
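The test statistic can be approximated by hand from the figures in this report. The sketch below uses the five yearly market averages listed in the Appendix and the sample summary statistics; the critical value 1.9845 is a rounded t-table value for 98 degrees of freedom and is an assumption of this example (the text quotes the large-sample value 1.96, which leads to the same decision).

```python
import math

# Yearly UK market averages listed in the Appendix (GBP).
yearly_market_averages = [19384, 22538, 25415, 28630, 32637]
market_mean = sum(yearly_market_averages) / len(yearly_market_averages)

# Sample summary statistics reported in the text.
n = 99
sample_mean = 23834.70
sample_sd = 5327.608

# One-sample t statistic: distance of the sample mean from the
# hypothesised mean, measured in standard errors.
standard_error = sample_sd / math.sqrt(n)
t_statistic = (sample_mean - market_mean) / standard_error

# Assumption: rounded two-sided 5% critical value for 98 degrees of freedom.
T_CRITICAL_95 = 1.9845
reject_null = abs(t_statistic) > T_CRITICAL_95

print(market_mean, round(t_statistic, 2), reject_null)
```

The hand calculation gives a t statistic of about -3.52, close to the SPSS value of -3.523; the small difference is expected because the inputs here are rounded.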
Overall, we can conclude from this test that people should not decide on buying
this car based only on the market data provided on the internet, but should also read a
sample report like ours, or any other, to get the best analysis of price.

Part 6: Car Features Affecting the Car Price
Now we are going to find out how each feature of the car relates to its price. We
perform correlation analysis to do that.
Correlation analysis is a statistical method that analyses the magnitude and direction of the
relationship between two or more variables. This analysis helps us determine how strong the
relationship between the variables is and whether it is positive or negative.

Correlation Coefficient: measures the degree to which two variables are related to each other.

In this analysis, we are checking how strongly the price of our chosen Mercedes A Class is
related to the other variables, and to what extent the age, mileage, CC, horsepower, the
type of car, its transmission type, and the fuel type affect the price of our car.
Note that, in a two-tailed test, we analyze the ‘Pearson correlation’
coefficient together with its significance level. If the correlation coefficient is greater than
0.500, a strong positive correlation is indicated, whereas if the correlation coefficient is
less than -0.500, a strong negative correlation is indicated. Also, a correlation closest
to 1 or -1 is the strongest, whereas one closest to 0 is the weakest.
From the above correlation matrix, we can see that:
Age and mileage have a significant negative correlation with the price, indicated by -0.577
and -0.603 respectively. This suggests that the older the car, and the more miles it has been
driven, the lower its price. Therefore, buyers looking for a cheaper Mercedes
A Class should look at older cars with a higher mileage. However, it can be assumed that
those cars would not be of as good a quality as newer cars that have been driven less.

Secondly, CC and horsepower have a significant positive relationship with the price,
indicated by 0.344 and 0.674 respectively. This shows that the greater the engine size (CC)
and the horsepower, the higher the price of the Mercedes A Class. Therefore,
buyers looking for a larger engine and more power must expect
to pay a higher price for it. Note that horsepower has a
stronger relationship with price than CC, since its value is closer to 1.
Finally, the type of car, its transmission type, and its fuel type have a non-significant
relationship with the price, indicated by -0.007, -0.096 and 0.143 respectively. This means
that buyers need not weigh these three factors when considering the price of the car.
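The reading rule used in this part (strength from the absolute value, direction from the sign) can be written down as a small helper and applied to the coefficients quoted above. This is a sketch only: the 0.5 cut-off follows the rule stated in the text, the label "weaker" is an illustrative choice, and the helper says nothing about statistical significance, which depends on the p-values in the SPSS output.

```python
# Pearson correlations with price, as quoted in the text.
price_correlations = {
    "Age": -0.577,
    "Mileage": -0.603,
    "CC": 0.344,
    "Horsepower": 0.674,
    "Type of car": -0.007,
    "Transmission": -0.096,
    "Fuel type": 0.143,
}

def describe_correlation(r, strong_threshold=0.5):
    """Classify a correlation by strength (absolute value) and direction (sign)."""
    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "none"
    strength = "strong" if abs(r) > strong_threshold else "weaker"
    return strength, direction

# Variable whose correlation with price is largest in absolute value.
strongest = max(price_correlations, key=lambda name: abs(price_correlations[name]))

for name, r in price_correlations.items():
    print(name, r, describe_correlation(r))
print("Strongest:", strongest)
```

Comparing absolute values rather than raw values is what lets mileage (-0.603) rank as a stronger relationship than CC (0.344) even though it is the smaller number.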
Multicollinearity – high correlation between two or more independent variables.

A correlation above 0.7 between two independent variables could lead to multicollinearity
issues in the regression model we will be analyzing in the next step. Fortunately, there seem
to be no multicollinearity issues with our model.

Part 7: Regression Analysis


Regression analysis models the relationship between one dependent variable and
one or more independent variables. A line of best fit is then generated to describe the
relationship.

While carrying out our correlation analysis, we found that the variables
“Type car”, “Transmission”, and “Fuel type” had little or no correlation with price. Hence, to
make our model as parsimonious as possible, we dropped these three variables and used the four
variables “Horsepower”, “CC”, “Age” and “Mileage” in a multiple linear regression, keeping
the model as free of errors as possible for better accuracy of results.
Parsimonious model: A model that balances complexity and accuracy by using the
minimum number of predictors needed to explain the dependent variable.

Multiple linear regression (MLR): a technique used to establish a linear relationship
between one dependent and two or more independent variables.

Dummy variables (binary variables which take a value of 0 or 1) were used to represent the
categorical variables in the MLR analysis when deciding which variables to drop.

After calculating our statistics using SPSS, the summary for the most parsimonious
model is shown below. The three variables which were removed had little to no
significance for the price of our selected car. A p-value greater than 0.05 for each of them
further confirms the non-significance of those three variables.
The independent variable “horsepower” has the strongest
positive impact on the dependent variable price, since it has the highest coefficient of 0.674.
This means the Mercedes A Class will cost more as its horsepower increases.
On the other hand, the strongest negative impact comes from the independent variable
“mileage”, which has a coefficient of -0.603. This means that the more miles the car has
travelled, the less it will cost.

R-squared: the proportion of variance in the dependent variable explained by the
independent variables.

Adjusted R-squared: R-squared adjusted for the number of independent variables in the model.
It is a more accurate measure of goodness of fit since unnecessary variables are penalized.

Both R-squared and Adjusted R-squared values typically lie between 0 and 1, where:

• 1 indicates all variance in the dependent variable is explained by the independent variables.
• 0 indicates no variance is explained by the independent variables.

Hence, for our model, the adjusted R-squared value of 0.805 illustrates that 80.5% of the
variance in price is explained by our four predictor variables mentioned in the model
summary. This tells us that our model is a good fit. However, the underlying assumptions of
linear regression must be satisfied to provide better judgement of the model.
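The penalty that adjusted R-squared applies can be made concrete with the standard formula, adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1). The sketch below assumes the regression was fitted on all 99 cars with 4 predictors; the unadjusted R-squared it backs out is only what the formula implies under that assumption, not a figure taken from the SPSS output.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Standard adjustment of R-squared for the number of predictors."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

def implied_r_squared(adj_r_squared, n_obs, n_predictors):
    """Invert the adjustment to recover the unadjusted R-squared."""
    return 1 - (1 - adj_r_squared) * (n_obs - n_predictors - 1) / (n_obs - 1)

# Assumption: 99 observations and 4 predictors (Age, Mileage, CC, Horsepower).
N_OBS, N_PREDICTORS = 99, 4
REPORTED_ADJ_R2 = 0.805

implied = implied_r_squared(REPORTED_ADJ_R2, N_OBS, N_PREDICTORS)
print(round(implied, 3))

# Round trip: adjusting the implied value gives back the reported figure.
print(round(adjusted_r_squared(implied, N_OBS, N_PREDICTORS), 3))
```

With 99 observations and only 4 predictors the penalty is small, so the adjusted and unadjusted values stay close; the gap would widen if many weak predictors were added.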

Part 8: The Use of Coefficients

The multiple linear regression equation can be constructed, and hence the price can easily be
calculated, using the table above.
Equation: Price = 19764.138 – 1642.724*(Age) – 0.112*(Mileage) + 3216.470*(CC) +
51.271*(Horsepower) + Error (E)
This equation indicates that, holding the other variables constant, a car that is 1 year older
has a price lower by approximately GBP 1643, whereas every additional mile driven decreases
the price by GBP 0.112. The signs of these coefficients are consistent with the correlations
we calculated before, where the correlations of both age and mileage with price turned out to
be negative.
Furthermore, a one-unit increase in CC increases the car price by
approximately GBP 3216, whereas every additional unit of horsepower increases its
price by GBP 51.271. These signs are also consistent with the correlations we
calculated before, where the correlations of both CC and horsepower with price
turned out to be positive.
Example demonstration to predict price:
Suppose:

• Age = 4
• Mileage = 10000
• CC = 2.0
• Horsepower = 200
• Error not to be considered for this calculation.

Solution:
Price = 19764.138 – 1642.724*(4) – 0.112*(10000) + 3216.470*(2.0) + 51.271*(200)
Price of Mercedes-Benz A Class = GBP 28760.382
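The worked example can be wrapped in a small function so that other cars can be priced in the same way. This is a minimal sketch using the coefficients from the equation above; the function name `predict_price` is illustrative, and the error term is ignored as in the worked example.

```python
# Coefficients from the fitted regression equation in the text.
INTERCEPT = 19764.138
COEFFICIENTS = {
    "age": -1642.724,        # GBP per year of age
    "mileage": -0.112,       # GBP per mile
    "cc": 3216.470,          # GBP per unit of CC
    "horsepower": 51.271,    # GBP per unit of horsepower
}

def predict_price(age, mileage, cc, horsepower):
    """Predicted price in GBP from the regression equation (error term ignored)."""
    return (
        INTERCEPT
        + COEFFICIENTS["age"] * age
        + COEFFICIENTS["mileage"] * mileage
        + COEFFICIENTS["cc"] * cc
        + COEFFICIENTS["horsepower"] * horsepower
    )

example_price = predict_price(age=4, mileage=10000, cc=2.0, horsepower=200)
print(round(example_price, 3))
```

Keeping the coefficients in one dictionary makes it easy to see each feature's contribution: in this example horsepower adds the most (51.271 × 200) and age removes the most (1642.724 × 4).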

Part 9: Residual Analysis


A residual is the difference between the actual value and the value predicted by the model for
the dependent variable. Residual analysis is an important technique for testing the quality of
fit of a regression model.

The quality of the model can be checked by analyzing whether it meets the 5
assumptions of residual analysis:
Assumption 1: Linearity assumes that there is a direct linear relation between the dependent
and independent variables.
Assumption 2: Independence assumes that there is no correlation between the residuals
(errors).
Assumption 3: Homoskedasticity assumes that the variance of the residuals (errors) is
constant across every level of the independent variables.
Assumption 4: Normality assumes that the residuals are normally
distributed.
Assumption 5: No Multicollinearity assumes there is no high correlation (the standard
threshold is 0.7) between the independent variables.

Consistent with the P-P plot, the histogram (below) is roughly bell-shaped and centered at
0, which suggests a normal distribution of the residuals.
From the P-P plot above, we can check the normality of the residuals of our model. If the dots
lie exactly on the line, then the residuals are perfectly normally distributed. Although most
of our dots are not exactly on the line and form a thin tail, the points deviate only slightly
from the line, so we can say that assumption 4 has been reasonably satisfied. We
check the remaining assumptions next.

Since the scatterplot above does not show a pattern (it is neither U-shaped nor
V-shaped) and the dots are approximately equally scattered around 0, the average
of the residuals against the predicted values is about zero with a roughly constant spread,
which suggests the acceptance of assumption 3.
Furthermore, the dots on our scatterplot are randomly scattered,
which suggests no correlation between the residuals; hence assumption 2 holds. While
conducting the correlation analysis, we found no multicollinearity; therefore
assumption 5 is also satisfied. Additionally, the scatter plot depicts a slight to moderate
positive relationship, so assumption 1 is satisfied.
Overall, all of our assumptions have been satisfied; therefore, our model is adequate and
suitable for use.

Part 10: Appendix
Explanation of skewness from Confidence Interval charts in the main body:

Skewness is a measure of the asymmetry of a probability distribution, indicating how much
the distribution deviates from being perfectly symmetric. A skewness value of 0 indicates a
perfectly symmetric distribution, while positive values indicate right-skewed (long tail to the
right) and negative values indicate left-skewed (long tail to the left) distributions.
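The sign convention described above can be checked with a short calculation. The sketch below uses the moment-based definition of skewness (third central moment divided by the 1.5 power of the second); statistical packages such as SPSS apply a small-sample adjustment, so their values differ slightly. The three lists are made-up illustrations, not data from this report.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (no small-sample adjustment)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical examples chosen only to show the sign of the measure.
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 2, 3, 4, 20]
left_tailed = [-20, -4, -3, -2, -1]

print(skewness(symmetric), skewness(right_tailed), skewness(left_tailed))
```

A single large value on the right pulls the third moment positive, while the mirrored list gives the same magnitude with a negative sign.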
The standard error measures the variability of the sample mean around the population
mean.
Equations for Linear regression:
Simple linear regression: Y = a + bX + E
Multiple linear regression: Y = a + b1X1 + b2X2 + ... + bkXk + E
Collecting the average price for 5 years of our car in the UK Market:
(19,384 + 22,538 + 25,415 + 28,630 + 32,637)/5 = 25,720.8
https://www.motors.co.uk/car-price-guide/mercedes-benz/a-class/

Box plot explained:

Source: https://byjus.com/maths/box-plot/

T-value explained:

Source: https://stats.stackexchange.com/questions/137512/the-meaning-of-t-valuetest-statistic-what-the-relation-between-student-t-dist

Normal distribution explained:

Source: https://www.scribbr.com/statistics/normal-distribution/
