
SALES STORE DATA ANALYSIS

Submitted by:
TEAM MEMBER

S. B. MEENAKSHI

INTRODUCTION

In today's competitive business landscape, data analysis and visualization have become essential
tools for making informed decisions and gaining a competitive edge. One crucial aspect of business
operations is understanding product sales. Analyzing product sales data allows businesses to identify
trends, make strategic decisions, optimize inventory, and ultimately improve profitability.

Product sale analysis visualization is the practice of using visual representations of data to gain
insights into product sales performance. It involves transforming raw sales data into meaningful
charts, graphs, dashboards, and reports that can be easily understood by stakeholders. These
visualizations provide a clear and concise overview of key metrics and help in uncovering valuable
insights that might not be apparent through raw data alone.

DATA SOURCE:

DATA SET LINK – https://www.kaggle.com/datasets/ksabishek/product-sales-data

Part 1: Import the libraries

In [1]:

# 1) Import the libraries that we will need


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:

# 2) Load the data file into a new Pandas data frame called `product_sales_df`
# by calling the `read_csv` function in the Pandas library.
product_sales_df = pd.read_csv('/kaggle/input/product-sales-data/statsfinal.csv')
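
As an optional variant (a sketch, not part of the original notebook), the Date column can be parsed into real timestamps at load time with the parse_dates and dayfirst arguments of read_csv, which would avoid the manual conversions done in later cells:

# Optional sketch: parse the day-first dates (e.g. 13-06-2010) at load time
product_sales_df = pd.read_csv(
    '/kaggle/input/product-sales-data/statsfinal.csv',
    parse_dates=['Date'], dayfirst=True)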

In [3]:

# 3) Display the size of this data frame using the `shape` data member of `product_sales_df` data frame
product_sales_df.shape

Out[3]:

(4600, 10)

In [4]:

# 3') Display the column names using the `columns` data member
product_sales_df.columns

Out[4]:
Index(['Unnamed: 0', 'Date', 'Q-P1', 'Q-P2', 'Q-P3', 'Q-P4', 'S-P1', 'S-P2',
'S-P3', 'S-P4'],
dtype='object')

In [5]:

# 4) Display the first 10 rows of this data frame by calling the `head` method with `product_sales_df`

product_sales_df.head(10)

Out[5]:

   Unnamed: 0        Date  Q-P1  Q-P2  Q-P3  Q-P4      S-P1      S-P2      S-P3      S-P4
0           0  13-06-2010  5422  3725   576   907  17187.74  23616.50   3121.92   6466.91
1           1  14-06-2010  7047   779  3578  1574  22338.99   4938.86  19392.76  11222.62
2           2  15-06-2010  1572  2082   595  1145   4983.24  13199.88   3224.90   8163.85
3           3  16-06-2010  5657  2399  3140  1672  17932.69  15209.66  17018.80  11921.36
4           4  17-06-2010  3668  3207  2184   708  11627.56  20332.38  11837.28   5048.04
5           5  18-06-2010  2898  2539   311  1513   9186.66  16097.26   1685.62  10787.69
6           6  19-06-2010  6912  1470  1576  1608  21911.04   9319.80   8541.92  11465.04
7           7  20-06-2010  5209  2550  3415   842  16512.53  16167.00  18509.30   6003.46
8           8  21-06-2010  6322   852  3646  1377  20040.74   5401.68  19761.32   9818.01
9           9  22-06-2010  6865   414  3902   562  21762.05   2624.76  21148.84   4007.06

Part 2: Overview of the data

o What is the TRUE DATA TYPE of each variable: categorical (i.e. each value represents a category or class of things), discrete numeric (i.e. represents some countable quantity), continuous numeric, or neither of these? Be careful: do not confuse this with the data type of the columns; you really need to read the descriptions of the variables and understand what they represent (a sketch of one plausible classification follows this list).
o Which of these variables are totally irrelevant to the main objective of predicting product price?
o How many missing values are there, and in which columns? You need to call the isna().sum() method with product_sales_df.
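
Here is a sketch of one plausible classification, inferred from how the columns are grouped later in this notebook (S-Pn treated as continuous in Part 3, Q-Pn as discrete in Part 4); the dictionary below is my own illustration, not part of the original analysis:

# A sketch of one plausible classification (my own illustration). Q-Pn count
# units sold (discrete numeric), S-Pn are revenues (continuous numeric), Date
# is a date, and 'Unnamed: 0' is just a row index, irrelevant to prediction.
true_types = {'Unnamed: 0': 'row index (neither; irrelevant)',
              'Date': 'date (neither)',
              **{f'Q-P{n}': 'discrete numeric (units sold)' for n in range(1, 5)},
              **{f'S-P{n}': 'continuous numeric (revenue)' for n in range(1, 5)}}
for col, kind in true_types.items():
    print(f'{col:>10}: {kind}')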

In [6]:

# 5) Display the data types of the data frame columns using the `dtypes`
# data member of `product_sales_df`
product_sales_df.dtypes

Out[6]:

Unnamed: 0 int64
Date object
Q-P1 int64
Q-P2 int64
Q-P3 int64
Q-P4 int64
S-P1 float64
S-P2 float64
S-P3 float64
S-P4 float64
dtype: object

In [7]:

product_sales_df.isna().sum()
Out[7]:

Unnamed: 0 0
Date 0
Q-P1 0
Q-P2 0
Q-P3 0
Q-P4 0
S-P1 0
S-P2 0
S-P3 0
S-P4 0
dtype: int64

There are no missing values in any column.

Part 3: Visualize continuous numeric variables

o Create a new data frame called continuous_data_df that contains only the continuous numeric columns.
o Display a statistical summary of these columns by calling the describe method with the continuous_data_df data frame.
o Visualize the bi-variate distributions of the continuous numeric columns by calling the pairplot function from the seaborn library with the continuous_data_df data frame.
o Visually inspect the previous plot; based on this plot, which variables have a strong linear relationship and which have a strong non-linear relationship with the price of the product?
o Calculate the correlation matrix of these columns by calling the corr method with continuous_data_df, and then plot this matrix by calling the sns.heatmap function from the seaborn library.
o Based on the previous question, which pairs of variables are strongly linearly correlated (i.e. have a correlation coefficient above 0.9 or below -0.9)?

In [8]:

#1--Creating continuous_data_df
continuous_columns = ['S-P1', 'S-P2', 'S-P3', 'S-P4']
continuous_data_df = product_sales_df[continuous_columns]
continuous_data_df.head(5)

Out[8]:

S-P1 S-P2 S-P3 S-P4

0 17187.74 23616.50 3121.92 6466.91


1 22338.99 4938.86 19392.76 11222.62
2 4983.24 13199.88 3224.90 8163.85
3 17932.69 15209.66 17018.80 11921.36
4 11627.56 20332.38 11837.28 5048.04

In [9]:

#2--display a statistical summary of these columns


continuous_data_df.describe()

Out[9]:

S-P1 S-P2 S-P3 S-P4

count 4600.000000 4600.000000 4600.000000 4600.000000


mean 13066.261743 13505.984848 17049.910800 8010.555000
std 7114.340094 6909.228687 9061.330694 3546.359869
min 805.180000 1591.340000 1355.000000 1782.500000
25% 6817.085000 7403.535000 9190.965000 4962.480000
50% 13114.290000 13529.560000 17357.550000 8103.245000
75% 19248.240000 19465.385000 24763.980000 11008.720000
max 25353.660000 25347.320000 32520.000000 14260.000000

In [10]:

#3--Visualize the bi-variate distributions of the numeric columns


p1 = product_sales_df.drop(['Date','Unnamed: 0'], axis=1)
sns.pairplot(p1)

/opt/conda/lib/python3.10/site-packages/seaborn/axisgrid.py:118: UserWarning: The figure layout has changed to tight
  self._figure.tight_layout(*args, **kwargs)

Out[10]:

<seaborn.axisgrid.PairGrid at 0x78edb5a7ceb0>
There is a strong linear correlation between each Q-Pn variable and its corresponding S-Pn variable.

In [11]:

corr = p1.corr()
sns.heatmap(data=corr,annot=True)

Out[11]:

<Axes: >
The heatmap confirms the answer to the previous question: each Q-Pn is strongly linearly correlated with its corresponding S-Pn.
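
As a quick numeric check (a sketch added here, not in the original notebook): S-Pn is the revenue of product n and Q-Pn its unit sales, so a roughly fixed unit price would make S-Pn ≈ price × Q-Pn and force these pairs to be almost perfectly linear. The coefficients can be printed directly:

# A minimal sketch: print the correlation between each quantity column and
# its matching revenue column; if each product sells at a roughly fixed unit
# price, these coefficients should be near 1.
for n in range(1, 5):
    q, s = f'Q-P{n}', f'S-P{n}'
    r = product_sales_df[q].corr(product_sales_df[s])
    print(f'{q} vs {s}: r = {r:.3f}')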

Part 4: Visualize discrete numeric variables

o For each discrete variable, display its distribution of values by calling the value_counts method inside a for loop ...
o Plot the distribution of each discrete quantity column as a bar plot, by calling value_counts().plot.bar().
o Based on the previous plots, what is the most frequent value of each quantity column?
o Plot the distribution of sales by a chosen variable by calling the sns.boxplot function.
o Based on the previous plot, what can you say about the relationship between sales and the chosen variable?

In [12]:

discrete_columns = ['Q-P1','Q-P2','Q-P3','Q-P4']
for column in discrete_columns:
    print(product_sales_df[column].value_counts())
    print('\n')

Q-P1
6072 6
2357 5
5868 5
2269 5
3807 5
..
7061 1
4890 1
3580 1
1550 1
1234 1
Name: count, Length: 3488, dtype: int64

Q-P2
445 7
1109 6
3967 6
2656 6
3970 6
..
1559 1
3795 1
1879 1
2199 1
3143 1
Name: count, Length: 2667, dtype: int64

Q-P3
502 6
5210 6
4623 5
359 5
4276 5
..
1509 1
5588 1
4754 1
5668 1
5899 1
Name: count, Length: 3166, dtype: int64

Q-P4
637 9
489 9
1419 8
1568 8
934 8
..
1114 1
1884 1
630 1
289 1
1112 1
Name: count, Length: 1631, dtype: int64

In [13]:

discrete_columns = ['Q-P1','Q-P2','Q-P3','Q-P4']
for column in discrete_columns:
    product_sales_df[column].value_counts().plot.bar()
    plt.show()
Based on the bar plots, the most frequent values are:

For Q-P1: 6072

For Q-P2: 445

For Q-P3: 502 and 5210 (tied)

For Q-P4: 637 and 489 (tied)
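
The same answer can be read off programmatically (a minimal sketch, my addition): value_counts() sorts by frequency, so idxmax() returns each column's most frequent value.

for column in ['Q-P1', 'Q-P2', 'Q-P3', 'Q-P4']:
    # idxmax() on value_counts() gives the most frequent value (first if tied)
    print(column, '->', product_sales_df[column].value_counts().idxmax())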

In [14]:

sns.boxplot(x='Q-P3', y='S-P3', data=product_sales_df)


plt.xlabel('Q-P3')
plt.ylabel('S-P3')
plt.title('Distribution of S-P3 by Q-P3')
plt.show()
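
Since Q-P3 takes thousands of distinct values, the boxplot above draws one box per value and is hard to read. A sketch of a more legible variant (the binning is my addition, not in the original notebook):

# Group Q-P3 into 5 equal-width ranges so each box summarizes a range of
# quantities instead of a single value
binned = pd.cut(product_sales_df['Q-P3'], bins=5)
sns.boxplot(x=binned, y=product_sales_df['S-P3'])
plt.xticks(rotation=45)
plt.xlabel('Q-P3 (binned)')
plt.ylabel('S-P3')
plt.title('Distribution of S-P3 by binned Q-P3')
plt.show()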
In [15]:

# Trend plot for each product quantity across the whole period, smoothed with
# a 100-day moving average
df = product_sales_df.copy()  # work on a copy to avoid SettingWithCopyWarning
df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y', errors='coerce')
df = df.dropna(subset=['Date'])
df.set_index('Date', inplace=True)

products = ['Q-P1', 'Q-P2', 'Q-P3', 'Q-P4']

for product in products:
    product_data = df[product]

    # A 100-day moving average smooths out day-to-day noise
    moving_average = product_data.rolling(window=100).mean()

    plt.figure(figsize=(12, 6))
    plt.plot(df.index, moving_average, label=f'{product} - Moving Average')
    plt.xlabel('Date')
    plt.ylabel('Unit Sales')
    plt.title(f'Unit Sales Over Time for {product}')
    plt.grid(True)
    plt.legend()
    plt.show()



In [16]:

# Trend plot for each product quantity through the year 2020, smoothed with a
# 100-day moving average
df = product_sales_df.copy()  # again, work on a copy
df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y', errors='coerce')
df = df.dropna(subset=['Date'])
df.set_index('Date', inplace=True)

products = ['Q-P1', 'Q-P2', 'Q-P3', 'Q-P4']

for product in products:
    # Partial-string indexing on the DatetimeIndex selects just the year 2020
    product_data = df[product].loc['2020']

    moving_average = product_data.rolling(window=100).mean()

    # Create a clearer plot of the smoothed data
    plt.figure(figsize=(12, 6))
    plt.plot(product_data.index, moving_average, label=f'{product} - Moving Average')
    plt.xlabel('Date')
    plt.ylabel('Smoothed Unit Sales')
    plt.title(f'Smoothed Unit Sales Over Time for {product} (Year 2020)')
    plt.grid(True)
    plt.legend()
    plt.show()



# Use the plot_bar_chart function with the revenue columns and the 'Revenue' string

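The plot_bar_chart helper referenced above does not appear in this excerpt. A hypothetical reconstruction (my sketch of what it plausibly does, assuming df still carries the DatetimeIndex set earlier; not the author's actual code):

# Hypothetical reconstruction of the missing helper: sum the given columns
# per year and draw a grouped bar chart (assumes `data` has a DatetimeIndex).
def plot_bar_chart(data, columns, label):
    yearly = data[columns].groupby(data.index.year).sum()
    yearly.plot.bar(figsize=(12, 6))
    plt.xlabel('Year')
    plt.ylabel(label)
    plt.title(f'{label} per year')
    plt.legend()
    plt.show()

plot_bar_chart(df, ['S-P1', 'S-P2', 'S-P3', 'S-P4'], 'Revenue')

Called with the Q-Pn columns and a 'Unit Sales' label, the same helper would produce the yearly unit-sales chart discussed below.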
Unit sales 2011 - 2022

• P1 has the highest unit sales for each year, and its highest is in year 2014.
• We can observe that P4 has the lowest unit sales of all the products.

Revenues 2011 - 2022

• We can observe that P3 brought in the most revenue. This could be the result of multiple things: P3 was sold at a higher price than the rest, as it had only the second highest unit sales for each year.
• We can observe that P1 and P2 brought in similar revenues for each year, with P2 bringing in slightly more.
• P1, despite having the most units sold, brought in the second lowest revenue each year.

Average Month Sales 2011 - 2022

• We can observe that all products' unit sales drop in Feb.
• We can observe that Feb and Dec have the lowest sales for each product.
• For P1, Mar - Jul have the highest unit sales.
• For P2, Jan and Mar - Aug have the highest unit sales.
• For P3, May & Sep have the highest unit sales.
• For P4, sales are uniform from Jan - Dec.
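
The monthly pattern described above can be reproduced with a short sketch (my addition, again assuming df still has its DatetimeIndex):

# Average unit sales per calendar month across all years
products = ['Q-P1', 'Q-P2', 'Q-P3', 'Q-P4']
monthly = df[products].groupby(df.index.month).mean()
monthly.plot.bar(figsize=(12, 6))
plt.xlabel('Month (1 = Jan)')
plt.ylabel('Average Unit Sales')
plt.title('Average Monthly Unit Sales')
plt.show()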

CONCLUSION
In product sales analysis, it is essential to consider factors such as product performance, market trends, pricing strategies, and customer behaviour. The conclusions drawn depend on the specific insights from the analysis, but in general they should emphasize areas for improvement, successful strategies, and potential growth opportunities.
