Technologyname Phase2
Submitted by:
TEAM MEMBER
S. B. MEENAKSHI
INTRODUCTION
In today's competitive business landscape, data analysis and visualization have become essential
tools for making informed decisions and gaining a competitive edge. One crucial aspect of business
operations is understanding product sales. Analyzing product sales data allows businesses to identify
trends, make strategic decisions, optimize inventory, and ultimately improve profitability.
Product sale analysis visualization is the practice of using visual representations of data to gain
insights into product sales performance. It involves transforming raw sales data into meaningful
charts, graphs, dashboards, and reports that can be easily understood by stakeholders. These
visualizations provide a clear and concise overview of key metrics and help in uncovering valuable
insights that might not be apparent through raw data alone.
DATA SOURCE :
In [1]:
# 2) Load the data file into a new Pandas data frame called `product_sales_df`
# by calling the `read_csv` function in the Pandas library.
import pandas as pd

product_sales_df = pd.read_csv('/kaggle/input/product-sales-data/statsfinal.csv')
In [3]:
# 3) Display the size of this data frame using the `shape` data member of `product_sales_df` data frame
product_sales_df.shape
Out[3]:
(4600, 10)
In [4]:
# 3') Display the column names using the `columns` data member
product_sales_df.columns
Out[4]:
Index(['Unnamed: 0', 'Date', 'Q-P1', 'Q-P2', 'Q-P3', 'Q-P4', 'S-P1', 'S-P2',
'S-P3', 'S-P4'],
dtype='object')
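The `Unnamed: 0` column is simply a duplicate of the row index left over from saving the CSV, and it can be dropped before analysis. A minimal sketch on a toy frame that mirrors the real column names (the sample values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy frame mirroring the real columns (values are illustrative only)
df = pd.DataFrame({
    'Unnamed: 0': [0, 1],
    'Date': ['13-06-2010', '14-06-2010'],
    'Q-P1': [5422, 7047],
    'S-P1': [17187.74, 22338.99],
})

# Drop the redundant row-index column
df = df.drop(columns=['Unnamed: 0'])
print(list(df.columns))  # → ['Date', 'Q-P1', 'S-P1']
```

The same call applied to `product_sales_df` would leave only the `Date`, quantity (`Q-Pn`), and sales (`S-Pn`) columns.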
In [5]:
# 4) Display the first 10 rows of this data frame by calling the `head` method with `product_sales_df`
product_sales_df.head(10)
Out[5]:
   Unnamed: 0        Date  Q-P1  Q-P2  Q-P3  Q-P4      S-P1      S-P2      S-P3      S-P4
0           0  13-06-2010  5422  3725   576   907  17187.74  23616.50   3121.92   6466.91
1           1  14-06-2010  7047   779  3578  1574  22338.99   4938.86  19392.76  11222.62
2           2  15-06-2010  1572  2082   595  1145   4983.24  13199.88   3224.90   8163.85
3           3  16-06-2010  5657  2399  3140  1672  17932.69  15209.66  17018.80  11921.36
4           4  17-06-2010  3668  3207  2184   708  11627.56  20332.38  11837.28   5048.04
5           5  18-06-2010  2898  2539   311  1513   9186.66  16097.26   1685.62  10787.69
6           6  19-06-2010  6912  1470  1576  1608  21911.04   9319.80   8541.92  11465.04
7           7  20-06-2010  5209  2550  3415   842  16512.53  16167.00  18509.30   6003.46
8           8  21-06-2010  6322   852  3646  1377  20040.74   5401.68  19761.32   9818.01
9           9  22-06-2010  6865   414  3902   562  21762.05   2624.76  21148.84   4007.06
o What is the TRUE DATA TYPE of each variable: categorical (i.e. each value represents a
category or class of things), discrete numeric (i.e. it represents some countable quantity),
continuous numeric, or none of these? Be careful: do not confuse this with the data type of
the columns; you really need to read the descriptions of the variables and understand what
they represent.
o Which of these variables are totally irrelevant to the main objective of predicting product
price?
o How many missing values are there, and in which columns? Call the isna().sum()
method with product_sales_df.
In [6]:
# 5) Display the data types of the data frame columns using the `dtypes` data member of `product_sales_df`
product_sales_df.dtypes
Out[6]:
Unnamed: 0 int64
Date object
Q-P1 int64
Q-P2 int64
Q-P3 int64
Q-P4 int64
S-P1 float64
S-P2 float64
S-P3 float64
S-P4 float64
dtype: object
In [7]:
product_sales_df.isna().sum()
Out[7]:
Unnamed: 0 0
Date 0
Q-P1 0
Q-P2 0
Q-P3 0
Q-P4 0
S-P1 0
S-P2 0
S-P3 0
S-P4 0
dtype: int64
o Create a new data frame called continuous_data_df that contains only continuous numeric
columns.
o Display a statistical summary of these columns by calling the describe method with the
continuous_data_df data frame.
o Visualize the bi-variate distributions of the continuous numeric columns by calling the pairplot
function from the seaborn library with the continuous_data_df data frame.
o Visually inspect the previous plot; based on it, which variables have a strong linear
relationship and which have a strong non-linear relationship with the price of the product?
o Calculate the correlation matrix of these columns by calling the corr method with
continuous_data_df, then plot this matrix by calling the sns.heatmap function from the
seaborn library.
o Based on the previous question, which pairs of variables are strongly linearly correlated
(i.e. have a correlation coefficient above 0.9 or below -0.9)?
In [8]:
#1--Creating continuous_data_df
continuous_columns = ['S-P1', 'S-P2', 'S-P3', 'S-P4']
continuous_data_df = product_sales_df[continuous_columns]
continuous_data_df.head(5)
Out[8]:
In [9]:
Out[9]:
In [10]:
Out[10]:
<seaborn.axisgrid.PairGrid at 0x78edb5a7ceb0>
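The code for cells In [9] and In [10] was lost in extraction; following the instructions above, they presumably called `describe` and `sns.pairplot` on `continuous_data_df`. A minimal sketch on toy data (the sample values are illustrative, not from the dataset):

```python
import pandas as pd

# Toy continuous frame (values illustrative; the real one holds S-P1..S-P4)
continuous_data_df = pd.DataFrame({
    'S-P1': [17187.74, 22338.99, 4983.24],
    'S-P2': [23616.50, 4938.86, 13199.88],
})

# Statistical summary: count, mean, std, min, quartiles, max per column
summary = continuous_data_df.describe()
print(summary)

# Bi-variate distributions (requires seaborn):
#   import seaborn as sns
#   sns.pairplot(continuous_data_df)
```

On the real data, the pairplot's off-diagonal panels show the scatter of each pair of continuous columns, which is what the linear/non-linear relationship question refers to.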
There is a strong linear correlation between each Q-Pn variable and its corresponding S-Pn variable.
In [11]:
corr = product_sales_df[['Q-P1', 'Q-P2', 'Q-P3', 'Q-P4',
                         'S-P1', 'S-P2', 'S-P3', 'S-P4']].corr()
sns.heatmap(data=corr, annot=True)
Out[11]:
<Axes: >
The result of the previous question is verified.
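Rather than reading the heatmap by eye, the strongly correlated pairs can be extracted from the correlation matrix programmatically. A minimal sketch on toy data (the values are illustrative; `S-P1` is constructed to track `Q-P1` so that exactly one pair exceeds the 0.9 threshold):

```python
import pandas as pd

# Toy frame with one strongly correlated pair (values illustrative)
df = pd.DataFrame({
    'Q-P1': [5422, 7047, 1572, 5657],
    'S-P1': [17187.74, 22338.99, 4983.24, 17932.69],  # = Q-P1 * 3.17
    'S-P2': [23616.50, 4938.86, 13199.88, 15209.66],
})

corr = df.corr()

# Collect column pairs whose |correlation| exceeds 0.9, skipping self-pairs
strong_pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > 0.9
]
print(strong_pairs)  # → [('Q-P1', 'S-P1')]
```

Applied to the real correlation matrix, this list should contain each (Q-Pn, S-Pn) pair, matching the visual inspection above.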
o For each discrete variable, display its distribution of values by calling the value_counts method
inside a for loop.
o Plot the distribution of each quantity column as a bar plot, by calling value_counts().plot.bar().
o Based on the previous plot, what is the most frequent quantity sold?
o Plot the distribution of sales by a chosen variable by calling the sns.boxplot function.
o Based on the previous plot, what can you say about the relationship between sales and the
chosen variable?
In [12]:
discrete_columns = ['Q-P1', 'Q-P2', 'Q-P3', 'Q-P4']
for column in discrete_columns:
    print(product_sales_df[column].value_counts())
    print('\n')
Q-P1
6072 6
2357 5
5868 5
2269 5
3807 5
..
7061 1
4890 1
3580 1
1550 1
1234 1
Name: count, Length: 3488, dtype: int64
Q-P2
445 7
1109 6
3967 6
2656 6
3970 6
..
1559 1
3795 1
1879 1
2199 1
3143 1
Name: count, Length: 2667, dtype: int64
Q-P3
502 6
5210 6
4623 5
359 5
4276 5
..
1509 1
5588 1
4754 1
5668 1
5899 1
Name: count, Length: 3166, dtype: int64
Q-P4
637 9
489 9
1419 8
1568 8
934 8
..
1114 1
1884 1
630 1
289 1
1112 1
Name: count, Length: 1631, dtype: int64
In [13]:
import matplotlib.pyplot as plt

discrete_columns = ['Q-P1', 'Q-P2', 'Q-P3', 'Q-P4']
for column in discrete_columns:
    product_sales_df[column].value_counts().plot.bar()
    plt.show()
Based on the plot, the most frequent value for Q-P1 is 6072.
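Reading the most frequent value off a bar plot is error-prone; it can also be obtained directly, since `value_counts` sorts by frequency and `idxmax` returns the value with the highest count. A minimal sketch on a toy series (the values are illustrative):

```python
import pandas as pd

# Toy quantity column (values illustrative only)
qp1 = pd.Series([6072, 6072, 2357, 5868, 6072], name='Q-P1')

# The most frequent value is the index with the highest count
most_frequent = qp1.value_counts().idxmax()
print(most_frequent)  # → 6072
```

The same one-liner applied to `product_sales_df['Q-P1']` reproduces the answer read from the bar plot.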
In [14]:
# Trend plot for each product quantity over the years, using a 100-day moving average
import matplotlib.pyplot as plt

df = product_sales_df.copy()
df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y', errors='coerce')
df = df.dropna(subset=['Date'])
df.set_index('Date', inplace=True)
for product in ['Q-P1', 'Q-P2', 'Q-P3', 'Q-P4']:
    moving_average = df[product].rolling(window=100).mean()
    plt.figure(figsize=(12, 6))
    plt.plot(df.index, moving_average, label=f'{product} - 100-day moving average')
    plt.xlabel('Date')
    plt.ylabel('Unit Sales')
    plt.title(f'Unit Sales Over Time for {product}')
    plt.grid(True)
    plt.legend()
    plt.show()
# Trend plot for each product quantity during the year 2020, using a 100-day moving average
df = product_sales_df.copy()
df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y', errors='coerce')
df = df.dropna(subset=['Date']).set_index('Date')
df_2020 = df[df.index.year == 2020]
for product in ['Q-P1', 'Q-P2', 'Q-P3', 'Q-P4']:
    df_2020[product].rolling(window=100).mean().plot(label=product)
plt.xlabel('Date')
plt.ylabel('Unit Sales')
plt.legend()
plt.show()
Unit Sales 2011 - 2022
• P1 has the highest unit sales in each year, peaking in 2014.
• P4 has the lowest unit sales of all the products.
Revenues 2011 - 2022
• P3 brought in the most revenue. This could be because P3 was sold at a higher price than the
rest, as it had only the second highest unit sales each year.
• P1 and P2 brought in similar revenues each year, with P2 bringing in slightly more.
• P1, despite having the most units sold, brought in the second lowest revenue each year.
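The yearly comparisons above come down to a groupby on the year of the `Date` column. A minimal sketch on toy data (the sample rows are made up, and only the P1 columns are shown):

```python
import pandas as pd

# Toy sales rows spanning two years (values illustrative only)
df = pd.DataFrame({
    'Date': pd.to_datetime(['13-06-2010', '14-06-2011', '15-06-2011'],
                           format='%d-%m-%Y'),
    'Q-P1': [5422, 7047, 1572],
    'S-P1': [17187.74, 22338.99, 4983.24],
})

# Total units (Q-Pn) and revenue (S-Pn) per year
yearly = df.groupby(df['Date'].dt.year)[['Q-P1', 'S-P1']].sum()
print(yearly)
```

On the real data frame, summing all four Q-Pn and S-Pn columns per year and plotting the result yields the unit-sales and revenue comparisons described above.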
Average Monthly Sales 2011 - 2022
• All products' unit sales drop in February.
• February and December have the lowest sales for each product.
• For P1, March through July have the highest unit sales.
• For P2, January and March through August have the highest unit sales.
• For P3, May and September have the highest unit sales.
• For P4, sales are uniform from January through December.
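The monthly pattern above can be computed by averaging each quantity column per calendar month across all years. A minimal sketch on toy data (the sample rows are made up):

```python
import pandas as pd

# Toy rows: two February sales and one May sale (values illustrative only)
df = pd.DataFrame({
    'Date': pd.to_datetime(['13-02-2010', '20-02-2010', '15-05-2010'],
                           format='%d-%m-%Y'),
    'Q-P1': [100, 300, 500],
})

# Average units sold per calendar month, pooled across all years
monthly_avg = df.groupby(df['Date'].dt.month)['Q-P1'].mean()
print(monthly_avg)
```

Applying the same groupby to all four Q-Pn columns of the real data and plotting the result gives the per-month comparison behind the observations above.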
CONCLUSION
In product sales analysis, it is essential to consider factors such as product performance, market
trends, pricing strategies, and customer behaviour. The conclusions drawn depend on the
specific insights produced by the analysis, but in general they should highlight areas for
improvement, successful strategies, and potential growth opportunities.