Walmart (Project)

12/23/23, 12:21 PM walmart(Project) - Jupyter Notebook
# Capstone Project - Walmart #

Table of Contents
Problem Statement
Project Objective
Data Description
Data Pre-processing Steps and Inspiration
Choosing the Algorithm for the Project
Motivation and Reasons For Choosing the Algorithm
Assumptions
Model Evaluation and Techniques
Inferences from the Same
Future Possibilities of the Project
Conclusion
References
# Problem Statement
A retail store that has multiple outlets across the country is facing issues in managing the inventory - to match the demand w
# Data Description
Data description, various insights from the data.
The Walmart DataSet.csv contains 6435 rows and 8 columns.
You are provided with the weekly sales data for their various outlets. Use statistical analysis, EDA, outlier analysis, and han
insights that can give them a clear perspective on the following:
If the weekly sales are affected by the unemployment rate, if yes - which stores are suffering the most?
If the weekly sales show a seasonal trend, when and what could be the reason?
Does temperature affect the weekly sales in any manner?
How is the Consumer Price index affecting the weekly sales of various stores?
Top performing stores according to the historical data.
The worst performing store, and how significant is the difference between the highest and lowest performing stores.
2. Use predictive modeling techniques to forecast the sales for each store for the next 12 weeks.
# Data Preprocessing Steps And Inspiration

The preprocessing of the data included the following steps:
First, import the libraries.
Second, read the data file Walmart DataSet.csv.
The dataset contains 45 different stores.
Select any store ID from 1 to 45.
Check the data information and shape, duplicates, null values, etc.
In [1]: import pandas as pd

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
In [2]: data = pd.read_csv('Walmart DataSet.csv')
data.set_index('Date', inplace=True)

# There are 45 different stores in this dataset. Select any store id from 1-45.
a = int(input("Enter the store id:"))
store = data[data.Store == a]
sales = pd.DataFrame(store.Weekly_Sales.groupby(store.index).sum())
sales.dtypes
Enter the store id:11
Out[2]: Weekly_Sales float64

dtype: object
In [3]: sales.head(30)
Out[3]:
Weekly_Sales
Date
01-04-2011 1258674.12
01-06-2012 1361595.33
01-07-2011 1297472.06
01-10-2010 1182490.46
02-03-2012 1438383.44
02-04-2010 1446210.26
02-07-2010 1302600.14
02-09-2011 1297792.41
02-12-2011 1399322.44
03-02-2012 1376732.18
03-06-2011 1343637.00
03-08-2012 1399341.07
03-09-2010 1303914.27
03-12-2010 1380522.64
04-02-2011 1422546.05
04-03-2011 1399456.99
04-05-2012 1370251.22
04-06-2010 1396322.19
04-11-2011 1458287.38
05-02-2010 1528008.64
05-03-2010 1426622.65
05-08-2011 1403198.94
05-10-2012 1422794.26
05-11-2010 1332759.13
06-01-2012 1283885.55
06-04-2012 1596325.01
06-05-2011 1331453.41
06-07-2012 1461129.94
06-08-2010 1369634.92
07-01-2011 1178905.44
In [4]: data.info()
print(data.shape)
<class 'pandas.core.frame.DataFrame'>
Index: 6435 entries, 05-02-2010 to 26-10-2012
Data columns (total 7 columns):
# Column Non-Null Count Dtype
0 Store 6435 non-null int64

1 Weekly_Sales 6435 non-null float64
2 Holiday_Flag 6435 non-null int64
3 Temperature 6435 non-null float64
4 Fuel_Price 6435 non-null float64
5 CPI 6435 non-null float64
6 Unemployment 6435 non-null float64
dtypes: float64(5), int64(2)
memory usage: 402.2+ KB
(6435, 7)
In [5]: data.head()
Out[5]:
Store Weekly_Sales Holiday_Flag Temperature Fuel_Price CPI Unemployment
Date
05-02-2010 1 1643690.90 0 42.31 2.572 211.096358 8.106
12-02-2010 1 1641957.44 1 38.51 2.548 211.242170 8.106
19-02-2010 1 1611968.17 0 39.93 2.514 211.289143 8.106
26-02-2010 1 1409727.59 0 46.63 2.561 211.319643 8.106
05-03-2010 1 1554806.68 0 46.50 2.625 211.350143 8.106
In [6]: data.tail()
Out[6]:
Store Weekly_Sales Holiday_Flag Temperature Fuel_Price CPI Unemployment
Date
28-09-2012 45 713173.95 0 64.88 3.997 192.013558 8.684
05-10-2012 45 733455.07 0 64.89 3.985 192.170412 8.667
12-10-2012 45 734464.36 0 54.47 4.000 192.327265 8.667
19-10-2012 45 718125.53 0 56.47 3.969 192.330854 8.667
26-10-2012 45 760281.43 0 58.85 3.882 192.308899 8.667
In [7]: print(data.isnull().sum())
Store 0
Weekly_Sales 0
Holiday_Flag 0
Temperature 0
Fuel_Price 0
CPI 0
Unemployment 0
dtype: int64
In [8]: print(data.duplicated().sum())
In [9]: data['Store'].count()
Out[9]: 6435
# The objective of this project is to find how to increase sales day by day & red
Analyze sales trends: By analyzing the weekly sales data for each store, we can identify the trends and patterns in sales over
effectively.
In [10]: # Total weekly sales from all stores

data['Weekly_Sales'].sum()
Out[10]: 6737218987.11
In [11]: # Move the date out of the index so its dtype can be changed (it is stored as a string).
sales.reset_index(inplace = True)
# Convert the 'Date' column to a datetime type (dates are DD-MM-YYYY; no format is given, hence the warning below).
sales['Date'] = pd.to_datetime(sales['Date'])
# Set the date back as the index.
sales.set_index('Date',inplace = True)
C:\Users\HP\AppData\Local\Temp\ipykernel_6312\1042104552.py:4: UserWarning: Parsing dates in DD/MM/YYYY format when
dayfirst=False (the default) was specified. This may lead to inconsistently parsed dates! Specify a format to
ensure consistent parsing.
sales['Date'] = pd.to_datetime(sales['Date'])
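The warning above arises because the dates in the CSV are written day first (for example, 05-02-2010 is 5 February 2010). A minimal sketch of parsing them with an explicit format follows. It uses a few date strings taken from the outputs above, and the variable names are illustrative rather than part of the original notebook.

```python
import pandas as pd

# A few date strings as they appear in the dataset (DD-MM-YYYY).
raw_dates = pd.Series(['05-02-2010', '12-02-2010', '01-04-2011', '26-10-2012'])

# An explicit format removes the ambiguity between day and month.
parsed = pd.to_datetime(raw_dates, format='%d-%m-%Y')

# Sorting gives a monotonic sequence, which time series models expect.
parsed_sorted = parsed.sort_values().reset_index(drop=True)
print(parsed_sorted)
```

Without a format, a string such as 01-04-2011 may be read as 4 January instead of 1 April, which is a plausible reason why dates later than 26-10-2012 show up in the forecasting outputs of this notebook.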
In [12]: sales.Weekly_Sales.plot(figsize=(10,6), title= 'Weekly Sales of a Store', fontsize=14, color = 'blue')
plt.show()
In [13]: from statsmodels.tsa.seasonal import seasonal_decompose

decomposition = seasonal_decompose(sales.Weekly_Sales, period=12)
fig = plt.figure()
fig = decomposition.plot()
fig.set_size_inches(12, 10)
plt.show()
<Figure size 640x480 with 0 Axes>
Understand the impact of holidays: The Holiday_Flag column in the dataset indicates whether a given week is a holiday week or not. Analyzing the sales data
for holiday weeks vs. non-holiday weeks can help stores to understand the impact of holidays on their sales and plan accordingly.
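The holiday comparison described above can be sketched with a group-by on Holiday_Flag. The five rows below are copied from the data.head() output for store 1, and the name `sample` is illustrative. This is a minimal sketch, not the notebook's own analysis.

```python
import pandas as pd

# First five rows of store 1, copied from the data.head() output.
sample = pd.DataFrame({
    'Weekly_Sales': [1643690.90, 1641957.44, 1611968.17, 1409727.59, 1554806.68],
    'Holiday_Flag': [0, 1, 0, 0, 0],
})

# Mean weekly sales for non-holiday (0) and holiday (1) weeks.
holiday_means = sample.groupby('Holiday_Flag')['Weekly_Sales'].mean()
print(holiday_means)
```

On the full dataset the same call would be `data.groupby('Holiday_Flag')['Weekly_Sales'].mean()`; five rows are far too few to draw a conclusion from.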
Choosing the Algorithm For the Project-
To analyze sales trends using the weekly sales data for a selected store, we can follow these steps:
Load the Walmart dataset int
Convert the 'Date' column to a datetime format.
Group the data by store and date, and calculate the total sales for each week.
Pivot the data to create a table with stores as columns and weekly sales as rows.
Plot the trend of sales for the selected store.
Plot the distribution of sales for the selected store.
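The group-and-pivot steps listed above can be sketched as follows. The store 1 values come from the data.head() output; the store 2 values are made up for illustration, and the names `df`, `weekly` and `table` are assumptions rather than names used in the notebook.

```python
import pandas as pd

# Store 1 values are from data.head(); store 2 values are hypothetical.
df = pd.DataFrame({
    'Store': [1, 1, 2, 2],
    'Date': ['05-02-2010', '12-02-2010', '05-02-2010', '12-02-2010'],
    'Weekly_Sales': [1643690.90, 1641957.44, 2000000.00, 2100000.00],
})
df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y')

# Total sales per store and week.
weekly = df.groupby(['Date', 'Store'], as_index=False)['Weekly_Sales'].sum()

# One row per week, one column per store.
table = weekly.pivot(index='Date', columns='Store', values='Weekly_Sales')
print(table)
```

A single store's trend could then be plotted with `table[1].plot()` and its distribution with `table[1].plot(kind='hist')`.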
In [14]: # Let's compare the 2012 data of two stores.
# Take the store 5 data for analysis.
store5 = data[data.Store == 5]
# There are 45 different stores in this dataset.
sales5 = pd.DataFrame(store5.Weekly_Sales.groupby(store5.index).sum())
sales5.dtypes

# Weekly sales grouped by date for store 5.
# Move the date out of the index so its dtype can be changed (it is stored as a string).
sales5.reset_index(inplace = True)
# Convert the 'Date' column to a datetime type.

sales5['Date'] = pd.to_datetime(sales5['Date'])
# Set the date back as the index.
sales5.set_index('Date',inplace = True)
C:\Users\HP\AppData\Local\Temp\ipykernel_6312\430818428.py:14: UserWarning: Parsing dates in DD/MM/YYYY format when
dayfirst=False (the default) was specified. This may lead to inconsistently parsed dates! Specify a format to
ensure consistent parsing.
sales5['Date'] = pd.to_datetime(sales5['Date'])
In [15]: y1 = sales.Weekly_Sales
y2 = sales5.Weekly_Sales
In [16]: #y1['2012'].plot(figsize=(15, 6),legend=True, color = 'chocolate') #y2['2012'].plot(figsize=(15, 6), legend=True, color = 'turq
y2['2012'].plot(figsize=(15, 6), legend=True, color = 'Orange')
plt.ylabel('Weekly Sales')
plt.title('Store4 vs Store5 on 2012', fontsize = '16')
plt.show()
# Choosing the Algorithm For the Project & Identify the impact of external facto
The Temperature, Fuel_Price, CPI, and Unemployment columns in the dataset provide information about external factors that may i
Analyzing the relationship between these factors and sales can help stores to better understand their customer base and adjust
To identify the impact of external factors on sales using the Walmart dataset, we can follow these steps:
Convert the 'Date' column to a datetime format.
Plot the correlation matrix of the dataset to visualize the relationships between variables.
Create scatter plots of the external factors against weekly sales to visualize the relationship between each factor and sales
Calculate the correlation coefficients between each external factor and weekly sales to quantify the strength of the relation
Create a multiple regression model to analyze the impact of multiple external factors on weekly sales.
In [ ]: # 'Date' was set as the index in In [2], so convert the index rather than a column.
data.index = pd.to_datetime(data.index, format='%d-%m-%Y')
In [51]: # Plot the correlation matrix of the dataset

corr = data.corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
In [56]: # Create scatter plots of external factors against weekly sales

sns.pairplot(data[['Weekly_Sales', 'Temperature', 'Fuel_Price', 'CPI', 'Unemployment']])
plt.show()
In [58]: # Calculate the correlation coefficients between each external factor and weekly sales
corr_sales = data[['Weekly_Sales', 'Temperature', 'Fuel_Price', 'CPI', 'Unemployment']].corr()['Weekly_Sales']
print(corr_sales)
Weekly_Sales 1.000000
Temperature -0.063810
Fuel_Price 0.009464
CPI -0.072634
Unemployment -0.106176
Name: Weekly_Sales, dtype: float64
In [59]: from sklearn.linear_model import LinearRegression
In [60]: # Create a multiple regression model to analyze the impact of multiple external factors on weekly sales
X = data[['Temperature', 'Fuel_Price', 'CPI', 'Unemployment']]
y = data['Weekly_Sales']
model = LinearRegression().fit(X, y)
r_sq = model.score(X, y)
coefficients = model.coef_
intercept = model.intercept_
print(f"R-squared: {r_sq}")
print(f"Coefficients: {coefficients}")
print(f"Intercept: {intercept}")
R-squared: 0.024330716534334385
Coefficients: [ -885.66992595 -12248.42446553 -1585.81799199 -41214.98725744]
Intercept: 1743607.6199776107
# Assumptions
The following assumptions were made in order to create the model for Walmart project.
This code generates several plots and prints out the correlation coefficients and regression coefficients:
A plot of the correlation matrix of the dataset. This plot shows the strength and direction of the relationships
between variables.
A pair plot (a grid of pairwise scatter plots) showing the relationship between each external factor and weekly sales.
The correlation coefficients between each external factor and weekly sales. These coefficients quantify the strength
and direction of the relationship.
The regression coefficients of a multiple regression model that analyzes the impact of multiple external factors on
weekly sales. The R-squared value indicates the proportion of variance in weekly sales that can be explained by the
external factors, and the coefficients indicate the strength and direction of the relationship between each factor
and sales.
The multiple regression model that was built to analyze the impact of external factors on weekly sales has an R-squared value
external factors in the model. This means that there are other factors that are not included in the model that also have an imp
The coefficients of the model represent the strength and direction of the relationship between each external factor and weekly
Temperature: -885.67
Fuel_Price: -12,248.42
CPI: -1,585.82
Unemployment: -41,214.99
These coefficients indicate that an increase in temperature, fuel price, CPI, and unemployment is associated with a decrease in
The intercept of the model is 1,743,607.62, which represents the estimated weekly sales when all external factors are at 0. Th
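To put the size of these coefficients in context, each one can be expressed as a share of the mean weekly sales per store. This is a minimal sketch that uses only numbers printed in the notebook outputs above; the variable names are illustrative. The factors are measured in different units (degrees, dollars, index points, percentage points), so the percentages are not directly comparable with each other.

```python
# Values copied from the notebook outputs above.
total_sales = 6737218987.11   # Out[10]
n_rows = 6435                 # Out[9]
coefficients = {
    'Temperature': -885.66992595,
    'Fuel_Price': -12248.42446553,
    'CPI': -1585.81799199,
    'Unemployment': -41214.98725744,
}

mean_weekly_sales = total_sales / n_rows

# Change in predicted weekly sales per one-unit increase, as a share of the mean.
pct_of_mean = {name: 100 * coef / mean_weekly_sales
               for name, coef in coefficients.items()}

for name, pct in pct_of_mean.items():
    print(f"{name}: {pct:.2f}% of mean weekly sales per unit")
```

Given an R-squared of about 0.024, these figures describe a model that explains only around 2.4% of the variance, so they should be read as rough associations.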
In [ ]:
# analyzing Optimize pricing strategies: By analyzing the relationship between
In [61]: # Filter the dataset to only include holiday weeks
data_holiday = data[data['Holiday_Flag'] == 1]
# Create a scatter plot to visualize the relationship between weekly sales and CPI
sns.scatterplot(x='CPI', y='Weekly_Sales', data=data_holiday)
plt.title('Weekly Sales vs. CPI')
plt.xlabel('CPI')
plt.ylabel('Weekly Sales')
plt.show()
# Create a scatter plot to visualize the relationship between weekly sales and Fuel_Price
sns.scatterplot(x='Fuel_Price', y='Weekly_Sales', data=data_holiday)
plt.title('Weekly Sales vs. Fuel_Price')
plt.xlabel('Fuel_Price')
plt.ylabel('Weekly Sales')
plt.show()
# Build a linear regression model to predict weekly sales based on CPI and Fuel_Price
X = data_holiday[['CPI', 'Fuel_Price']]
y = data_holiday['Weekly_Sales']
reg = LinearRegression().fit(X, y)
# Print the coefficients of the linear regression model

print('Coefficients:', reg.coef_)
print('Intercept:', reg.intercept_)
# Use the linear regression model to make predictions for different values of CPI and Fuel_Price
new_data = pd.DataFrame({'CPI': [220, 230, 240], 'Fuel_Price': [3.50, 3.60,
Coefficients: [-1194.64849703 46674.84850851]

Intercept: 1176851.6465749654
Predictions: [1077390.94700861 1070111.94688918 1062832.94676976]
# Model Evaluation and Technique-

The following techniques and steps were involved in the evaluation of the model
Technique 1 - coefficients
Technique 2 - linear regression model and a prediction model for time series, and so on.
The coefficients of the linear regression model represent the change in weekly sales for a one-unit increase in each
predictor variable, while holding all other variables constant.
In this case, the first coefficient (-1194.64849703) represents the change in weekly sales for a one-unit increase in
CPI, while holding Fuel_Price constant. The negative sign indicates a negative association between CPI
and weekly sales - as CPI increases, predicted weekly sales decrease. The magnitude of the coefficient (1194.65)
gives the size of that change per unit of CPI.
The second coefficient (46674.84850851) represents the change in weekly sales for a one-unit increase in Fuel_Price,
while holding CPI constant. The positive sign indicates a positive association between Fuel_Price and
weekly sales - as Fuel_Price increases, predicted weekly sales increase. The magnitude of the coefficient (46674.85)
gives the size of that change per unit of Fuel_Price.
The intercept (1176851.6465749654) represents the predicted weekly sales when both CPI and Fuel_Price are equal to
zero.
The predictions ([1077390.94700861, 1070111.94688918, 1062832.94676976]) are the predicted weekly sales for new data
points with different values of CPI and Fuel_Price, based on the coefficients of the linear regression model. For
example, the first prediction (1077390.94700861) represents the predicted weekly sales for a new data point with a
CPI of 220 and a Fuel_Price of 3.5.
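The predictions can be reproduced by hand from the printed coefficients and intercept. This is a minimal sketch with illustrative variable names. The third Fuel_Price value is cut off in the notebook cell, so 3.70 is an assumption here, chosen because it is consistent with the third printed prediction.

```python
# Coefficients and intercept copied from the notebook output.
coef_cpi = -1194.64849703
coef_fuel = 46674.84850851
intercept = 1176851.6465749654

cpi_values = [220, 230, 240]
# The first two fuel prices are shown in the notebook; 3.70 is an assumption.
fuel_prices = [3.50, 3.60, 3.70]

# prediction = intercept + coef_cpi * CPI + coef_fuel * Fuel_Price
predictions = [intercept + coef_cpi * cpi + coef_fuel * fuel
               for cpi, fuel in zip(cpi_values, fuel_prices)]
print(predictions)
```

Each step raises CPI by 10 and Fuel_Price by 0.10, so the prediction changes by the same amount each time; here the CPI effect outweighs the Fuel_Price effect, and the predictions fall.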
# Manage inventory levels: By analyzing sales trends and understanding the im
In [63]:
# create a linear regression model to predict weekly sales
X = data[['Temperature', 'Fuel_Price', 'CPI', 'Unemployment', 'Holiday_Flag']]
y = data['Weekly_Sales']
model = LinearRegression().fit(X, y)
# use the model to predict future sales

future_data = pd.DataFrame({
    'Temperature': [70, 75, 80],
    'Fuel_Price': [3.5, 3.6, 3.7],
    'CPI': [220, 222, 224],
    'Unemployment': [6.0, 6.2, 6.4],
    'Holiday_Flag': [0, 1, 0]
})
predicted_sales = model.predict(future_data)
# adjust inventory levels based on predicted sales

for i, predicted_sale in enumerate(predicted_sales):
    if predicted_sale > 100000:
        print(f"Order more inventory for week {i+1}")
    elif predicted_sale < 50000:
        print(f"Reduce inventory for week {i+1}")
    else:
        print(f"Inventory levels are appropriate for week {i+1}")
Order more inventory for week 1
Order more inventory for week 2
Order more inventory for week 3
# The evaluation report suggests the following:

Inferences from the evaluation
This code uses a linear regression model to predict future sales based on external factors such as temperature, fuel price, CPI
Based on the predicted sales, the code adjusts the inventory levels. If the predicted sales are high, the code
recommends ordering more inventory. If the predicted sales are low, the code recommends reducing inventory levels. If the pred
# Identify underperforming stores: By comparing the sales data across all store
In [64]:
# calculate total sales for each store
store_sales = data.groupby('Store')['Weekly_Sales'].sum().reset_index()
# calculate average sales per store

avg_sales = store_sales['Weekly_Sales'].mean()
# identify underperforming stores

underperforming_stores = store_sales[store_sales['Weekly_Sales'] < avg_sales]
# print the list of underperforming stores

print("Underperforming stores:")
for store in underperforming_stores['Store']:
    print(store)
Underperforming stores:
3
5
7
8
9
12
15
16
17
21
22
25
26
29
30
33
34
35
36
37
38
40
42
43
44
45
This code first groups the sales data by store and calculates the total sales for each store. It then calculates the average sa
Stores with total sales below the average are identified as underperforming stores. The code prints the list of underperforming
You can adjust the definition of underperforming stores by changing the criteria, for example, you could identify stores that h
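One possible alternative criterion is a quantile threshold in place of the mean. The original sentence is cut off, so this choice is an assumption and not necessarily the criterion the author had in mind. The totals below are hypothetical and serve only to keep the sketch self-contained.

```python
import pandas as pd

# Hypothetical total sales per store, for illustration only.
store_sales = pd.DataFrame({
    'Store': [1, 2, 3, 4, 5],
    'Weekly_Sales': [220.0, 275.0, 57.0, 300.0, 45.0],
})

# Flag stores whose total sales fall below the 30th percentile.
threshold = store_sales['Weekly_Sales'].quantile(0.30)
low_stores = store_sales[store_sales['Weekly_Sales'] < threshold]
print(low_stores['Store'].tolist())
```

With a mean-based rule roughly half of the stores are always flagged, as in the list above; a quantile makes the share of flagged stores an explicit choice.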
# Forecast future sales: By using the historical sales data, we can develop pre
Inferences from the Walmart Project

The model performance, inferences, and forecasts of future sales: this can help stores to better plan for future sales and other details
In [33]: # Clearly we can see the irregularities
In [34]: # Define the p, d and q parameters to take any value between 0 and 4
import itertools
p = d = q = range(0, 5)
# Generate all different combinations of p, d and q triplets
pdq = list(itertools.product(p, d, q))
# Generate all different combinations of seasonal p, d and q triplets

seasonal_pdq = [(x[0], x[1], x[2], 52) for x in list(itertools.product(p, d, q))]
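The notebook builds these parameter grids but does not show how an order was chosen from them. A common approach is to fit one model per combination and keep the one with the lowest score, such as AIC. The sketch below shows that loop in a generic form: `grid_search`, `fit_and_score` and `toy_score` are hypothetical names, and the toy scorer stands in for a real model fit so that the example runs without statsmodels.

```python
import itertools

def grid_search(orders, seasonal_orders, fit_and_score):
    """Return (score, order, seasonal_order) with the lowest score.

    fit_and_score is any callable taking (order, seasonal_order) and
    returning a number; combinations that raise an error are skipped.
    """
    best = None
    for order in orders:
        for seasonal in seasonal_orders:
            try:
                score = fit_and_score(order, seasonal)
            except Exception:
                continue
            if best is None or score < best[0]:
                best = (score, order, seasonal)
    return best

# A smaller range than in the notebook, to keep the grid short.
p = d = q = range(0, 2)
pdq = list(itertools.product(p, d, q))
seasonal_pdq = [(x[0], x[1], x[2], 52) for x in pdq]

def toy_score(order, seasonal):
    # Stand-in scorer: penalises the number of parameters only.
    return sum(order) + sum(seasonal[:3])

best = grid_search(pdq, seasonal_pdq, toy_score)
print(best)
```

With statsmodels, the scorer would fit `sm.tsa.statespace.SARIMAX(y1, order=order, seasonal_order=seasonal)` and return the fitted result's `aic`. With `range(0, 5)` the grid has 125 non-seasonal and 125 seasonal combinations, so an exhaustive search means 15,625 fits.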
In [35]: import statsmodels.api as sm
mod = sm.tsa.statespace.SARIMAX(y1,
                                order=(4, 4, 3),
                                seasonal_order=(1, 1, 0, 52),  # enforce_stationarity=False,
                                enforce_invertibility=False)
results = mod.fit()
print(results.summary().tables[1])
C:\Users\HP\anaconda3\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting.
self._init_dates(dates, freq)
[...] A date index has been provided, but it is not monotonic and so will be ignored when e.g. forecasting.
[...] A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting.
[...] A date index has been provided, but it is not monotonic and so will be ignored when e.g. forecasting.
C:\Users\HP\anaconda3\lib\site-packages\statsmodels\base\model.py:604: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
warnings.warn("Maximum Likelihood optimization failed to "
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
ar.L1         -0.9229      0.269     -3.431      0.001      -1.450      -0.396
ar.L2         -0.7813      0.359     -2.174      0.030      -1.486      -0.077
ar.L3         -0.7113      0.352     -2.020      0.043      -1.402      -0.021
ar.L4         -0.5688      0.171     -3.324      0.001      -0.904      -0.233
ma.L1         -2.4727      0.321     -7.696      0.000      -3.102      -1.843
ma.L2          1.9628      0.635      3.089      0.002       0.717       3.208
ma.L3         -0.4882      0.320     -1.524      0.127      -1.116       0.140
ar.S.L52      -0.4898      0.129     -3.803      0.000      -0.742      -0.237
sigma2      1.174e+11   2.26e-12   5.18e+22      0.000    1.17e+11    1.17e+11
==============================================================================
In [49]: plt.style.use('seaborn-pastel')
results.plot_diagnostics(figsize=(15, 12))
plt.show()
C:\Users\HP\AppData\Local\Temp\ipykernel_6312\3809637464.py:1: MatplotlibDeprecationWarning: The seaborn styles
shipped by Matplotlib are deprecated since 3.6, as they no longer correspond to the styles shipped by seaborn.
However, they will remain available as 'seaborn-v0_8-<style>'. Alternatively, directly use the seaborn API instead.
plt.style.use('seaborn-pastel')
In [50]: pred = results.get_prediction(start=pd.to_datetime('2012-07-27'), dynamic=False)
pred_ci = pred.conf_int()
In [38]: ax = y1['2010':].plot(label='observed')
pred.predicted_mean.plot(ax=ax, label='One-step ahead Forecast', alpha=.7)
ax.fill_between(pred_ci.index,
                pred_ci.iloc[:, 0],
                pred_ci.iloc[:, 1], color='k', alpha=.2)
ax.set_xlabel('Time Period')
ax.set_ylabel('Sales')

plt.legend()
plt.show()
C:\Users\HP\AppData\Local\Temp\ipykernel_6312\2689329424.py:1: FutureWarning: Value based partial slicing on non-monotonic
DatetimeIndexes with non-existing keys is deprecated and will raise a KeyError in a future Version.
ax = y1['2010':].plot(label='observed')
In [39]: y_forecasted = pred.predicted_mean
y_truth = y1['2012-7-27':]
# Compute the mean square error

mse = ((y_forecasted - y_truth) ** 2).mean()
print('The Mean Squared Error of our forecasts is {}'.format(round(mse, 2)))
The Mean Squared Error of our forecasts is 31777933846.48
In [40]: pred_dynamic = results.get_prediction(start=pd.to_datetime('2012-7-27'), dynamic=True, full_results=True)
pred_dynamic_ci = pre
In [41]:
ax = y1['2010':].plot(label='observed', figsize=(12, 8))

pred_dynamic.predicted_mean.plot(label='Dynamic Forecast', ax=ax)
ax.fill_between(pred_dynamic_ci.index,
                pred_dynamic_ci.iloc[:, 0],
                pred_dynamic_ci.iloc[:, 1], color='k', alpha=.25)
ax.fill_betweenx(ax.get_ylim(), pd.to_datetime('2012-7-26'), y1.index[-1], alpha=.1, zorder=-1)
ax.set_xlabel('Time Period')
ax.set_ylabel('Sales')
plt.legend()
plt.show()
C:\Users\HP\AppData\Local\Temp\ipykernel_6312\4127785946.py:1: FutureWarning: Value based partial slicing on non-monotonic
DatetimeIndexes with non-existing keys is deprecated and will raise a KeyError in a future Version.
ax = y1['2010':].plot(label='observed', figsize=(12, 8))
In [42]: import numpy as np

# Extract the predicted and true values of our time series
y_forecasted = pred_dynamic.predicted_mean
print(y_forecasted)
Date
2012-07-27 1.470892e+06
2010-08-27 1.462495e+06
2011-01-28 1.020830e+06
2010-05-28 1.287399e+06
2012-09-28 1.019062e+06
2011-10-28 1.394298e+06
2011-04-29 1.234777e+06
2012-06-29 8.901318e+05
2011-07-29 9.546755e+05
2010-10-29 8.946962e+05
2012-03-30 8.830751e+05
2010-04-30 9.490994e+05
2010-07-30 8.111185e+05
2011-09-30 8.153588e+05
2011-12-30 6.083340e+05
2012-08-31 6.582176e+05
2010-12-31 5.924099e+05
Name: predicted_mean, dtype: float64
In [43]: y_truth = y1['2012-7-27':]
print(y_truth)
Date
2012-08-06 1414343.53
2012-09-03 1413382.76
2012-10-02 1574287.76
2012-10-08 1388973.65
2012-11-05 1300147.07
2012-12-10 1311965.09
2012-09-14 1267675.05
2012-08-17 1421307.20
2012-10-19 1232073.18
2012-09-21 1326132.98
2012-08-24 1409515.73
2012-10-26 1200729.45
2012-07-27 1272395.02
2012-09-28 1227430.73
2012-08-31 1372872.35
Name: Weekly_Sales, dtype: float64
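The forecast and the observed series printed above do not share the same dates. When two pandas Series are subtracted, values are matched by index label, and labels present in only one of them produce NaN, which `mean()` then skips. The sketch below uses made-up numbers and illustrative names to show how that affects an error metric.

```python
import numpy as np
import pandas as pd

# Made-up values; only the first two dates appear in both series.
forecast = pd.Series(
    [10.0, 20.0, 30.0],
    index=pd.to_datetime(['2012-07-27', '2012-08-03', '2012-08-10']))
truth = pd.Series(
    [12.0, 18.0, 33.0],
    index=pd.to_datetime(['2012-07-27', '2012-08-03', '2012-08-17']))

# Subtraction aligns on the index; non-matching dates become NaN.
errors = forecast - truth
n_compared = int(errors.notna().sum())

# mean() ignores NaN, so only the matched dates contribute.
mse = (errors ** 2).mean()
rmse = float(np.sqrt(mse))
print(n_compared, mse, rmse)
```

Checking how many dates were actually compared is a quick sanity test before reading an MSE or RMSE value.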
# Future Possibilities-
The future possibilities, limitations and other-
This code first groups the sales data by store and date and calculates the total sales for each store on each date.
It then converts the date column to a datetime format and splits the data into training and testing sets, with the
training data being all dates before January 1, 2012 and the testing data being all dates on or after January 1,
2012.
A linear regression model is created using the store number as the predictor variable and the weekly sales as the
response variable. The model is fit on the training data and used to make predictions on the testing data.
The code then calculates the R-squared value, which measures the goodness of fit of the model to the testing data.
You can adjust the model by using different predictor variables, such as the CPI, fuel price, or unemployment rate,
or by using different models, such as a polynomial regression or a time series model.
Predictions: [1376252.78477092 1376252.78477092 1376252.78477092 ... 729118.82756686 729118.82756686 729118.82756686 ]

R-squared value: 0.1176607873663219
The output is the predictions made by a model to forecast future sales based on historical sales data. The model has predicted
An R-squared value of 0.1176607873663219 means that the model explains 11.77% of the variance in the data, which is relatively
necessary to improve the accuracy of the sales predictions.
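The code for the train/test experiment described in this section is not included in the notebook. The following is a minimal sketch of the described procedure under stated assumptions: the data is synthetic, the names are illustrative, and `np.polyfit` is swapped in for scikit-learn's LinearRegression (for a single predictor both give the same least-squares line).

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Walmart data (hypothetical values).
rng = np.random.default_rng(0)
dates = pd.date_range('2011-10-07', periods=26, freq='W-FRI')
frames = []
for store, base in [(1, 1_500_000), (2, 1_900_000), (3, 400_000)]:
    frames.append(pd.DataFrame({
        'Store': store,
        'Date': dates,
        'Weekly_Sales': base + rng.normal(0, 50_000, len(dates)),
    }))
df = pd.concat(frames, ignore_index=True)

# Train on dates before 1 January 2012, test on the rest.
train = df[df['Date'] < '2012-01-01']
test = df[df['Date'] >= '2012-01-01']

# Least-squares line with the store number as the only predictor.
slope, intercept = np.polyfit(train['Store'].to_numpy(),
                              train['Weekly_Sales'].to_numpy(), deg=1)
pred = intercept + slope * test['Store'].to_numpy()

# R-squared on the test set.
actual = test['Weekly_Sales'].to_numpy()
ss_res = ((actual - pred) ** 2).sum()
ss_tot = ((actual - actual.mean()) ** 2).sum()
r2 = 1 - ss_res / ss_tot
print(np.unique(np.round(pred, 2)), r2)
```

Because the store number is the only input, every week of a given store receives the same prediction, which matches the repeated values in the Predictions output above. A store ID is also a label rather than a quantity, which may be one reason the reported R-squared is low.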
In [ ]:
In [66]: # Compute the Root mean square error

rmse = np.sqrt(((y_forecasted - y_truth) ** 2).mean())
print('The Root Mean Squared Error of our forecasts is {}'.format(round(rmse, 2)))
The Root Mean Squared Error of our forecasts is 444803.25
In [67]: Residual= y_forecasted - y_truth

print("Residual for Store1",np.abs(Residual).sum())
Residual for Store1 1121519.9632811893
In [69]: # Get forecast 12 weeks ahead in future

pred_uc = results.get_forecast(steps=12)
print(pred_uc)
<statsmodels.tsa.statespace.mlemodel.PredictionResultsWrapper object at 0x0000024F3DF5D5A0>
C:\Users\HP\anaconda3\lib\site-packages\statsmodels\tsa\base\tsa_model.py:834: ValueWarning: No supported index is available.
Prediction results will be given with an integer index beginning at `start`.
return get_prediction_index(
In [70]: # Get confidence intervals of forecasts

pred_ci = pred_uc.conf_int()
In [48]:
ax = y1.plot(label='observed', figsize=(12, 8))

pred_uc.predicted_mean.plot(ax=ax, label='Forecast')
ax.fill_between(pred_ci.index,
                pred_ci.iloc[:, 0],
                pred_ci.iloc[:, 1], color='k', alpha=.25)
ax.set_xlabel('Time Period')
ax.set_ylabel('Sales')
plt.legend()
plt.show()
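The ValueWarning printed for the 12-week forecast says the results come back with an integer index, so the forecast does not line up with the dates of the observed series on the plot. One way to attach dates is to build a weekly index that starts one week after the last observation. This is a minimal sketch: the placeholder Series stands in for `pred_uc.predicted_mean`, and the variable names are illustrative.

```python
import pandas as pd

# Last date in the dataset, from the data.tail() output (26-10-2012, a Friday).
last_observed = pd.Timestamp('2012-10-26')

# Twelve weekly dates following the last observation.
forecast_index = pd.date_range(start=last_observed + pd.Timedelta(weeks=1),
                               periods=12, freq='W-FRI')

# Placeholder values standing in for the 12 forecast values.
forecast = pd.Series([0.0] * 12)
forecast.index = forecast_index
print(forecast.index[0].date(), forecast.index[-1].date())
```

Sorting the observed series by date and giving it a weekly frequency before fitting would likely let the model produce dated forecasts directly.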
# Conclusion-
The Walmart project shows a relationship across the stores: sales are affected when fuel prices increase and when temperatures increase - then s
In the future, correlating the Walmart stores with one another could increase sales and decrease costs, whereby income wil
# References-
Some data was downloaded and some content was copied from Google and other resources.

