Chapter 1: Introduction: 1.1 Background Theory
Industry needs data mining techniques and intelligent models that predict sales trends with
the highest possible accuracy and reliability. Sales analysis and forecasting have become
essential because they enable informed business decisions and the prediction of short-term
and long-term performance. Sales forecasting gives insight into how a company should manage
its workforce, cash flow, and resources. One solution is predictive analytics, which relies
on machine learning algorithms to predict future sales from years of past business data.
1. Define project: Define the project outcomes, deliverables, scope of the effort, and
business objectives, and identify the data sets that are going to be used.
2. Data collection: Data mining for predictive analytics prepares data from
multiple sources for analysis. This provides a complete view of customer
interactions.
3. Data analysis: Data analysis is the process of inspecting, cleaning, and
modelling data with the objective of discovering useful information and arriving at
conclusions.
4. Statistics: Statistical analysis makes it possible to validate assumptions and
hypotheses and to test them using standard statistical models.
5. Modelling: Predictive modelling provides the ability to automatically create
accurate predictive models about the future. There are also options to choose
the best solution with multi-modal evaluation.
6. Deployment: Predictive model deployment provides the option to deploy the
analytical results into everyday decision-making processes to get results,
reports and output by automating decisions based on the modelling.
7. Model monitoring: Models are managed and monitored to review the model
performance to ensure that it is providing the results expected.
In the context of supermarkets, machine learning and predictive analysis can be used for
customer-centric, product-centric, and location-centric sales analysis and prediction.
Our project focuses solely on product-centric sales analysis and prediction. We analyse the
sales of product categories, compare them with the previous year’s sales, and predict future
sales for each category and for the items within it. Predicting product sales relies on data
mining, since it involves mining patterns, which can be done by following the steps of the
Knowledge Discovery in Databases (KDD) process. The steps are listed as follows:
The complexity of sales dynamics often forces decision makers to rely on subjective mental
models that reflect their experience. However, research has shown that companies perform
better when they apply data-driven decision making. We have therefore developed a custom
data analytics model that helps the supermarket generate insights about various aspects of
its sales, so that it can make the right decisions and achieve growth.
1.3 OBJECTIVES:
● Review the trend of total sales across all stores and items over time.
● Choose and reconstruct features that have an impact on supermarket sales.
● Support future decisions on inventory management, marketing activities, and
schemes based on sales analysis.
This project is applicable to every supermarket: it helps the supermarket manage its
resources and make future decisions about inventory management, marketing activities,
schemes or offers to be rolled out, and, if applicable, changes in the manufacturing
processes of its products. Sales analysis based on sales data also shows the company the
current market trends.
1.5 APPLICATIONS:
1. High-Quality Lead Generation: With predictive analytics, marketers can gauge the
customer’s propensity to buy with greater accuracy.
2. Targeted Profiling of Customers: It gives marketers a greater understanding of how
customers responded to a marketing activity and why they did or did not make a purchase,
and helps them identify how to convert a prospect into a paying customer.
3. Improved Content Distribution: Predictive analytics analyses the types of content that
resonate most with customers of certain demographic or behavioural backgrounds, and then
automatically distributes similar content to leads with the same demographic or
behavioural habits.
4. Improved Determination of Product Fit: Equipped with historical, sales, and leads
data, businesses can better understand exactly what customers’ needs and wants are,
which is key to developing better future products.
Big data science and analytics have changed the course of market strategies and opened new
paths for company growth and profit. We have entered the digital age in this decade, and
big data analysis is a recent digital technology that accomplishes in real time tasks once
thought infeasible. By the end of 2020, the volume of big data is expected to reach
44 trillion gigabytes [3], breaking previous trends and reshaping the business world.
Data science is the most current “tool” for businesses that want to meet consumer demand.
The beauty of it is that it is not based on what we may “feel” about consumers and what they
want or need. It is based upon actual patterns of behaviours and trends that the facts reveal.
Competition to increase profits is high in the Fast Moving Consumer Goods (FMCG) market.
Accurate sales forecasting is an inexpensive way to reduce lost sales and product returns
and to support efficient product planning. Moreover, accurate forecasts of retail sales
may improve portfolio investors’ ability to predict movements in the stock prices of retailing
chains. Aggregate retail sales time series are usually preferred because they contain both
trend and seasonal patterns, providing a good testing ground for comparing forecasting
methods, and because companies can benefit from more accurate forecasts. Retail sales time
series often exhibit strong trend and seasonal variations, which present challenges in
developing effective forecasting models. Exponential smoothing and Autoregressive Integrated
Moving Average (ARIMA) models are the two most widely used approaches to time series
forecasting, and they provide complementary approaches to the problem. While exponential
smoothing methods are based on a description of trend and seasonality in the data, ARIMA
models aim to describe the autocorrelations in the data. The ARIMA framework for
forecasting, originally developed by Box et al. (G. E. P. Box, G. M. Jenkins, G. C. Reinsel,
G. M. Ljung, 2015, John Wiley & Sons), involves an iterative three-stage process of model
selection, parameter estimation, and model checking.
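To make the contrast concrete, the simplest member of the exponential smoothing family can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `simple_exponential_smoothing`, the weight `alpha`, and the sample figures are assumptions made for the example, and the sketch omits the trend and seasonal components that the fuller methods add.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the one-step-ahead forecast after smoothing `series`.

    Each new level is a weighted average of the latest observation and
    the previous level; `alpha` (0 < alpha <= 1) is the weight given to
    the latest observation.
    """
    if not series:
        raise ValueError("series must contain at least one observation")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]  # initialise the level with the first observation
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level


# Hypothetical monthly sales figures, for illustration only.
monthly_sales = [10.0, 20.0, 30.0]
forecast = simple_exponential_smoothing(monthly_sales, alpha=0.5)
```

A larger `alpha` makes the forecast react faster to recent sales, while a smaller one smooths out short-lived fluctuations.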
1. SalesPRISM:
A customer pattern recognition tool from Lattice, SalesPRISM helps brands collect data
about sales and predict potential sales leads. Every brand holds a lot of data about its
customers. Alongside internal factors such as CRM data, site traffic, and sales history,
SalesPRISM also analyses external data such as LinkedIn activity and LexisNexis reports.
2. Medio Platform:
With big platforms like Amazon and Flipkart, understanding the needs and habits of
customers goes a long way in helping a brand create a comprehensive e-commerce
strategy. Medio Platform helps brands analyse why customers leave a website and take
corrective action.
3. TIBCO Software:
Understanding customer behaviour has always been important to the success of any brand
or organization. The predictive analytics tool in TIBCO software can help brands
understand their data better, thereby enabling them to make smarter business decisions.
4. Lattice:
Another predictive analytics tool, Lattice provides insights into sales that help brands
market their products more effectively. By using the predictive analytics tool provided
by Lattice, brands can find effective ways to convert their leads into sales.
2.3. RESOLUTION:
Forecasting of behavioural time series has benefitted many businesses that aim to predict
future trends by understanding the past.
A key feature of our project is that the user can analyse and compare product sales for the
same period in the current year and the previous year. We also provide year-based
predictive analysis, in which the user can analyse and compare sales by time period.
Another feature of our project is a user-friendly interface: the user should have no
trouble navigating the app, since all the data visualizations are shown on the home page.
CHAPTER 3: METHODOLOGY
development is done in steps: analysis, design, implementation, testing/verification, and
maintenance.
Each iteration passes through the requirements, design, coding, and testing phases, and each
subsequent release of the system adds functionality to the previous release until all
designed functionality has been implemented.
The system is put into production when the first increment is delivered. The first increment
is often a core product that addresses the basic requirements; supplementary features are
added in subsequent increments. Once the client has analysed the core product, a plan is
developed for the next increment.
RELATED THEORY:
The autoregressive integrated moving average (ARIMA) model is applied to time series data
either to better understand the data or to predict future points in the series
(forecasting). The term ARIMA can be decomposed into three parts: AR (Auto Regressive),
I (Integrated), and MA (Moving Average).
Seasonal ARIMA models are usually denoted $\text{ARIMA}(p,d,q)(P,D,Q)_m$, where $m$ refers to
the number of periods in each season, and the uppercase $P$, $D$, $Q$ refer to the
autoregressive, differencing, and moving average terms for the seasonal part of the ARIMA
model.
Auto Regressive:
The AR part of ARIMA indicates that the evolving variable of interest is regressed on its own
lagged (prior) values.
An autoregressive model of order p can be written as
$$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t,$$
where $\varepsilon_t$ is white noise. This is like a multiple regression but with lagged
values of $y_t$ as predictors. We refer to this as an autoregressive model of order p, where
p is the number of lags to be used as predictors. Autoregressive models are remarkably
flexible at handling a wide range of different time series patterns.
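As a minimal sketch of how the autoregressive equation is used, the following Python function computes a one-step prediction from past values. The function name `ar_one_step`, the coefficients, and the sample history are hypothetical and chosen only for illustration; in practice the coefficients are estimated from the data.

```python
def ar_one_step(history, c, phis):
    """One-step AR(p) prediction: c + phi_1*y[t-1] + ... + phi_p*y[t-p].

    `history` is ordered oldest to newest. The noise term is omitted
    because its expected value is zero.
    """
    p = len(phis)
    if len(history) < p:
        raise ValueError("need at least p past observations")
    # history[-1] is y[t-1], history[-2] is y[t-2], and so on.
    lagged = [history[-i] for i in range(1, p + 1)]
    return c + sum(phi * y for phi, y in zip(phis, lagged))


# Hypothetical AR(2) coefficients and history, for illustration only.
prediction = ar_one_step([1.0, 2.0, 3.0], c=0.5, phis=[0.5, 0.2])
```

Here the prediction is 0.5 + 0.5 × 3.0 + 0.2 × 2.0 = 2.4, which shows that only the last p observations enter the forecast.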
Moving Average:
The MA part indicates that the regression error is a linear combination of error terms
whose values occurred contemporaneously and at various times in the past. A moving
average model uses past forecast errors in a regression-like model.
We refer to this as an MA(q) model, a moving average model of order q, where q is the
number of lagged forecast errors. In this model we do not observe the values of
$\varepsilon_t$, so it is not really a regression in the usual sense.
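For reference, the standard form of the MA(q) model described here can be written as follows. The coefficient symbols $\theta_1, \dots, \theta_q$ are conventional notation assumed for this sketch rather than taken from this report; $c$ and $\varepsilon_t$ have the same meaning as in the autoregressive model.

```latex
y_t = c + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \cdots + \theta_q \varepsilon_{t-q}
```

Each value of $y_t$ is thus a weighted combination of the current error and the past q forecast errors.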
Integrated:
The I (for "integrated") indicates that the data values have been replaced with the difference
between their values and the previous values. Here d is the degree of differencing (the
number of times the data have had past values subtracted).
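The differencing step can be sketched in Python as follows. The function name `difference` and the sample series are assumptions made for illustration; the sketch shows how applying the operation d times removes a trend.

```python
def difference(series, d=1):
    """Apply first-order differencing `d` times."""
    if d < 0:
        raise ValueError("d must be non-negative")
    result = list(series)
    for _ in range(d):
        # Each pass replaces y[t] with y[t] - y[t-1] and drops one value.
        result = [curr - prev for prev, curr in zip(result, result[1:])]
    return result


# Hypothetical series with a quadratic trend, for illustration only.
sales = [1, 4, 9, 16, 25]
once = difference(sales, d=1)   # [3, 5, 7, 9]: a trend is still visible
twice = difference(sales, d=2)  # [2, 2, 2]: constant, the trend is removed
```

Each round of differencing shortens the series by one observation, which is one reason d is kept as small as possible.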
Information Criteria
Akaike’s Information Criterion (AIC), which was useful in selecting predictors for
regression, is also useful for determining the order of an ARIMA model. It can be written as
$$\mathrm{AIC} = -2\log(L) + 2(p + q + k + 1),$$
where $L$ is the likelihood of the data, $k = 1$ if $c \neq 0$ and $k = 0$ if $c = 0$.
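The formula translates directly into code. In the sketch below the function name `arima_aic` and the log-likelihood values are hypothetical; a real log-likelihood would come from fitting the model.

```python
def arima_aic(log_likelihood, p, q, has_constant):
    """AIC = -2*log(L) + 2*(p + q + k + 1), with k = 1 if c != 0 else 0."""
    k = 1 if has_constant else 0
    return -2.0 * log_likelihood + 2 * (p + q + k + 1)


# Hypothetical log-likelihoods for two candidate models, illustration only.
candidates = {
    (1, 1): arima_aic(-100.0, p=1, q=1, has_constant=True),  # 208.0
    (2, 2): arima_aic(-99.5, p=2, q=2, has_constant=True),   # 211.0
}
best_order = min(candidates, key=candidates.get)
```

In this example the larger model fits slightly better, but its two extra parameters cost more than the gain in likelihood, so the smaller model has the lower AIC and is preferred.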
Fig 3.2.1: Block Diagram
The sales data provided by the user is used to analyse sales and to predict sales for the
upcoming 4 years. For the analysis, we take the current sales data and perform exploratory
data analysis to represent it graphically. Total sales and profit are calculated so that
the user can have an overview of both.
For the prediction, the training and testing data are fed into the ARIMA model. From the
training data the parameters p, d, q, P, D, Q, and m are determined, where p is the order
(number of time lags) of the autoregressive model, d is the degree of differencing (the
number of times the data have had past values subtracted), and q is the order of the
moving-average model. P, D, and Q are the autoregressive, differencing, and moving average
terms for the seasonal part of the ARIMA model, and m is the number of periods in each
season. The ARIMA model produces the forecast, and the best fit is selected by AIC (Akaike’s
Information Criterion). The forecasted data is exported to an xlsx file.
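One common way to carry out this selection is a grid search over candidate parameter combinations, keeping the one with the lowest AIC. The sketch below assumes that approach; it is not taken from the project code. The callback `fit_aic` is hypothetical and stands for whatever routine fits one candidate model and returns its AIC, and `toy_fit_aic` is a stand-in scorer so that the example runs without a modelling library.

```python
import itertools


def select_order(series, fit_aic, max_order=1, m=12):
    """Return the (order, seasonal_order, aic) with the lowest AIC.

    `fit_aic(series, order, seasonal_order)` is a hypothetical callback
    that fits one candidate model and returns its AIC, or raises
    ValueError if the fit fails.
    """
    values = range(max_order + 1)
    best = None
    for p, d, q in itertools.product(values, repeat=3):
        for P, D, Q in itertools.product(values, repeat=3):
            order, seasonal_order = (p, d, q), (P, D, Q, m)
            try:
                aic = fit_aic(series, order, seasonal_order)
            except ValueError:
                continue  # skip candidates that cannot be fitted
            if best is None or aic < best[2]:
                best = (order, seasonal_order, aic)
    return best


def toy_fit_aic(series, order, seasonal_order):
    """Stand-in scorer so the sketch runs without a modelling library."""
    target = (1, 1, 1, 0, 1, 1)
    candidate = order + seasonal_order[:3]
    return 100.0 + sum(abs(a - b) for a, b in zip(candidate, target))


best = select_order([0.0] * 48, toy_fit_aic)
```

In a real run the callback would wrap the actual model fit and return the fitted model's AIC. With `max_order=1` the search covers 64 candidates, so the grid should be kept small when each fit is expensive.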
When the user accesses the website, they must enter valid login credentials. On the
dashboard, tabs let the user view either the analysis or the prediction.
If the user selects analysis, they can view the number of sales in each category and
subcategory as a pie chart, and a year-wise sales comparison of each category as a bar graph
with a slider. The user can also view total sales vs profit, overall and for each individual
category, as line graphs.
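The aggregations behind these charts can be sketched with the Python standard library alone. The record layout (`year`, `category`, `sales`, `profit`) and the figures below are assumptions made for illustration, not the project's actual schema or data.

```python
from collections import defaultdict


def summarise_sales(records):
    """Aggregate sales and profit by category and sales by (year, category)."""
    by_category = defaultdict(lambda: {"sales": 0.0, "profit": 0.0})
    by_year_category = defaultdict(float)
    for rec in records:
        by_category[rec["category"]]["sales"] += rec["sales"]
        by_category[rec["category"]]["profit"] += rec["profit"]
        by_year_category[(rec["year"], rec["category"])] += rec["sales"]
    return dict(by_category), dict(by_year_category)


# Hypothetical rows; the field names are assumptions for this sketch.
records = [
    {"year": 2016, "category": "Furniture", "sales": 100.0, "profit": 10.0},
    {"year": 2017, "category": "Furniture", "sales": 150.0, "profit": 20.0},
    {"year": 2017, "category": "Technology", "sales": 200.0, "profit": 50.0},
]
by_category, by_year_category = summarise_sales(records)
```

The per-category totals would feed the pie chart and the sales vs profit graphs, while the (year, category) totals would feed the year-wise bar graph.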
If the user selects predictions, they can view predicted sales overall and for each
category as a line graph. They can view the overall predicted total sales vs profit as a
line graph, together with a date-wise prediction table. To view the predicted sales for a
category, the user selects the category and its sales are shown as a line graph.
Users can also get an overview of overall estimated growth, the expected growth of each
category, and a prediction of the category that will have the highest sales in upcoming years.
3.3. ALGORITHMS
Step 1 : Start
Step 3: If Login details are correct
Go to step 4
Else
Go to step 5
Else
Go to step 11
Step 5: Show options: Sales Analysis and Sales Prediction and logout
Goto step 7
Goto Step 8
Step 7.2: Show the year-wise sales comparison of all categories in a bar graph
Step 8.1: Show total sales vs profit prediction for upcoming 3 years
Step 8.2: Show the date-wise sales prediction in a table
Step 8.3: Show sales prediction of each category as selected by the user
Step 10: Display “User disconnected - Please login to view the success screen again”
Goto Step 2
Else
Goto Step 11
Step 1: Start
Step 2: Enter the training and forecasting data to the ARIMA model
Else
Go to step 5
Go to step 6
Step 6: Determine the presence or absence of the constant term in the model
Step 8: Select the best suited structure for representation using AIC (Akaike’s
Information Criterion)
Step 9.1: Compare the obtained forecasted data with the actual sales data
3.4. FLOWCHARTS
3.4.2. FLOWCHART OF ARIMA
3.5 DATA FLOW DIAGRAM
The DFD level 0 diagram above also represents the black box diagram of the system. The
user has to provide sales data for a significant period of time (in this case, 4 years of
data) to the system. The admin reads the data from the system, analyses it, and represents
it in various forms in the system. The user also uses the data to predict the future sales
of the supermarket with the help of the ARIMA model. The system then presents the model
output, produced by applying the machine learning algorithm to the data, to the user in the
form of a dashboard.
3.6. UML USE CASE DIAGRAM
3.8. SEQUENCE DIAGRAM FOR LOGIN
1. Python:
2. Dash:
Dash is a user interface library for creating analytical web applications. Those who use
Python for data analysis, data exploration, visualization, modelling, instrument control, and
reporting will find immediate use for Dash.
3. Plotly:
4. Pandas:
Pandas is a high-level data manipulation tool developed by Wes McKinney. It is built on the
NumPy package and its key data structure is called the DataFrame. DataFrames allow you to
store and manipulate tabular data in rows of observations and columns of variables.
5. CSS:
CSS (Cascading Style Sheets) is a style sheet language used to format the layout of Web
pages. It can be used to define text styles, table sizes, and other aspects of Web pages
that previously could only be defined in a page’s HTML.
6. Bootstrap:
Bootstrap is a free and open-source CSS framework directed at responsive, mobile-first
front-end web development. It contains CSS and JS based design templates for forms, buttons,
navigation, and other interface components.
7. ARIMA:
ARIMA is an acronym that stands for AutoRegressive Integrated Moving Average. It is a
widely used and effective statistical method for time series forecasting. It uses time
series data either to better understand the data set or to predict future trends.
8. Flask:
9. SQLite:
the help of tsa.seasonal_decompose(). This process helps us determine whether to use the
ARIMA model and whether the series is seasonal.
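To illustrate what such a decomposition computes, the sketch below implements a classical additive decomposition in plain Python. It is a simplified illustration of the general technique and not the library's implementation; the function name `decompose_additive` and the synthetic series are assumptions made for the example.

```python
def decompose_additive(series, period):
    """Classical additive decomposition into trend, seasonal, residual.

    The trend is a centred moving average, the seasonal component is the
    mean detrended value at each position in the cycle (adjusted to sum
    to zero), and the residual is what remains. Positions where the
    moving average is undefined are returned as None.
    """
    n = len(series)
    if period < 2 or n < 2 * period:
        raise ValueError("need at least two full periods of data")
    half = period // 2
    trend = [None] * n
    for t in range(half, n - half):
        window = series[t - half:t + half + 1]
        total = sum(window)
        if period % 2 == 0:
            # Even period: give the two end points half weight each.
            total -= 0.5 * (window[0] + window[-1])
        trend[t] = total / period
    # Average the detrended values for each position in the cycle.
    buckets = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            buckets[t % period].append(series[t] - trend[t])
    means = [sum(b) / len(b) for b in buckets]
    offset = sum(means) / period
    seasonal = [means[t % period] - offset for t in range(n)]
    residual = [
        None if trend[t] is None else series[t] - trend[t] - seasonal[t]
        for t in range(n)
    ]
    return trend, seasonal, residual


# Hypothetical series: linear trend plus a repeating 4-step pattern.
pattern = [3.0, -1.0, -2.0, 0.0]
series = [10.0 + 2.0 * t + pattern[t % 4] for t in range(12)]
trend, seasonal, residual = decompose_additive(series, period=4)
```

On this synthetic series the recovered trend is the straight line, the seasonal component reproduces the repeating pattern, and the residual is zero. A seasonal component that is large relative to the residual is the kind of evidence that points to a seasonal model.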
Figure: Decomposition of Office Supplies Sales
The plots above clearly show that the sales are unstable and have an obvious seasonality,
so we use the Seasonal ARIMA model.
We first split our data into train and test sets: the train set is used to fit the model
and the test set is used to evaluate it. To assess the accuracy of our forecasts, we compare
the predicted values for the test period with the real data in the test set.
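A minimal sketch of this evaluation is shown below. The chronological split and the root mean squared error (RMSE) metric are assumptions chosen for illustration, since the report does not state which split ratio or error metric was used; the sales and forecast figures are hypothetical.

```python
import math


def train_test_split_chronological(series, test_size):
    """Hold out the last `test_size` observations as the test set."""
    if not 0 < test_size < len(series):
        raise ValueError("test_size must be between 1 and len(series) - 1")
    return series[:-test_size], series[-test_size:]


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical monthly sales and forecasts, for illustration only.
sales = [100.0, 110.0, 120.0, 130.0, 140.0, 150.0]
train, test = train_test_split_chronological(sales, test_size=2)
forecast = [137.0, 154.0]  # stand-in for the model's predictions
error = rmse(test, forecast)
```

The split keeps the observations in time order, so the model is always evaluated on data that comes after everything it was trained on.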
1. Furniture
2. Technology
3. Office Supplies
4. Total sales
The line plots show the observed values compared with the rolling forecast predictions for
each category. Overall, our forecasts align well with the true values, showing an upward
trend from the beginning of the year and capturing the seasonality toward the end of the
year.
VALIDATION
The tables show the validation summaries of the different parameters for Furniture,
Technology, and Office Supplies sales respectively.
1. Furniture
2. Technology
3. Office Supplies
3. ar.S.L1 and ma.S.L2 refer to the seasonal ‘autoregressive’ and ‘moving average’
terms respectively, with a lag of 12. All of these coefficients are part of the ARIMA
equation.
4. The ‘std err’ column is the estimated standard error of each coefficient. It tells you
how precisely the parameter in the first column is estimated. The standard errors here
are mostly around 0.5, which we consider acceptable for our model.
5. The ‘z’ column is equal to ‘coef’ divided by ‘std err’. It is thus the standardised
coefficient.
6. The P>|z| column is the p-value of the coefficient. It is important to check these
p-values before you continue using the model: the lower the p-value, the stronger the
evidence that the parameter differs from zero.
7. The [0.025 and 0.975] columns are the lower and upper bounds of the 95% confidence
interval for each coefficient, roughly the estimate plus or minus two standard errors.
An interval that includes zero suggests that the coefficient is not significantly
different from zero.
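The relationship between these columns can be checked with a short calculation. The sketch below assumes the usual large-sample normal approximation; the function name `coefficient_stats` and the input values are hypothetical.

```python
import math


def coefficient_stats(coef, std_err, z_crit=1.96):
    """Return (z, p_value, ci_low, ci_high) for one estimated coefficient.

    Assumes the estimate is approximately normally distributed, which is
    the usual large-sample assumption behind such summary tables.
    """
    if std_err <= 0:
        raise ValueError("std_err must be positive")
    z = coef / std_err
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value, coef - z_crit * std_err, coef + z_crit * std_err


# Hypothetical coefficient and standard error, for illustration only.
z, p_value, ci_low, ci_high = coefficient_stats(coef=1.0, std_err=0.5)
```

With a coefficient of 1.0 and a standard error of 0.5, z is 2.0, the p-value is just under 0.05, and the 95% interval runs from 0.02 to 1.98, so it narrowly excludes zero.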
CHAPTER 4: EPILOGUE
4.1. RESULT
After training the model with four years of past sales records, we were able to forecast
the sales of the following 3 years, i.e. until 2022. The results are shown in the graphs
below.
For Furniture:
For Technology:
For Total Sales:
4.2. CONCLUSION
We have built a web app that takes current sales data from the user, performs exploratory
data analysis, and predicts sales for the upcoming four years using the ARIMA model.
The sales data is read from the Excel file given by the user, and the analysis of sales is
presented through graphical representations such as pie charts and bar graphs, which allow
the user to view the sales of each category. The sales prediction is produced by the ARIMA
model and presented as a line graph. This web application allows the user to review the
sales and make future decisions based on the predictions.
The following feature can be added to the web app to make it more efficient and marketable:
As an enhancement, we can add a data upload page to the application so that any user can
upload their sales data for analysis and prediction.
REFERENCES/BIBLIOGRAPHY
[5] https://www.geeksforgeeks.org/python-data-analysis-using-pandas/
[6] https://www.learnpython.org/en/Pandas_Basics
[7] https://plot.ly/python/getting-started/
[8] https://medium.com/analytics-vidhya/5-reasons-every-aspiring-data-scientist-must-learn-sql-2bab007a8d76
[9] https://www.machinelearningplus.com/time-series/arima-model-time-series-forecasting-python
https://otexts.com/fpp2/
P. Ramos, et al., Performance of state space and ARIMA models for consumer retail
sales forecasting, Robotics and Computer Integrated Manufacturing (2015),
http://dx.doi.org/10.1016/j.rcim.2014.12.015i
SCREENSHOTS
Fig: Bar graph to view and compare categories(yearwise)
Fig: Page displayed after logout