The Effect of Optimization of Error Metrics: Sales Forecasting Domain
Autumn 2010:MI21
Abstract
It is important for a retail company to forecast its sales correctly and accurately to be able
to plan and evaluate sales and commercial strategies. Various forecasting techniques are
available for this purpose. Two popular modelling techniques are Predictive Modelling and
Econometric Modelling. The models created by these techniques are used to minimize the
difference between the real and the predicted values. There are several different error
metrics that can be used to measure and describe this difference. Each metric focuses on
different properties of the forecasts, and it is hence important which metric is used when
a model is created. Most traditional techniques use the sum of squared errors, which has
good mathematical properties but is not always optimal for forecasting purposes. This thesis
focuses on optimization of three widely used error metrics: MAPE, WMAPE and RMSE.
In particular, this thesis evaluates each metric's protection against overfitting, which occurs
when a predictive model catches noise and irregularities in the data that are not part of the
sought relationship.
Genetic Programming is a general optimization technique based on Darwin's theory of
evolution. In this study Genetic Programming is used to optimize predictive models based on
each metric. The sales data of five products of ICA (a Swedish retail company) has been
used to observe the effects of the optimized error metrics when creating predictive models.
This study shows that all three metrics are quite poorly protected against overfitting, even if
WMAPE and MAPE are slightly better protected than RMSE. However, WMAPE is the most
promising metric to use for optimization of predictive models. When evaluated against all
three metrics, models optimized based on WMAPE have the best overall result. The results on
training and test data show that the results hold in spite of overfitted models.
Acknowledgements
Allah Almighty has enabled me to conceptualize, undertake and complete this work, which is
conducted as a part of the academic requirements of my master's programme.
I wish to acknowledge my gratitude to my supervisor Mr. Rikard König for providing perfect
guidance and supporting me from the beginning till the end of the project. Without
his timely corrections, comments and valuable suggestions, I would not have been able to
complete this thesis in time.
I extend my gratitude to my friend Mr. Dinesh Bajracharya for providing valuable comments
during the dissertation progress, and to my mates in Borås, Sweden.
Special and grateful appreciation is due to my very special friend Miss Sadia Batool for
her constant moral support and encouragement in difficult times.
Table of Contents
Chapter 1 Introduction.................................................................................................................................6
1.1 Background..................................................................................................................................5
1.2 Thesis Statement..........................................................................................................................7
1.3 Purpose........................................................................................................................................7
1.4 Expected Result...........................................................................................................................8
1.5 Thesis Outline..............................................................................................................................8
2.1 Introduction.................................................................................................................................9
2.3 Forecasting...................................................................................................................................9
2.9.1 MAPE..........................................................................................................................13
2.9.2 WMAPE......................................................................................................................14
2.9.3 RMSE..........................................................................................................................15
2.10 Optimization..............................................................................................................................16
2.11.3 Autoregression................................................................................................................18
2.15.1 Crossover.....................................................................................................................23
2.15.2 Mutation......................................................................................................................24
2.15.3 Reproduction...............................................................................................................24
2.16 Bloating.....................................................................................................................................24
3.1 Introduction...............................................................................................................................27
3.2 Data............................................................................................................................................27
3.3 Input...........................................................................................................................................29
3.6 Rank...........................................................................................................................................29
Chapter 4 Results.......................................................................................................................................31
4.2 Result-1......................................................................................................................................31
4.3 Result-2......................................................................................................................................35
4.4 Result-3......................................................................................................................................36
Chapter 5 Analysis.....................................................................................................................................38
5.1 Introduction...............................................................................................................................38
5.2 Analysis.....................................................................................................................................38
5.3 Discussion..................................................................................................................................39
Chapter 6 Conclusion.................................................................................................................................40
References..............................................................................................................................................................41
List of Tables
Table X Example for MAPE, WMAPE and RMSE................................................................................16
List of Figures
Figure 1 Sales prediction of a hypothetical product and forecast model.................................................15
Figure 4 Crossover...................................................................................................................................23
Figure 5 Mutation.....................................................................................................................................24
Chapter 1___________________________________________________________________
INTRODUCTION
1.1 Background
The business environment is rapidly changing and is becoming more complex and
competitive. To anticipate and deal with new challenges it is imperative to be proactive.
Forecasting is the process of anticipating the future, based on available historical data. In a
business firm, the higher management may use forecasts for setting goals, allocating funds,
determining risks, opportunities and problems (Wang G., Jain C., 2003). Timely forecasting
helps retail companies to achieve competitive advantage by integrating customer focus sales
and comprehensive marketing plans for current and new products (STL Warehousing, 2010).
Forecasting also enables management to start/change operations at the right time in order to
obtain the greatest benefit. It helps companies to prevent losses by making appropriate
decisions based on the forecasted information. Forecasting can also help in making decisions
about developing new products or product lines. Finally, forecasting facilitates better decision
support when management is judging whether a product or product line will be successful
or not.
The importance of forecasting in the planning and decision-making process increases the need
for good and accurate forecasting models. A model represents the relationship between a set
of variables. Forecasting models can also show the cause-and-effect relationship and time
bearing between variables. There are numerous forecasting techniques that can be used to
build such models.
According to the Cambridge Advanced Learner's Dictionary a technique is "a way of doing
an activity which needs skill." Thus a forecasting modelling technique is a way of creating
a forecast model. The aim of forecast models is to keep the difference between forecasted and
real value at a minimum. The difference between the forecasted value and the actual value is
referred to as the error, which can be calculated by several different error metrics. The error
can be measured in different ways; for instance as a percentage, an average, etc.
Forecasting models are optimized according to a specific error metric, to give the minimum
possible value for an error between forecasted and real value. The value of an error metric
represents the accuracy of the forecasted values.
Standard forecasting techniques are designed to optimize a single error metric, most often
RMSE. There are however several error metrics that can measure different properties of a
given forecast. Hence Armstrong (2001) suggests that as many metrics as possible should be
considered when a model is evaluated by a decision maker.
A well known problem of all machine learning techniques is that they are inclined towards
overfitting, i.e. the models are too closely fitted to the data and hence learn noise and other
errors, such as measurement errors, present in the data. An overfitted model will perform well on
the data it was optimized for but poorly on new unseen data since the noise and measuring
errors will differ.
1.2 Thesis Statement
This thesis addresses the following questions:
Does optimization give the optimal solution for the same metric that is optimized?
Which of the most commonly used error metrics is best protected against overfitting?
Which of the most commonly used error metrics should be used for optimization to get
the best overall result?
1.3 Purpose
The purpose of this study is to evaluate how well the commonly used error metrics MAPE,
WMAPE and RMSE are protected against overfitting, and which of them should be used when
optimizing forecast models.
This will help to create more reliable forecasts. The focus and target of this study are those
business communities who are working in marketing, sales and planning departments. This
study also contains information for researchers in the field of forecasting.
1.4 Expected Result
The expected results of this study are:
One or more error metrics pointed out as better protected against
overfitting, which will give more accurate forecast models, i.e. models that
perform well on novel data.
Initial guidelines regarding the importance of the choice of error metric when
optimizing a forecast model.
Chapter 2___________________________________________________________________
THEORETICAL FOUNDATION
2.1 Introduction
This chapter describes various concepts/terminology/techniques which are useful for
understanding the problem, choosing appropriate techniques/models and carrying out
analysis. It includes types of forecasting, modelling techniques and brief description of
Genetic Programming used for optimization of error metrics.
2.2 Dataset
Data can have various formats and can be saved in different files. At the most basic level, a
single unit of information is called a variable. A variable can take a number of
different values. Variables which have numeric values such as 1, 2, 3.5, -8 etc. are numeric
variables. Height, weight, age and income are some examples of numeric variables. On the
other hand, variables that function as labels rather than numbers are called categorical
variables. For example, a variable may use the value 1 to represent a customer as Male and 2 as
Female. Categorical variables are stored and treated as strings (non-numeric
values). Objects described by the same variables are combined to form a dataset, which
is presented in the form of a table. A single row of the dataset is called an instance.
The variables in a dataset are of two types: independent variables and dependent
variables (Cios, Pedrycz, Swiniarski and Kurgan, 2007).
An independent variable can take any value, and its value affects the value of the dependent
variable. For example, the price of a pineapple depends on its weight; here weight is the
independent variable, as it can take any value (in this case only positive values), and price is
the dependent variable. If the weight increases, the price will also increase. The dependent
variable is the variable which is to be predicted; sometimes it depends on more than one
independent variable. The information in the dataset, represented by the variables, is fed to
the forecasting model for making future predictions.
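As an illustration, the pineapple example can be sketched as a tiny dataset (the numbers and the price-per-kg rule are invented for illustration and are not from the thesis data):

```python
# Hypothetical dataset: each dict is one instance (row).
# "weight_kg" is the independent variable, "price" the dependent one.
dataset = [
    {"weight_kg": 1.0, "price": 2.5},
    {"weight_kg": 2.0, "price": 5.0},
    {"weight_kg": 3.5, "price": 8.75},
]

# The sought relationship here is simply price = 2.5 * weight.
PRICE_PER_KG = 2.5  # assumed constant for this illustration
for instance in dataset:
    predicted = PRICE_PER_KG * instance["weight_kg"]
    print(instance["weight_kg"], predicted, instance["price"])
```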
2.3 Forecasting
Forecasting is the estimation of the future value of a dependent variable based on different
independent variables. In other words, forecasting is about what the future will look like.
Planners and managers need forecasts when there is uncertainty about the future; e.g. there is
no need to forecast whether the sun will rise tomorrow. Many decisions, however, involve
uncertainty where formal forecasting procedures would be useful. Forecasting makes
assumptions about the future based on past experience (Morrell J., 2001).
For example, a retailing firm that has been in business for 20 years can forecast its coming
year's sales based on the experience and historical data of the last 20 years.
The independent variables drive the growth or shrinkage of demand of a particular product or
service. Very few companies realise the forces that drive their industries or brands.
Understanding these forces provides the foundation for strategy development and business
planning (Griliches Z. and Intriligator M. D., 1983).
The forecasting techniques are used to create models which catch the patterns and
relationships between the variables in the dataset.
Qualitative forecasting
Quantitative forecasting
2.5 Model
The model represents the relationship of variables, and the expert's knowledge and opinion, in
simplified form. A model can also be defined as the description of causal relationships
between dependent and independent variables (Cios K.J., Pedrycz W., Swiniarski R.W. and
Kurgan L. A., 2007). This description can take different forms such as classifiers, decision
tree, production rules, mathematical equation etc. The most famous form of these models
taken in statistics is mathematical equation. The formulation of mathematical equation
representing the relationship between the dependent and independent variables in dataset is
also known as statistical model.
Training of a model
The training of a model is basically allowing the technique to adjust the parameters of the
model so it produces a minimal error according to some error metric on a training set, i.e. a
special dataset only used for training. Generally a large portion of the data is used for training
the model and the rest for testing. The training of a model has two steps (Rahou A.A.M., Al-Madfai
H., Coombs H., Gillelland D. and Ware A., 2007). The first step is to fit the model in relation
to the data in such a way that the model captures all the independent variables, dependent
variables and the relationship between them. The second step is to verify the validity of
model by comparing the sample forecasts and the actual values. A test set is used to evaluate
how the model would perform on new unseen data. Since the training data most often contain
noise, the error estimated on the training set is generally too low. A test set is needed since all
machine learning techniques are prone to overfitting, i.e. they do not only learn the sought
relationship but also some of the noise present in the data.
The predictive modelling techniques build models which use historical data and express the
relationship of dependent and independent variables in the form of equations.
Error metrics are the mathematical equations which are used to describe error between actual
and predicted values. The difference between actual and predicted values shows how well the
model has performed. The main idea of forecasting techniques is to minimize this value since
this should influence the performance and reliability of the model. These error metrics have
significant importance in forecasting of future sales, as the measurements taken by these
metrics, highly influence the future planning of the organizations. According to Bryan T.
(2005), any metric which measures the error should have five basic qualities:
validity, easy to interpret, reliability, presentable, and a statistical equation.
Here validity refers to the degree to which an error metric measures what it claims to measure. In
other words, validity refers to whether the error metric really measures what it intends to measure.
The metric should measure results which can relate to the available data. For example, if
the metric is intended to measure a result in binary form, then only a binary result will be
considered valid output; otherwise the validity of the metric can be questioned. Validity also
refers to the authenticity of the measurement, i.e. how authentic the error measured by an
error metric is.
Easy to Interpret refers to the simplicity of the metric. The metric should be easy to
understand and avoid complexity. Practitioners normally avoid using complex metrics;
financial forecasting in particular is complex, and there is a need to be able to explain and
motivate decisions, which is easier if the metric is simple and easy to understand.
Reliability is the consistency of the error metric when measuring error using the same measurement
method on the same subject. If repeated measurements are taken and the results are
highly consistent or even identical every time, then there is a high degree of reliability, but if the
measurements have large variations then reliability is low. An error metric is reliable when it
is evident that it will measure the same result every time it is given the same data.
Presentable refers to the ability of an error metric and its measurement to be represented in a
form which is easy to understand.
Statistical equation suggests that the error metric can be represented in the form of a
mathematical equation. Mathematical equations are the most common and easiest to interpret
form of error metrics.
On the basis of reliability, validity and wide use, the following performance (error) metrics
are selected for this thesis. Mentzer J.T. and Moon M.A. (2005) and Barreto H. and
Howland F.M. (2006) all elaborate the significance of these metrics:
MAPE
WMAPE
RMSE
2.9.1 MAPE
Mean Absolute Percentage Error (MAPE) is one of the most common and popular error
measuring metrics for forecasting. MAPE calculates the mean of the absolute percentage
errors, which is easy to understand and calculate. MAPE is represented by the following
equation:

    MAPE = (100/n) * Σ |(A_t − F_t) / A_t|        (2.1)

where A_t in Equation 2.1 is the actual value, F_t is the forecasted value and n is the number
of observations.
MAPE meets most of the qualities [Section 2.8], except reliability, which can be
questioned in some cases (Mentzer J.T. and Moon M.A., 2005), because MAPE focuses on errors
between small values in the data and hence can perform poorly for data which has sudden
increases in sales. For example, consider weeks three and six in Table X, which shows the
hypothetical sales of a product and the related forecast of some model. To calculate MAPE
the AE (absolute error) is divided by the actual sales: for week 3 the absolute error is 10
when the actual sale is 20, which gives a percentage error of 10/20 = 50%, but for week 6, where the
actual value is 100 and the error is still 10, it only becomes 10/100 = 10%. This example
is shown in Figure 1 and Table X. Hence, when the actual sale is low, a small error
becomes bigger, so MAPE focuses on the data with lower values. MAPE can be used where
sudden increases or decreases of sales volume are not expected.
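The calculation above can be sketched in a few lines of Python (a minimal illustration, not code from the thesis; the week 3 and week 6 actual values follow the worked example, with forecasts assumed to be 10 above the actual sales):

```python
def mape(actual, forecast):
    """Mean Absolute Percentage Error, in percent (Equation 2.1)."""
    n = len(actual)
    return 100.0 / n * sum(abs((a - f) / a) for a, f in zip(actual, forecast))

# Week 3: actual 20, forecast 30 -> 50%; week 6: actual 100, forecast 110 -> 10%
print(mape([20, 100], [30, 110]))  # (50 + 10) / 2 = 30.0
```

Note how the same absolute error of 10 contributes five times as much for the low-sales week, which is exactly the reliability issue described above.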
2.9.2 WMAPE
WMAPE calculates a weighted MAPE, giving different weights to the values so that high
values have the same importance as the lower ones. The WMAPE equation is:

    WMAPE = 100 * Σ |A_t − F_t| / Σ A_t        (2.2)

where A_t in Equation 2.2 is the actual value and F_t is the forecasted value.
WMAPE has the same properties as MAE (Mean Absolute Error), since the Absolute
Error (AE) is the core element of both MAE and WMAPE. MAE measures the absolute
deviation of forecasted values from the actual values. It gives the average of the total sum of
absolute errors, whereas WMAPE gives the result with weights assigned to every value. MAE
is suitable when the cost of forecast errors is relative to the size of the forecast error
(Kennedy P., 2003), and since WMAPE is based on AE this is also true for WMAPE.
However, in MAPE each calculated error gets the same amount of weight, which at times may
distort the final value. When there is a very small error in large revenue generating products
and a large error in low revenue generating products, MAPE does not treat the errors according
to their significance. In WMAPE it is possible to give more weight to the products generating
high revenue and lower weight to the other products. This weight is equal to the actual sale of
the product divided by the total sale of the product. When only a single forecast is
considered, WMAPE and MAPE will give identical errors. However, if a whole time series is
considered, WMAPE will scale each percentage error according to how much the actual sale
contributes to the total sale. In this way all AEs will influence the total WMAPE equally, but the
result will still be given as a percentage, which is easy to understand.
For example, weeks 3 and 6 in Table X both have an AE of 10, but the percentage error is
10/20 = 50% for week 3 and 10/100 = 10% for week 6, giving a total MAPE of
(50+10)/2 = 30%. If WMAPE is used instead, the resulting error would be (10/20) *
20/(20+100) + (10/100) * 100/(20+100) = 16.7%. WMAPE is better than MAPE when the sales
volumes of products vary significantly from one to another (Jain C., 2001). Hence
WMAPE overcomes the issue in MAPE, and its reliability can be trusted more than MAPE's.
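The same example can be sketched in Python (illustrative only; forecasts assumed to be 10 above the actual sales, as in the MAPE example):

```python
def wmape(actual, forecast):
    """Weighted MAPE (Equation 2.2): total absolute error divided by
    total actual sales, expressed in percent."""
    total_ae = sum(abs(a - f) for a, f in zip(actual, forecast))
    return 100.0 * total_ae / sum(actual)

# Weeks 3 and 6: AE of 10 each, actual sales 20 and 100.
print(round(wmape([20, 100], [30, 110]), 1))  # (10 + 10) / (20 + 100) = 16.7
```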
2.9.3 RMSE
The Root Mean Square Error is one of the most commonly used metrics. The SE (Square
Error) is the core element of RMSE. RMSE rules out the issue of negative and positive errors
cancelling each other out by taking the square of each error. Finally, the square root of the
average squared error between predicted and actual values is taken. The RMSE equation is
represented as:

    RMSE = √( (1/n) * Σ (A_t − F_t)² )        (2.3)

where A_t in Equation 2.3 is the actual value, F_t is the forecasted value and n is the number of
observations.
Since the errors are squared before they are averaged, RMSE gives relatively high weight
to larger errors, but this effect is somewhat controlled by taking the square root at the
end. A larger RMSE value signifies a poorer forecast.
For example, consider the errors for weeks three and four: the AE is 10 for week 3 and 30 for
week 4. If the absolute errors of these forecasts are considered, week three contributes
10/(10+30) = 25% of the error and week four 75%. However, if SE (the core
element of RMSE) is used, week three contributes 10²/(10²+30²) = 10% of the error
and week four 90%. This example demonstrates how RMSE puts higher weight on
larger errors than the other metrics.
The reliability of RMSE is also questionable in some cases, and according to Armstrong J.
(2001) RMSE is one of the worst error metrics to rely upon, but it is still widely used by
practitioners.
2.10 Optimization
Optimization is the process of finding the minima or maxima of a function to solve a
particular problem (Koza J.R., 1992). It usually refers to the process of finding the shortest
and best way of solving problem. It also refers to the selection of best possible solution from
a set of available alternatives. The selection depends on different criteria which may include
accuracy, time, efficiency and reliability etc. When creating a predictive model an error
metric is optimized to minimize the error between the predicted and actual value. According
to Randall M., (2005); Harton R.M. and Leonard M.H., (2005); Zhou Z., (2010) traditional
forecasting models mostly try to minimize Root Mean Square Error (RMSE) and Sum of
Square Error (SSE).
The method of minimising the sum of square errors (SSE) is highly appreciated and used in
the field of forecasting. Although it has some drawbacks, most forecasting
models still use RMSE and SSE as their error metric (Armstrong, 2001).
According to Wang G.C.S. and Jain C.L., (2003) following are the most commonly used
forecasting models:
Logistic Regression
Linear Regression
Autoregression
Autoregression Moving Average (ARMA)
Regression
In statistics, regression refers to the problem of determining the strength of the relationship
between one dependent variable and a series of independent variables. This relation is defined
in the form of a line that best approximates the individual data points. Regression can also be
used to determine which specific factors (for example the shape of a product, its price, promotional
discounts etc.) influence the price movement of an asset. According to Zopounidis C. and
Pardalos P.M. (1998) and Sykes A.O. (1992), the regression analysis technique for building
forecasting models mostly optimizes the Sum of Square Errors (SSE) due to computational
convenience and its popular statistical properties.
Analysts use logistic regression to predict whether a particular event will occur or not.
Logistic regression gives its result in binary form (Hosmer D.W. and Lemeshow S., 2000).
The binary nature of logistic regression makes it very popular and different from other
forecasting models. The logistic model can take any input value, but the result will always
remain 0 or 1 (Kleinbaum D.G. and Klein M., 2010). If the available data is
not represented in binary form, then normalization must first be performed.
The logistic model, which also optimizes SSE, is presented in the following equation:

    P = 1 / (1 + e^−(a + bX))        (2.5)
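A minimal sketch of how the logistic function maps any input to a value between 0 and 1, which is then thresholded to the binary outcome (the coefficients a and b below are invented for illustration):

```python
import math

def logistic(x, a, b):
    """Logistic function: P = 1 / (1 + e^-(a + b*x)), always in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))

p = logistic(2.0, a=-1.0, b=1.0)  # ~0.73
print(1 if p >= 0.5 else 0)       # binary output: 1
```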
Linear regression is the simplest model; it attempts to explain the relationship between
two variables as a straight line using a linear equation (Wiener J.D., Tan K. and Leong G.K.,
2008). Linear regression also tries to minimize SSE and is represented in Equation 2.6:

    Y = a + bX        (2.6)

where Y in Equation 2.6 is the dependent variable and X is the independent variable, b is the
regression coefficient and a is the intercept (the value of Y when X = 0).
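Fitting Equation 2.6 by minimizing SSE has a closed-form solution (ordinary least squares), sketched below with invented data points:

```python
def fit_linear(x, y):
    """Return (a, b) of Y = a + bX that minimizes the sum of squared errors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
    a = mean_y - b * mean_x
    return a, b

# Points lying exactly on Y = 1 + 2X are recovered perfectly.
print(fit_linear([1, 2, 3, 4], [3, 5, 7, 9]))  # (1.0, 2.0)
```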
2.11.3 Autoregression
Autoregression is used for time series modelling and assigns weights to past values of the
dependent variable. Every output is given a new weight, and this value is reused as an input to
predict the new unknown value. This means that more importance is given to new values
than to previous old outputs (Armstrong J., 2001).

    X_t = c + φ * X_{t−1} + ε_t        (2.7)

where c in Equation 2.7 is a constant, ε_t is the noise (error) and φ is the parameter of the model.
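A one-step forecast with a first-order autoregressive model is a single line of code; the constant and parameter values below are invented for illustration:

```python
def ar1_forecast(previous_value, c, phi):
    """One-step-ahead AR(1) forecast X_t = c + phi * X_{t-1} (noise term omitted)."""
    return c + phi * previous_value

# Last observed sale 100, with assumed c = 5 and phi = 0.9.
print(ar1_forecast(100.0, c=5.0, phi=0.9))  # 95.0
```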
Comparison of forecasting models is required when different models are available and a
single model is needed for a particular situation. Since data most often contains noise or is
not 100% representative, a test set is always used to give a better approximation of the error
on new unseen data. Poor results on a test set can be the result of a too small training dataset
or of an overfitted model.
Overfitted Model
A model is overfitted when it is excessively complex, for example having too many parameters
relative to the number of observations in the data. In that case the model may fit the noise
(error) along with the relationship and trend in the data. A model which has been overfitted can
produce predictions which are far beyond the range of the training data.
Ranking
For comparison purposes it is important and useful to rank the results produced by the
models/metrics. Since the results according to different metrics can vary in scale, and since the
difficulty of the datasets can also vary, it is crucial to make the comparison independent of
the scale of the errors.
The discussion in [Section 2.11] reveals that each modelling technique creates a model by
optimizing it according to a single predefined metric; in general they cannot optimize
arbitrary metrics. Due to this very limitation, the selection of modelling techniques needs to
be reconsidered when other metrics are to be optimized for forecasting. The
most common metrics optimized in forecasting models are SSE and
RMSE. However, as discussed above, this is not always the best choice.
Genetic programming (GP) is inspired by Darwin's theory of evolution and natural selection.
Darwin (1859) presents the following requirements for evolution to appear in nature:
To be involved in the next generation, every organism has to compete with other organisms of
its type, and fitter individuals will have a greater chance of winning.
According to Poli R., Langdon W.B. and McPhee N.F. (2008), genetic programming builds
on the evolutionary nature and survival of the fittest discussed above. GP evolves a
generation of computer programs, i.e. generation by generation GP stochastically transforms
the population of programs into a new population of programs. Like nature, GP is successful
at evolving novel and unexpected results for solving a problem. GP finds out how well a
program works by running it and then comparing its fitness with the other programs in the
population. The fitness is quantified by a numerical value. Programs with a high fitness
level will survive and get to reproduce. The offspring contain portions of their parent
programs and make up the foundation of the next generation.
Fitness Function
The performance of programs is judged by the numerical value of a fitness function. Koza
J.R. (1990) says that the fitness function identifies the fitness of a computer program
(solution), and it is presented as a numerical value. In this study the error metrics are used as
fitness functions to determine the error between real and predicted values. Each program in
the generated population is measured by how well it performs in a particular problem
environment. The value of the fitness function is used to rank different solutions, and depending
on the selection criteria the best program is selected.
Initial population
The population of programs is generated by using the sets of functions and terminals.
These sets are made according to the defined problem. For example, in a problem where the
desired criterion is to minimize the error between the predicted and real values, the set of
functions may include arithmetic operations (+, -, %, /, etc.), mathematical functions (cosine,
sine, tangent, etc.), or even conditionals (if, if-then, etc.), and the set of terminals contains the
variables or constants (Koza J.R., 1990).
In GP, programs are most commonly represented in the form of a tree, called a program tree.
The initial population of programs starts with a randomly selected initial node drawn from
the set of functions; this node is referred to as the root of the tree. Whenever a function
taking N arguments is selected as a node, N branches are created from that node. The elements
at the end points of the branches are randomly selected from the terminal and function sets.
If the end point of a branch is again a function, the process continues until every end
element is a terminal.
The set of functions can involve arithmetic and mathematical operations as well as
conditional rules (if, if-then, etc.) for building up a program tree. The rules perform
operations depending on the state of their conditions, i.e. true or false. The process starts
by selecting a root node from the function set. GP will create the respective number of
branches and select the elements for their end points. For example, suppose the root node is
an 'If' function: three branches will be created, and the root node will return the value of
the second branch (node) if the first branch (node) returns true, and otherwise the value of
the third node. The example is graphically presented below.
              IF
           /  |  \
          >   X1   /
         / \      / \
        *   50  X1   X2
       / \
     X1   X2

If{(X1*X2) > 50, then (X1), else (X1/X2)}
Ramped-half-and-half
The Grow method takes nodes from both the function and terminal sets until enough terminals
have been selected to end the tree or the maximum depth of the tree is reached. It is well
known for producing trees of irregular shapes and sizes. Both the Full and Grow methods are
influenced by the number of terminals in the terminal set.
The ramped-half-and-half method produces trees by performing the process 50% with the Full
method and 50% with the Grow method, over a range of ascending tree sizes, which introduces
even more diversity into the initial population.
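The three initialization methods can be sketched as follows. The binary arithmetic functions, the small terminal set, and the 30% early-termination probability in Grow are illustrative assumptions, not values taken from the thesis:

```python
import random

FUNCTIONS = ['+', '-', '*', '/']        # each takes two arguments
TERMINALS = ['X1', 'X2', 'X3', '1.0']   # variables and constants

def full_tree(depth, rng):
    # Full method: every branch reaches exactly the maximum depth
    if depth == 0:
        return rng.choice(TERMINALS)
    return [rng.choice(FUNCTIONS), full_tree(depth - 1, rng), full_tree(depth - 1, rng)]

def grow_tree(depth, rng):
    # Grow method: branches may end early, giving irregular shapes and sizes
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMINALS)
    return [rng.choice(FUNCTIONS), grow_tree(depth - 1, rng), grow_tree(depth - 1, rng)]

def ramped_half_and_half(pop_size, max_depth, rng):
    # Half the trees are built with Full, half with Grow, over a ramp of depths
    pop = []
    for i in range(pop_size):
        depth = 2 + i % (max_depth - 1)          # ramp depths from 2 up to max_depth
        method = full_tree if i % 2 == 0 else grow_tree
        pop.append(method(depth, rng))
    return pop

rng = random.Random(42)
population = ramped_half_and_half(10, 5, rng)
```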
Selection Methods
The two most commonly used program selection methods for generating a population are
discussed below.
Tournament Selection
Langdon W.B. (1998) says that in the tournament selection method a number of individual
programs are selected randomly from the population. These programs are compared with each
other on the basis of their fitness, and the best among them is selected as a parent program
for the next generation of the population. Notably, tournament selection only considers
which program is better than another; it does not take into account how much better. An
element of chance is inherent in tournament selection due to the random selection of the
programs to be compared, so while the best is preferred, there is a fair chance of an
average-fitness program being selected as a parent for the next generation. Tournament
selection is easy to implement and is commonly used in GP.
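A minimal sketch of tournament selection, assuming error-based fitness where lower values are better; the programs and error values are made up for illustration:

```python
import random

def tournament_select(population, fitness, k, rng):
    # Pick k programs at random and return the one with the lowest error.
    # Only "better than" matters here, not "how much better".
    competitors = rng.sample(range(len(population)), k)
    best = min(competitors, key=lambda i: fitness[i])
    return population[best]

programs = ['p0', 'p1', 'p2', 'p3', 'p4']
errors = [0.40, 0.05, 0.30, 0.10, 0.25]   # lower is better (e.g. WMAPE)
rng = random.Random(7)
parent = tournament_select(programs, errors, k=3, rng=rng)
```

With a small tournament size k, average programs are sometimes chosen as parents, which preserves diversity in the population.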
Roulette-Wheel Selection
In roulette-wheel selection each program is assigned a selection probability proportional to
its fitness:

p_i = f_i / Σ_j f_j    (2.9)

In Equation 2.9, f_i is the fitness of program i and p_i is the probability of selecting that
program. A higher probability increases the chances of a program being considered for the
next generation, but a lower probability does not completely rule out the chance of the
program being added to the next generation (Kattan A., 2010).
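A sketch of roulette-wheel selection following Equation 2.9. Since the error metrics in this study are minimized, the example converts errors to a higher-is-better fitness with 1/error; this transform is one common choice, not necessarily the one used in G-REX:

```python
import random

def roulette_select(population, fitness, rng):
    # Fitness-proportionate selection (Equation 2.9): program i is chosen
    # with probability p_i = f_i / sum(f). Higher fitness means a bigger
    # slice of the wheel, but low-fitness programs still have a chance.
    total = sum(fitness)
    spin = rng.random() * total
    running = 0.0
    for program, f in zip(population, fitness):
        running += f
        if spin <= running:
            return program
    return population[-1]   # guard against floating-point round-off

programs = ['p0', 'p1', 'p2']
errors = [0.40, 0.05, 0.30]
fitness = [1.0 / e for e in errors]   # lower error -> higher fitness
rng = random.Random(3)
parent = roulette_select(programs, fitness, rng)
```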
The basic workflow of genetic programming is represented in the figure below.

Figure 3: The basic workflow of genetic programming - run the initial random population of
programs, evaluate their quality against the fitness measure, reproduce new programs, and
repeat until a final solution is reached.

Figure 3 shows that the initially generated population of programs (solutions) is evaluated
against the value of the fitness function, and new generations of programs are produced
until the best solution is reached.
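The cycle described above can be written as a minimal evolutionary loop. This toy sketch evolves a number toward a target instead of program trees, and uses simple truncation-style selection rather than tournament or roulette-wheel selection, purely to keep the illustration short:

```python
import random

def evolve(init_population, fitness, breed, generations, rng):
    # Basic GP-style loop: evaluate the population, keep the fitter half,
    # and breed a new population from it; lower fitness value = better.
    population = list(init_population)
    for _ in range(generations):
        scored = sorted(population, key=fitness)
        best_half = scored[:max(1, len(scored) // 2)]
        population = [breed(rng.choice(best_half), rng) for _ in population]
    return min(population, key=fitness)

# Toy problem: evolve a number toward 10 (a stand-in for evolving program trees)
fitness = lambda x: abs(x - 10)
breed = lambda parent, rng: parent + rng.uniform(-1, 1)   # stand-in for mutation
rng = random.Random(0)
best = evolve([rng.uniform(0, 5) for _ in range(20)], fitness, breed, 50, rng)
```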
Koza J.R. (1992) describes the following main genetic operations for a population of programs:
- Crossover
- Mutation
- Reproduction
2.15.1 Crossover
Crossover creates variation in the population by creating new programs consisting of parts
taken from each parent. Crossover takes two parent programs from among the selected programs
and makes two new child programs. The operation starts by randomly selecting one point in
each parent program. These points can be selected at any level of the tree, which may vary
the length of the resulting programs, but the points must be compatible, i.e. the node at
each cut-off point must be a terminal or a function with an equal number of inputs. New
child programs are then created as combinations of the two parent programs, see Figure 4
below.
Parent 1: (X+Y)+3        Parent 2: (Y+1)*(X/2)
Child 1:  (Y+1)*(X+Y)    Child 2:  (X/2)+3

Figure 4: Crossover - the subtrees at the selected points, (X+Y) in Parent 1 and (X/2) in
Parent 2, are swapped to create the two children.
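Standard subtree crossover can be sketched on trees encoded as nested lists. For simplicity this version swaps arbitrary subtrees and ignores the arity-matching constraint described above:

```python
import copy
import random

# Program trees as nested lists: ['+', ['+', 'X', 'Y'], '3'] is (X+Y)+3
def subtrees(tree, path=()):
    # Enumerate every node position in the tree as a path of child indices
    yield path
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            yield from subtrees(child, path + (i,))

def get_node(tree, path):
    for i in path:
        tree = tree[i]
    return tree

def set_node(tree, path, new):
    if not path:
        return new                      # replacement at the root
    parent = get_node(tree, path[:-1])
    parent[path[-1]] = new
    return tree

def crossover(parent1, parent2, rng):
    # Pick a random point in each parent and swap the subtrees rooted there
    c1, c2 = copy.deepcopy(parent1), copy.deepcopy(parent2)
    p1 = rng.choice(list(subtrees(c1)))
    p2 = rng.choice(list(subtrees(c2)))
    sub1 = copy.deepcopy(get_node(c1, p1))
    sub2 = copy.deepcopy(get_node(c2, p2))
    return set_node(c1, p1, sub2), set_node(c2, p2, sub1)

rng = random.Random(1)
parent1 = ['+', ['+', 'X', 'Y'], '3']                # (X+Y)+3
parent2 = ['*', ['+', 'Y', '1'], ['/', 'X', '2']]    # (Y+1)*(X/2)
child1, child2 = crossover(parent1, parent2, rng)
```

Because crossover only exchanges material, the combined node count of the two children always equals that of the two parents.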
2.15.2 Mutation
This operation alters only one selected program (a single parent). Mutation is done by
randomly selecting a mutation point in the parent program to generate a new program.
Mutation removes whatever is currently at the selected point and inserts a randomly
generated sub-program at that point. The process is controlled by a constraint on the
maximum length of the newly created sub-program. The mutation process is shown in the
following figure.
Parent: (X+Y)+3        Randomly generated sub-program: Y*(X/2)
Child:  (X+Y)+(Y*(X/2))

Figure 5: Mutation - the terminal 3 at the selected mutation point is replaced by the
randomly generated sub-program Y*(X/2).
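Subtree mutation can be sketched in the same nested-list tree encoding; the function set, terminal set and depth cap used here are illustrative assumptions:

```python
import copy
import random

def random_subtree(depth, rng):
    # Randomly generated replacement sub-program, capped at a maximum depth
    if depth == 0 or rng.random() < 0.4:
        return rng.choice(['X', 'Y', '1', '2'])
    return [rng.choice(['+', '-', '*', '/']),
            random_subtree(depth - 1, rng), random_subtree(depth - 1, rng)]

def mutate(parent, rng, max_depth=2):
    # Replace the subtree at a random point with a freshly generated one
    child = copy.deepcopy(parent)
    paths = []                          # paths are tuples of child indices
    def walk(tree, path):
        paths.append(path)
        if isinstance(tree, list):
            for i, sub in enumerate(tree[1:], start=1):
                walk(sub, path + (i,))
    walk(child, ())
    point = rng.choice(paths)
    new_sub = random_subtree(max_depth, rng)
    if not point:
        return new_sub                  # mutation at the root
    node = child
    for i in point[:-1]:
        node = node[i]
    node[point[-1]] = new_sub
    return child

rng = random.Random(5)
parent = ['+', ['+', 'X', 'Y'], '3']    # (X+Y)+3
child = mutate(parent, rng)
```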
2.15.3 Reproduction
The reproduction operation also operates on one individual program, which is selected using
the selection probability of the program. The program is simply copied and included in the
new population of programs. Since the program does not change, the new individual has the
same fitness value as its parent and no new fitness calculation is needed; hence
reproduction noticeably reduces the total time required to run a generation of programs
(Walker M., 2001).
2.16 Bloating
Bloating is the tendency of programs to grow in size over the generations without a
corresponding improvement in fitness. Many methods have been suggested to control the
length of programs affected by bloating. These methods are used because smaller programs
tend to show better generalization performance and take less time and space to run
(Iba H., 1999). Among these methods, parsimony pressure is the most widely used: a penalty
based on the size of the program worsens its fitness, decreasing the chances of the program
being selected for the next generation. Poli R., Langdon W.B. and McPhee N.F. (2008)
express the adjusted fitness as:

f_p(x) = f(x) + c * size(x)    (2.10)

where f(x) is the original fitness of program x (here, an error to be minimized), size(x)
is the size (length) of the program, and c is a constant known as the parsimony coefficient.
The metrics which are discussed earlier are presented below in the form of fitness functions:

Fitness-MAPE(x)  = MAPE(x)  + size(x) * 0.035
Fitness-WMAPE(x) = WMAPE(x) + size(x) * 0.03
Fitness-RMSE(x)  = RMSE(x)  + size(x) * 0.025
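Equation 2.10 and the three fitness functions can be sketched as follows, counting the size of a program tree as its number of nodes (whether G-REX counts size in exactly this way is an assumption on our part):

```python
def size(tree):
    # Number of nodes in a nested-list program tree (terminals count as one node)
    if isinstance(tree, str):
        return 1
    return 1 + sum(size(child) for child in tree[1:])

def parsimony_fitness(error, tree, coefficient):
    # Equation 2.10: adjusted fitness = raw error + c * size, so larger
    # programs are penalized and become less likely to be selected
    return error + coefficient * size(tree)

tree = ['+', ['*', 'X1', 'X2'], '3']   # (X1*X2)+3, five nodes
fitness_wmape = parsimony_fitness(0.051, tree, 0.03)   # coefficient for Fitness-WMAPE
```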
The evolution of a population of programs through crossover, mutation and reproduction is
graphically presented by Poli R., Langdon W.B. and McPhee N.F. (2008) in Figure 6 below.
The flow chart describes how genetic programming produces and selects the best possible
solution for a problem while cycling through the three main operations.
Figure 6: Flow chart of genetic programming. Starting from the initial population, each
program is evaluated against the fitness measure. If the selection criterion is achieved,
the final resultant program is returned and the run ends. Otherwise a genetic operation is
selected: reproduction (select one program based on fitness and copy it into the new
population), mutation (select one program based on fitness and perform an alteration/
insertion), or crossover (select two programs based on fitness and create new programs from
two partial programs). The cycle then repeats.
Chapter 3___________________________________________________________________
EXPERIMENT
3.1 Introduction
This chapter describes the data and steps involved in the experimental work.
3.2 Data
The data used for the experiment was collected from ICA AB, Sweden. ICA is one of the
leading retail companies in Europe, with more than 2,200 stores in Sweden, Norway, Estonia,
Latvia and Lithuania. ICA offers a very large number of everyday groceries to its customers
and also runs a club membership scheme for its customers which offers special discounts from
time to time (www.ica.se).
The data is point-of-sale (POS) data for one of ICA's store types and contains two years of
sales records for five of ICA's popular products: Frozen Chicken, Sausages, Frozen
Vegetables, Frozen Fish Gratin and Sandwich Ham. The sales of each product are first
aggregated to weekly sales across all stores. The data is then normalized per product and
week.
The data consists of dependent and independent variables. The description of the data is as
follows: Sale, i.e. the number of items sold for a certain product, is the dependent
variable. The independent variables are commercials, child support, salary and price index.
The variable commercials refers to the advertisements made for the product. Child support is
a binary variable which shows whether governmental child support was paid out during the
current week. Similarly, salary signals whether salaries were paid during the current week.
Price index shows how the current price differs from the average price for the current year.
For every variable mentioned above, a lagged variable was also available in the data,
capturing the value of the previous week for that variable. These variables were added to
show the effect of the previous week's values on the current sales of the product. Retail
data often differs from many other problem domains, since commercials, special days etc.
have a strong effect on the sales of a particular product. Hence results from other
forecasting domains may not be applicable to sales forecasting. The products with their
total numbers of variables and records are given in the following table.
Product              Number of records    Number of independent variables
Sausages             101                  10
Frozen Vegetables    96                   12
The figure below shows the sales graph for Frozen Chicken. It is easy to see that whenever a
commercial is launched, the sales of frozen chicken increase. The data does not contain
information about the time or the number of times a commercial was played. C_A represents
the commercials.

Figure 7: Weekly sales of Frozen Chicken with commercial activity (C_A), weeks 1-101.
3.3 Input
- Weekly sales data of five products: Frozen Chicken, Sausages, Frozen Vegetables, Frozen
  Fish Gratin and Sandwich Ham.
- The data was available in Microsoft Excel format and was transformed into an executable
  format for G-REX.
- Fitness functions were written in G-REX according to each error metric and named
  Fitness-MAPE, Fitness-WMAPE and Fitness-RMSE.
- 75% of the data is used as training data and the remaining 25% as test data.
- The ramped-half-and-half method was selected to produce the population of programs.
- The parsimony pressure was selected and altered for every fitness function to keep the
  length of the programs between 35 and 45 nodes. A small number of nodes helps a program
  execute in less time and space.
- G-REX is used to evolve a predictive model which represents the relationship between the
  variables in the data and, most importantly, minimizes the error metric implemented in the
  current fitness function.
- Models are trained for as long as possible to ensure overfitting.
- Genetic programming is a non-deterministic technique that may produce different solutions
  every time it is executed. Hence the average of three runs is reported for each value of
  each fitness function to make the results more robust.
- The results are recorded separately for the training and test data sets.
- The values of the error metrics are recorded in a matrix for ranking and analysis purposes.
3.6 Rank
For the purpose of comparison, the values of the fitness functions are ranked. The metrics
are ranked because MAPE and WMAPE produce results in percentages whereas RMSE does not use a
percentage scale. The fitness function with the lowest error is ranked highest: the value 1
represents the fitness function with the lowest error, 2 represents the fitness function
with the second lowest error, and so on.
The results for the Frozen Chicken training dataset are presented below to show how the
ranking is done. Each model, regardless of which fitness function was used to create it, is
evaluated against every error metric. Each column presents the values for a single error
metric and each row presents the results for the model created with a specific fitness
function. The fitness function with the lowest error metric value is ranked 1 in every
column, the second lowest is ranked 2 and so on. For example, the value 0.051 is the lowest
WMAPE value and was obtained with Fitness-WMAPE, so it is ranked 1. These steps are repeated
to rank the results of the other products, for both training and test data. Then the average
rank of each fitness function is calculated against each error metric. Finally, the mean
rank is calculated for each fitness function.
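The per-column ranking can be sketched as follows, using the Frozen Chicken training values from Table 3-a. Ties receive the same rank, which reproduces the two MAPE values of 0.0009 both being ranked 2:

```python
def rank_column(values):
    # Rank 1 = lowest error; equal values share the same rank
    ordered = sorted(set(values))
    return [ordered.index(v) + 1 for v in values]

# Frozen Chicken, training data; rows: Fitness-MAPE, Fitness-WMAPE, Fitness-RMSE
mape_col = [0.0007, 0.0009, 0.0009]
wmape_col = [0.1031, 0.051, 0.0571]
rmse_col = [51.7201, 17.2749, 14.2096]
ranks = [rank_column(col) for col in (mape_col, wmape_col, rmse_col)]
```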
Chapter 4___________________________________________________________________
RESULTS
The results are presented in a matrix for each product. The rows of the matrix represent the
fitness functions and the columns contain the values of the error metrics.
4.2 Result-1
All calculated values and ranks are stored in Microsoft Excel format and presented in the
form of matrices. This representation makes it easy to understand the effect of optimizing a
model according to a certain metric. The results for all five products are presented in the
following tables.
Table 3-a: Frozen Chicken - Train

                 MAPE      Rank    WMAPE     Rank    RMSE       Rank
Fitness-MAPE     0.0007    1       0.1031    3       51.7201    3
Fitness-WMAPE    0.0009    2       0.051     1       17.2749    2
Fitness-RMSE     0.0009    2       0.0571    2       14.2096    1
Table 3-a indicates that Fitness-MAPE performed better than Fitness-WMAPE and Fitness-RMSE
at reducing the value of MAPE. However, it performed poorly at keeping WMAPE and RMSE low.
Table 3-b: Frozen Chicken - Test

                 MAPE      Rank    WMAPE     Rank    RMSE        Rank
Fitness-MAPE     0.0048    2       0.4596    2       207.0057    2
Fitness-WMAPE    0.0047    1       0.4338    1       183.8704    1
Fitness-RMSE     0.0049    3       0.4958    3       220.5851    3
The results for the test data of Frozen Chicken are quite different from those for the
training dataset. In Table 3-b, Fitness-WMAPE has the lowest value for all metrics involved
in the experiment.
Table 4-a: Sausage - Train

                 MAPE      Rank    WMAPE     Rank    RMSE      Rank
Fitness-MAPE     0.0004    1       0.0966    3       6.0854    3
Fitness-WMAPE    0.0006    2       0.0505    1       3.3805    2
Fitness-RMSE     0.00069   3       0.0676    2       3.3522    1
For the training dataset of Sausages, the results of Fitness-MAPE are the worst for WMAPE
and RMSE (as in Table 3-a).
Table 4-b: Sausage - Test

                 MAPE      Rank    WMAPE     Rank    RMSE       Rank
Fitness-MAPE     0.004     3       0.3947    3       27.5843    2
Fitness-WMAPE    0.0034    1       0.3842    2       29.6485    3
Fitness-RMSE     0.0035    2       0.3605    1       27.3470    1
Table 4-b shows the results for the test dataset of Sausages. It indicates a good
performance of Fitness-RMSE compared to the other fitness functions, but once again
Fitness-MAPE performed poorly.
Table 5-a: Frozen Vegetables - Train

                 MAPE      Rank    WMAPE     Rank    RMSE      Rank
Fitness-MAPE     0.0006    1       0.117     3       4.1967    3
Fitness-WMAPE    0.0007    3       0.061     1       2.2455    2
Fitness-RMSE     0.00067   2       0.0642    2       1.8452    1
Table 5-a shows that Fitness-RMSE performs consistently well, never ranking last for any
error metric, while also achieving the best result for RMSE.
Table 5-b: Frozen Vegetables - Test

                 MAPE      Rank    WMAPE     Rank    RMSE      Rank
Fitness-MAPE     0.0023    1       0.3165    3       5.1624    3
Fitness-WMAPE    0.0029    2       0.2589    1       4.3661    1
Fitness-RMSE     0.00291   3       0.2839    2       4.4246    2
Fitness-RMSE did not produce the best results for the test dataset, as shown in Table 5-b,
whereas Fitness-WMAPE obtained the minimum values of both RMSE and WMAPE.
The results of Frozen Fish Gratin (Table 6-a) follow the same pattern as all of the training
results so far. However, in Table 6-a the RMSE results for Fitness-WMAPE and Fitness-RMSE
are almost equal.
As can be seen in Table 6-b, Fitness-MAPE and Fitness-WMAPE are unsuccessful in finding the
best values for their respective error metrics; WMAPE and MAPE actually have their best
values with Fitness-RMSE.
Table 7-a: Sandwich Ham - Train

                 MAPE      Rank    WMAPE     Rank    RMSE       Rank
Fitness-MAPE     0.0003    1       0.0727    3       18.4529    3
Fitness-WMAPE    0.0004    2       0.0395    1       9.3254     2
Fitness-RMSE     0.00045   3       0.0445    2       9.2288     1
As expected, Table 7-a shows that every fitness function achieves the best result for its
respective error metric. It also shows the worst RMSE result when the MAPE fitness function
is used.
Table 7-b: Sandwich Ham - Test

                 MAPE      Rank    WMAPE     Rank    RMSE       Rank
Fitness-MAPE     0.0018    1       0.1774    1       27.3029    1
Fitness-WMAPE    0.0029    3       0.2782    3       42.403     3
Fitness-RMSE     0.0024    2       0.2393    2       37.0967    2
Like the other test datasets, Table 7-b gives an unexpected result: for the test data of
Sandwich Ham, Fitness-MAPE achieved the best values for all metrics.
4.3 Result-2
In the following tables the average rank of each fitness function is calculated against each
error metric, and the mean rank of each fitness function is also calculated. The results are
tabulated in Table 8-a for the training datasets and in Table 8-b for the test datasets.
Table 8-a: Average Rank of 5 Products - Train

                 MAPE    WMAPE    RMSE    Mean rank
Fitness-MAPE     1       3        3       2.33
Fitness-WMAPE    2.2     1        2       1.73
Fitness-RMSE     2.6     2        1       1.87
The average ranks show that on the training data each fitness function performed best for
its respective error metric. The mean rank column shows the best overall performance for
Fitness-WMAPE.
Table 8-b: Average Rank of 5 Products - Test

                 MAPE    WMAPE    RMSE    Mean rank
Fitness-MAPE     2       2.4      2.2     2.20
Fitness-WMAPE    1.8     1.8      2       1.87
Fitness-RMSE     2.2     1.8      1.8     1.93
On the test datasets, Fitness-MAPE performed differently than on the training datasets:
optimizing WMAPE gives a better MAPE result on test data. Although there is not much
difference in the overall performance of Fitness-WMAPE and Fitness-RMSE, Table 8-b shows
that Fitness-WMAPE has the best overall performance on the test datasets as well.
4.4 Result-3
To graphically see the effect of optimizing the error metrics, the graphs for frozen chicken
(as an example) are presented below for each fitness function.
Figure 8: Real and predicted sales of Frozen Chicken (weeks 1-101) when MAPE is optimized.
Figure 9: Real and predicted sales of Frozen Chicken (weeks 1-101) when WMAPE is optimized.
Figure 10: Real and predicted sales of Frozen Chicken (weeks 1-101) when RMSE is optimized.
Chapter 5___________________________________________________________________
ANALYSIS
5.1 Introduction
This chapter contains the analysis of the results presented in the previous chapter.
5.2 Analysis
Section 4.2 in chapter 4 clearly shows that GP succeeds in optimizing each metric: in the
experiments each fitness function has rank one (1) for its related metric on the training
datasets. This result encourages optimizing the same error metric that the forecasting model
will later be evaluated against.
Referring to Section 4.3, it is obvious from the results that MAPE is the worst metric to
optimize. This is probably because MAPE focuses on small values, so sudden increases in the
data are poorly predicted, as can be seen in Figure 8 of chapter 4.
Fitness-RMSE also performs well. This could be expected because the data shows spikes in the
sales graph influenced by commercials (Section 3.2, Figure 7). These spikes are predicted
quite accurately by Fitness-RMSE, which compensates for its weaker performance on the lower
values in the data. This is shown in Figure 10 of chapter 4.
Referring to Section 2.9, MAPE and RMSE are expected to perform poorly for data with high
and low values respectively, compared to WMAPE. The sales data in the experiment is strongly
influenced by commercials and shows sudden increases and decreases in sales volume
(Section 3.2, Figure 7). The experiments also support this statement to some extent, since
WMAPE achieves the best overall result (closely followed by RMSE) for both training and test
data. The result of Fitness-WMAPE for frozen chicken is shown in Figure 9 of chapter 4.
The models are clearly overfitted, since the ranking on test data differs greatly from the
ranking on training data. Considering the test results, all metrics are quite poorly
protected against overfitting, but RMSE and WMAPE perform comparably and slightly better
than MAPE.
Another reason for the difference between training and test results could of course be that
the training set was not very representative of the test set, especially since the
commercials were rare and affect the sales in an extreme way. Furthermore, the information
about commercials was only available in binary form.
5.3 Discussion
The unexpected change in the pattern of results from training to test data is most probably
due to overfitting of the models. Overfitting could possibly be reduced by introducing a
validation dataset in the experiment; a validation set is used to validate the training of a
model before it is applied to the test dataset. However, the purpose of this study was to
investigate whether some of the error metrics were better protected against overfitting, and
thus a validation set was not used. It is also not certain that a validation set would have
improved the results, since the data sets were relatively small and would have resulted in
very small sets if divided into three parts, i.e. training, validation and testing. Another
problem is that the data sets only contained a few but very influential commercials;
dividing these over three sets would further decrease the chance of finding a strong general
model.
Despite the models being overfitted, it can be observed that the choice of optimized error
metric has a significant effect on the other metrics. The overall performance of
Fitness-WMAPE is the best among the three optimized metrics, both for training and test
datasets. However, retail data often differs from many other domains, as commercials have a
significant effect on the sales of products. Hence the results produced here may not be
applicable to other domains.
Two directions for further work follow from the analysis:
- According to the analysis there is very little difference between the results of WMAPE
  and RMSE. More research could be carried out to find ways of making WMAPE more reliable
  than RMSE.
- Metrics other than those discussed in this thesis could be investigated, to observe the
  effect of optimizing them and to compare the results with WMAPE.
Chapter 6___________________________________________________________________
CONCLUSION
The study shows that it is promising to use Genetic Programming, as it allows optimizing
arbitrary metrics. The experiments also clearly show that GP can produce models that
minimize an error metric over a training set.
This study also shows that all of the evaluated metrics are poorly protected against
overfitting. WMAPE and RMSE are about equally protected and slightly better than MAPE,
which is the least protected metric.
When evaluated against all three metrics, models optimized on WMAPE have the best overall
result. This holds for both training and test data, which shows that the result holds in
spite of the overfitted models.
Based on these conclusions, WMAPE can be used as an alternative optimization criterion to
RMSE. WMAPE is also easier to interpret and leaves the judgment of the importance of an
error to the decision maker.
References
Iba, H. (1999). Bagging, Boosting, and Bloating in Genetic Programming. Available at:
http://www.cs.bham.ac.uk/~wbl/biblio/gecco1999/GP-407.pdf [Accessed 5 Feb 2011].

Jain, C.L. (2001). Forecasting Error in the Consumer Products Industry. The Journal of
Business Forecasting Methods and Systems, 22(2). Available at:
http://forecast.umkc.edu/ftppub/BDS545/SUM-03.pdf [Accessed 14 Nov 2010].

Jain, C.L. and Malehorn, J. (2006). Benchmarking Forecasting Practices: A Guide to Improve
Forecasting Performance. Graceway Publishing Company Inc.

Kattan, A. (2010). Evolutionary Synthesis of Lossless Compression Algorithms: the GP-zip
Family. Available at: http://www.ahmedkattan.com/PhD.pdf [Accessed 18 Dec 2010].

Kleinbaum, D.G. and Klein, M. (2010). Logistic Regression: A Self-Learning Text, Third
Edition. Springer Science+Business Media Inc.