Salvador García
Julián Luengo
Francisco Herrera
Data Preprocessing in Data Mining
Intelligent Systems Reference Library
Volume 72
Series editors
Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
e-mail: kacprzyk@ibspan.waw.pl
The aim of this series is to publish a Reference Library, including novel advances
and developments in all aspects of Intelligent Systems in an easily accessible and
well structured form. The series includes reference works, handbooks, compendia,
textbooks, well-structured monographs, dictionaries, and encyclopedias. It contains
well integrated knowledge and current information in the field of Intelligent
Systems. The series covers the theory, applications, and design methods of Intelligent Systems. Virtually all disciplines such as engineering, computer science,
avionics, business, e-commerce, environment, healthcare, physics and life science
are included.
Salvador García
Department of Computer Science
University of Jaén
Jaén, Spain

Julián Luengo
Department of Civil Engineering
University of Burgos
Burgos, Spain

Francisco Herrera
Department of Computer Science and Artificial Intelligence
University of Granada
Granada, Spain
To our families.
Preface
Data preprocessing is an often neglected but major step in the data mining process. Data collection is usually a loosely controlled process, resulting in out-of-range values, impossible data combinations (e.g., Gender: Male; Pregnant: Yes), missing values, and so on. Analyzing data that has not been carefully screened for such problems can produce misleading results. Thus, the representation and quality of the data must be addressed first and foremost before running an analysis. If there is much irrelevant and redundant information present, or noisy and unreliable data, then knowledge discovery is more difficult to conduct. Data preparation can take a considerable amount of processing time.
Data preprocessing includes data preparation, comprising the integration, cleaning, normalization and transformation of data, and data reduction tasks, such as feature selection, instance selection and discretization. The result expected after a reliable chaining of data preprocessing tasks is a final dataset that can be considered correct and useful for subsequent data mining algorithms.
This book covers the set of techniques under the umbrella of data preprocessing. It is a comprehensive book devoted completely to this field of Data Mining, including all the important details and aspects of the techniques that belong to these families. In recent years, this area has become of great importance because data mining algorithms require meaningful and manageable data to operate correctly and to provide useful knowledge, predictions or descriptions. It is well known that most of the effort made in a knowledge discovery application is dedicated to data preparation and reduction tasks. Both theoreticians and practitioners are constantly searching for data preprocessing techniques that ensure reliable and accurate results while trading off efficiency and time complexity. Thus, an exhaustive and updated background on the topic can be very useful in areas such as data mining, machine learning, and pattern recognition. This book invites readers to explore the many advantages that data preparation and reduction provide:
• To adapt and particularize the data for each data mining algorithm.
• To reduce the amount of data required for a suitable learning task, also
decreasing its time-complexity.
• To increase the effectiveness and accuracy in predictive tasks.
• To make possible the impossible with raw data, allowing data mining algorithms
to be applied over high volumes of data.
• To support the understanding of the data.
• To be useful for various tasks, such as classification, regression and unsupervised learning.
The target audience for this book is anyone who wants a better understanding of the current state of the art in a crucial part of knowledge discovery from data: data preprocessing. Practitioners in industry and enterprise should find new insights and possibilities in the breadth of topics covered. Researchers, data scientists and analysts in universities, research centers, and government may find a comprehensive review of the topic addressed and new ideas for productive research efforts.
Contents

1 Introduction
  1.1 Data Mining and Knowledge Discovery
  1.2 Data Mining Methods
  1.3 Supervised Learning
  1.4 Unsupervised Learning
    1.4.1 Pattern Mining
    1.4.2 Outlier Detection
  1.5 Other Learning Paradigms
    1.5.1 Imbalanced Learning
    1.5.2 Multi-instance Learning
    1.5.3 Multi-label Classification
    1.5.4 Semi-supervised Learning
    1.5.5 Subgroup Discovery
    1.5.6 Transfer Learning
    1.5.7 Data Stream Learning
  1.6 Introduction to Data Preprocessing
    1.6.1 Data Preparation
    1.6.2 Data Reduction
  References

9 Discretization
  9.1 Introduction
  9.2 Perspectives and Background
    9.2.1 Discretization Process
    9.2.2 Related and Advanced Work
  9.3 Properties and Taxonomy
    9.3.1 Common Properties
    9.3.2 Methods and Taxonomy
    9.3.3 Description of the Most Representative Discretization Methods
  9.4 Experimental Comparative Analysis
    9.4.1 Experimental Set up
    9.4.2 Analysis and Empirical Results
  References

Index
Chapter 1
Introduction
Vast amounts of data surround us in our world: raw data that is mainly intractable for humans or manual applications. The analysis of such data is therefore a necessity. The World Wide Web (WWW), business-related services, society, and applications and networks for science or engineering, among others, have been continuously generating data at an exponential rate since the development of powerful storage and connection tools. This immense data growth does not easily allow useful information or organized knowledge to be understood or extracted automatically. This fact has led to the rise of Data Mining (DM), which is currently a well-known discipline that is increasingly present in the Information Age.
DM is, roughly speaking, about solving problems by analyzing data present in real databases. Nowadays, it is characterized as the science and technology of exploring data to discover previously unknown patterns. Many people consider DM a synonym of the Knowledge Discovery in Databases (KDD) process, while others view DM as the main step of KDD [16, 24, 32].
There are various definitions of KDD. For instance, [10] defines it as “the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data”, whereas [11] considers the KDD process as an automatic exploratory data analysis of large databases. A key aspect that characterizes the KDD process is the way it is divided into stages, according to the agreement of several important researchers in the topic. There are several ways to make this division, each with advantages and disadvantages [16]. In this book, we adopt a hybridization widely used in recent years that categorizes these stages into six steps:
1. Problem Specification: Designating and arranging the application domain, the
relevant prior knowledge obtained by experts and the final objectives pursued by
the end-user.
2. Problem Understanding: Including the comprehension of both the selected data
to approach and the associated expert knowledge, in order to achieve a high degree of reliability.
3. Data Preprocessing: This stage includes operations for data cleaning (such as handling the removal of noise and inconsistent data), data integration (where multiple data sources may be combined into one), data transformation (where data is transformed and consolidated into forms which are appropriate for specific DM tasks or aggregation operations) and data reduction, including the selection and extraction of both features and examples in a database. This phase will be the focus of study throughout the book.
4. Data Mining: This is the essential process, where the methods are used to extract valid data patterns. This step includes the choice of the most suitable DM task (such as classification, regression, clustering or association), the choice of the DM algorithm itself, belonging to one of the previous families, and, finally, the employment and adaptation of the selected algorithm to the problem, by tuning essential parameters and validation procedures.
5. Evaluation: Estimating and interpreting the mined patterns based on interestingness measures.
6. Result Exploitation: The last stage may involve using the knowledge directly;
incorporating the knowledge into another system for further processes or simply
reporting the discovered knowledge through visualization tools.
Figure 1.1 summarizes the KDD process and reveals the six stages mentioned
previously. It is worth mentioning that all the stages are interconnected, showing that
the KDD process is actually a self-organized scheme where each stage conditions
the remaining stages and the reverse path is also allowed.
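Although the experiments in this book are carried out with the KEEL tool, a minimal sketch of how stages 3–5 can be chained programmatically may help fix the idea. The sketch below assumes scikit-learn and one of its bundled data sets; the pipeline steps and parameters are illustrative choices, not the book's experimental setup.

```python
# Minimal sketch of chaining KDD stages 3-5; scikit-learn is an assumption of
# this example, and all names and parameters are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)           # step 2: data to approach

pipeline = Pipeline([
    ("imputation", SimpleImputer(strategy="mean")),   # step 3: data cleaning
    ("scaling", StandardScaler()),                    # step 3: normalization
    ("learner", DecisionTreeClassifier(max_depth=4))  # step 4: DM algorithm
])

# step 5: evaluation with 10-fold cross-validation (see Chap. 2)
scores = cross_val_score(pipeline, X, y, cv=10)
print("accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```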
A large number of techniques for DM are well-known and used in many applications.
This section provides a short review of selected techniques considered the most
important and frequent in DM. This review only highlights some of the main features
of the different techniques and some of the influences related to data preprocessing
procedures presented in the remaining chapters of this book. Our intention is not to
provide a complete and detailed explanation of how these techniques operate, but to
stay focused on the data preprocessing step.
Figure 1.2 shows a division of the main DM methods according to the two ways of obtaining knowledge: prediction and description. In the following, we will give
a short description for each method, including references for some representative
and concrete algorithms and major considerations from the point of view of data
preprocessing.
Within the prediction family of methods, two main groups can be distinguished:
statistical methods and symbolic methods [4]. Statistical methods are usually characterized by the representation of knowledge through mathematical models with
computations. In contrast, symbolic methods prefer to represent the knowledge by
means of symbols and connectives, yielding more interpretable models for humans.
The most applied statistical methods are:
• Regression Models: being the oldest DM models, they are used in estimation tasks and require the class of equation used for modelling to be specified [24]. Linear, quadratic and logistic regression are the most well-known regression models in DM. They impose some basic requirements on the data: they use numerical attributes, they are not designed for dealing with missing values, they try to fit outliers to the model, and they use all the features regardless of whether they are useful or dependent on one another.
• Instance-based Learning: also known as lazy learning [3], these methods store the training examples and defer generalization until prediction time. Key issues are the examples used to make the prediction, their influence when using voting or weighting mechanisms, and the use of efficient algorithms to find the nearest examples, such as KD-Trees or hashing schemes. The K-Nearest Neighbor (KNN) method is the most applied, useful and
known method in DM. Nevertheless, it suffers from several drawbacks such as
high storage requirements, low efficiency in prediction response, and low noise
tolerance. Thus, it is a good candidate to be improved through data reduction
procedures.
• Support Vector Machines: SVMs are machine learning algorithms based on statistical learning theory [30]. They are similar to Artificial Neural Networks (ANNs) in the sense that they are used for estimation and perform very well when data is linearly separable. SVMs usually do
not require the generation of interaction among variables, as regression methods
do. This fact should save some data preprocessing steps. Like ANNs, they require
numeric non-missing data and are commonly robust against noise and outliers.
Regarding symbolic methods, we mention the following:
• Rule Learning: also called separate-and-conquer or covering rule algorithms [12]. All these methods share the same main operation: they search for a rule that explains some part of the data, separate these examples and recursively conquer the remaining examples. There are many ways of doing this, and also many ways to interpret the rules yielded and to use them in the inference mechanism. From the point of view of data preprocessing, generally speaking, they require nominal or discretized data (although this task is frequently implicit in the algorithm) and have an innate selector of interesting attributes from the data. However, MVs, noisy examples and outliers may harm the performance of the final model. Good examples of these models are the algorithms AQ, CN2, RIPPER, PART and FURIA.
• Decision Trees: predictive models formed by iterations of a divide-and-conquer scheme of hierarchical decisions [28]. They work by attempting to split the data using one of the independent variables to separate the data into homogeneous subgroups. The final form of the tree can be translated into a set of If-Then-Else rules from the root to each of the leaf nodes. Hence, they are closely related to rule learning methods and suffer from the same disadvantages. The most well-known decision trees are CART, C4.5 and PUBLIC.
Considering the data descriptive task, we prefer to categorize the usual problems
instead of the methods, due to the fact that both are intrinsically related to the case
of predictive learning.
• Clustering: it appears when there is no class information to be predicted but the
examples must be divided into natural groups or clusters [2]. These clusters reflect subgroups of examples that share some properties or have some similarities. They work by calculating a multivariate distance measure between observations, grouping the observations that are most closely related. Roughly speaking, they belong to three broad categories: agglomerative clustering, divisive clustering and partitioning clustering. The former two are hierarchical types of clustering, opposite to one another. The divisive one applies recursive divisions to the entire data set, whereas the agglomerative one starts from single observations and progressively merges them into larger clusters.
Turning to supervised learning, the universal instance space is defined as the Cartesian product of all input attribute domains and the target attribute domain.
The two basic and classical problems that belong to the supervised learning category are classification and regression. In classification, the domain of the target attribute is finite and categorical. That is, there is a finite number of classes or categories into which a sample can be predicted, and they are known by the learning algorithm. A classifier must assign a class to an unseen example once it has been trained on a set of training data. The nature of classification is to discriminate examples from others, attaining as a main application a reliable prediction: once we have a model that fits the past data, if the future is similar to the past, then we can make correct predictions for new instances. However, when the target attribute takes infinite values, such as in the case of predicting a real number within a certain interval, we are referring to regression problems. Here, the supervised learning approach has to fit a model to learn the output target attribute as a function of the input attributes. Obviously, the regression problem presents more difficulties than the classification problem, and the required computational resources and the complexity of the model are higher.
There is another type of supervised learning that involves time data. Time series
analysis is concerned with making predictions in time. Typical applications include
analysis of stock prices, market trends and sales forecasting. Due to the time depen-
dence of the data, the data preprocessing for time series data is different from the
main theme of this book. Nevertheless, some basic procedures may be of interest
and will also be applicable in this field.
We have seen that in supervised learning, the aim is to obtain a mapping from the
input to an output whose correct and definite values are provided by a supervisor. In
unsupervised learning, there is no such supervisor and only input data is available.
Thus, the aim is now to find regularities, irregularities, relationships, similarities and
associations in the input. With unsupervised learning, it is possible to learn larger
and more complex models than with supervised learning. This is because in supervised learning one is trying to find the connection between two sets of observations.
The difficulty of the learning task increases exponentially with the number of steps
between the two sets and that is why supervised learning cannot, in practice, learn
models with deep hierarchies.
Apart from the two well-known problems that belong to the unsupervised learning
family, clustering and association rules, there are other related problems that can fit
into this category:
Pattern mining is adopted as a more general term than frequent pattern mining or association mining, since it also covers rare and negative patterns. For example, in pattern mining the search for rules also focuses on multilevel, multidimensional, approximate, uncertain, compressed, rare/negative and high-dimensional patterns. The mining methods do not only involve candidate generation and pattern growth, but also interestingness, correlation and exception rules, distributed and incremental mining, etc.
Outlier detection, also known as anomaly detection, is the process of finding data examples whose behaviour is very different from expectation. Such examples are called outliers or anomalies. It is highly related to clustering analysis, because the latter finds the majority patterns in a data set and organizes the data accordingly, whereas outlier detection attempts to catch those exceptional cases that present significant deviations from the majority patterns.
Some DM problems are clearly differentiated from the classical ones, and some of them cannot even be placed into one of the two learning categories mentioned above, neither supervised nor unsupervised learning. As a result, this section provides a brief description of other major learning paradigms which are widespread and recent challenges in the DM research community.
We establish a general division based on the nature of the learning paradigm. When the paradigm presents extensions in data acquisition or distribution, imposed restrictions on models, or the implication of more complex procedures to obtain suitable knowledge, we refer to an extended paradigm. On the other hand, when the paradigm can only be understood as a mixture of supervised and unsupervised learning, we refer to a hybrid paradigm. Note that we only mention some learning paradigms out of the universe of possibilities and their interpretations, assuming that this section is just intended to introduce the issue.
Imbalanced learning appears when the number of examples representing the class of interest is much lower than that of the other classes. Its presence in many real-world applications has attracted growing attention from researchers.
Semi-supervised learning arises as a hybrid between the classification predictive task and the clustering descriptive analysis. It is a learning paradigm concerned with the design of models in the presence of both labeled and unlabeled data. Essentially, the developments in this field use unlabeled samples to either modify or re-prioritize the hypothesis obtained from the labeled samples alone. Both semi-supervised classification and semi-supervised clustering have emerged, extending the traditional paradigms by including unlabeled or labeled examples, respectively. Another paradigm called Active Learning, with the same objective as Semi-supervised Learning, tries to select the most informative examples from a pool of unlabeled data; these examples are then queried to a human expert for labeling.
Subgroup discovery, also known as Contrast Set Mining and Emergent Pattern Mining, results from another hybridization between supervised and unsupervised learning tasks: it seeks interpretable rules that describe subgroups of the data showing an interesting or unusual behaviour with respect to a property of interest [17].
Transfer learning aims to extract knowledge from one or more source tasks and apply it to a target task. In this paradigm, the algorithms apply knowledge about the source tasks when building a model for a new target task. Traditional learning algorithms assume that the training data and test data are drawn from the same distribution and feature space, but if the distribution changes, such methods need to rebuild or adapt the model in order to perform well. The so-called data shift problem is closely related to transfer learning.
Once some basic concepts and processes of DM have been reviewed, the next step is
to question the data to be used. Input data must be provided in the amount, structure
and format that suit each DM task perfectly. Unfortunately, real-world databases are
highly influenced by negative factors such as the presence of noise, MVs, inconsistent and superfluous data, and huge sizes in both dimensions, examples and features. Thus,
low-quality data will lead to low-quality DM performance [27].
In this section, we will describe the general categorization in which we can divide
the set of data preprocessing techniques. More details will be given in the rest of
chapters of this book, but for now, our intention is to provide a brief summary of
the preprocessing techniques that we should be familiar with after reading this book.
For this purpose, several subsections will be presented according to the type and set
of techniques that belong to each category.
Throughout this book, we refer to data preparation as the set of techniques that initialize the data properly to serve as input for a certain DM algorithm. It is worth mentioning that we prefer the term data preparation to designate this part of data preprocessing, a name that previous texts have confusingly used for the whole set of processes that perform data preprocessing tasks. This usage is not incorrect and we respect it; however, we prefer to clearly distinguish between data preparation and data reduction, due to the increasing importance that the latter set of techniques has achieved in recent years and to the clear distinctions that can be drawn from this separation.
Data preparation is normally a mandatory step. It converts previously useless data into new data that fits a DM process. First of all, if the data is not prepared, the DM algorithm might not be able to receive it in order to operate, or it will surely report errors during its runtime. In the best of cases, the algorithm will work, but the results offered will not make sense or will not be considered accurate knowledge.
Thus, what are the basic issues that must be resolved in data preparation? Here,
we provide a list of questions accompanied with the correct answers involving each
type of process that belongs to the data preparation family of techniques:
• How do I clean up the data?—Data Cleaning.
• How do I provide accurate data?—Data Transformation.
• How do I incorporate and adjust data?—Data Integration.
• How do I unify and scale data?—Data Normalization.
• How do I handle missing data?—Missing Data Imputation.
• How do I detect and manage noise?—Noise Identification.
Next, we will briefly describe each one of these techniques listed above. Figure 1.3
shows an explanatory illustration of the forms of data preparation. We recall that they
will be studied more in-depth in the following chapters of this book.
Data cleaning, or data cleansing, includes operations that correct bad data, filter some incorrect data out of the data set and reduce the unnecessary detail of data. It is a general concept that comprises or overlaps other well-known data preparation techniques. The treatment of missing and noisy data is included here, but both categories have been separated in order to devote a deeper analysis to the intelligent proposals for them further into this book. Other data cleaning tasks involve the detection of discrepancies and dirty data (fragments of the original data which do not make sense). The latter tasks are more related to the understanding of the original data and they generally require a human audit.
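As a toy illustration of the cleaning operations just described, the following sketch filters out-of-range values and impossible attribute combinations. The records and validity rules are invented for the example.

```python
# Illustrative sketch of basic data cleaning: filtering out-of-range values and
# impossible attribute combinations. Records and valid ranges are invented.
records = [
    {"age": 35, "gender": "Male",   "pregnant": "No",  "weight": 80.0},
    {"age": -3, "gender": "Female", "pregnant": "No",  "weight": 61.5},   # out of range
    {"age": 29, "gender": "Male",   "pregnant": "Yes", "weight": 72.0},   # impossible combination
    {"age": 54, "gender": "Female", "pregnant": "No",  "weight": 9999.0}, # dirty value
]

def is_clean(r):
    if not (0 <= r["age"] <= 120):
        return False
    if not (20.0 <= r["weight"] <= 300.0):
        return False
    if r["gender"] == "Male" and r["pregnant"] == "Yes":
        return False
    return True

clean = [r for r in records if is_clean(r)]
print(len(clean), "of", len(records), "records kept")   # 1 of 4 records kept
```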
In the data transformation step, the data is converted or consolidated so that the resulting mining process may be more efficient or its results more applicable. Subtasks inside data transformation are smoothing, feature construction, aggregation or summarization of data, normalization, discretization and generalization. Most of them will be treated as independent tasks, due to the fact that data transformation, as in the case of data cleaning, refers to a general family of data preprocessing techniques. The tasks that require human supervision and are more dependent on the data are the classical data transformation techniques, such as report generation, the construction of new attributes that aggregate existing ones, and the generalization of concepts, especially in categorical attributes, such as replacing complete dates in the database with year numbers only.
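A minimal sketch of two of the classical transformation subtasks mentioned above, generalization (complete dates replaced by year numbers) and aggregation/summarization, assuming invented field names:

```python
# Sketch of two classical data transformation operations: generalizing dates to
# years and aggregating existing attributes. Field names are illustrative.
from datetime import date

purchases = [
    {"customer": "c1", "date": date(2013, 3, 14), "units": 2, "unit_price": 9.5},
    {"customer": "c1", "date": date(2013, 7, 2),  "units": 1, "unit_price": 30.0},
    {"customer": "c2", "date": date(2014, 1, 20), "units": 5, "unit_price": 4.0},
]

for p in purchases:
    p["year"] = p["date"].year                  # generalization: date -> year
    p["amount"] = p["units"] * p["unit_price"]  # feature construction / aggregation

# summarization: total amount spent per customer and year
totals = {}
for p in purchases:
    key = (p["customer"], p["year"])
    totals[key] = totals.get(key, 0.0) + p["amount"]
print(totals)   # {('c1', 2013): 49.0, ('c2', 2014): 20.0}
```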
Data integration comprises the merging of data from multiple data stores. This process must be carefully performed in order to avoid redundancies and inconsistencies in the resulting data set. Typical operations accomplished within data integration are the identification and unification of variables and domains, the analysis of attribute correlation, the detection of duplicated tuples and the detection of conflicts in data values from different sources.
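The following sketch illustrates these integration operations on two hypothetical sources: attribute names and units are unified, the domain of a categorical attribute is mapped to a common coding, and duplicate tuples are detected by a shared key. Everything in it is invented for illustration.

```python
# Sketch of data integration over two invented sources describing the same
# entities with different attribute names, units and categorical codings.
source_a = [{"id": 1, "height_cm": 180, "country": "ES"},
            {"id": 2, "height_cm": 165, "country": "FR"}]
source_b = [{"patient_id": 2, "height_in": 64.9, "nation": "France"},
            {"patient_id": 3, "height_in": 70.1, "nation": "Spain"}]

country_map = {"France": "FR", "Spain": "ES"}   # domain unification

merged = {}
for r in source_a:
    merged[r["id"]] = {"id": r["id"], "height_cm": r["height_cm"], "country": r["country"]}
for r in source_b:                              # schema unification: rename, convert units
    rid = r["patient_id"]
    row = {"id": rid, "height_cm": round(r["height_in"] * 2.54, 1),
           "country": country_map[r["nation"]]}
    if rid in merged:
        # duplicate tuple detected: here the first source is kept, but a value
        # conflict could also be reported for human audit
        continue
    merged[rid] = row

print(sorted(merged.values(), key=lambda x: x["id"]))
```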
The measurement units used can affect the data analysis, so all the attributes should be expressed in the same measurement units and should use a common scale or range. Normalizing the data in this way attempts to give all attributes equal weight, and it is particularly useful in statistical learning methods.
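A minimal sketch of the two most common normalization schemes, min-max scaling and z-score standardization, written with NumPy only; the data is invented and the choice of scheme is an assumption of the example:

```python
# Min-max normalization to [0, 1] and z-score standardization on two
# attributes measured on very different scales (illustrative values).
import numpy as np

X = np.array([[50000.0, 1.62],
              [72000.0, 1.80],
              [38000.0, 1.75]])

x_min, x_max = X.min(axis=0), X.max(axis=0)
X_minmax = (X - x_min) / (x_max - x_min)    # min-max normalization

mu, sigma = X.mean(axis=0), X.std(axis=0)
X_zscore = (X - mu) / sigma                 # z-score standardization

print(X_minmax)
print(X_zscore)
```

Distance-based learners such as KNN are especially sensitive to attribute scales, which is one reason why this step is usually applied before them.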
Missing data imputation is a form of data cleaning whose purpose is to fill in the variables that contain MVs with some intuitive data. In most cases, adding a reasonable estimate of a suitable data value is better than leaving it blank.
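As a simple illustration, the sketch below applies the most basic imputation strategies, the mean for a numerical attribute and the mode for a nominal one; the more elaborate imputation methods are studied later in the book.

```python
# Baseline imputation: mean for numerical attributes, mode for nominal ones.
import numpy as np
from collections import Counter

ages = np.array([25.0, np.nan, 31.0, 47.0, np.nan])
colors = ["red", None, "blue", "red", "red"]

age_mean = np.nanmean(ages)                       # mean ignoring MVs
ages_imputed = np.where(np.isnan(ages), age_mean, ages)

mode = Counter(c for c in colors if c is not None).most_common(1)[0][0]
colors_imputed = [c if c is not None else mode for c in colors]

print(ages_imputed)    # [25.   34.33 31.   47.   34.33]
print(colors_imputed)  # ['red', 'red', 'blue', 'red', 'red']
```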
Noise identification is included as a step of data cleaning and is also known as smoothing in data transformation. Its main objective is to detect random errors or variance in a measured variable. Note that we refer to the detection of noise rather than to its removal, which is more related to the IS task within data reduction. Once a noisy example is detected, we can apply a correction-based process that may involve some kind of underlying operation.
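A toy sketch of noise identification in a single measured variable: values lying far from the rest (here, more than two standard deviations from the mean, an arbitrary threshold chosen for the example) are flagged for a later correction-based process rather than silently removed. This is only an illustration, not a method prescribed by the book.

```python
# Flag suspicious values of a measured variable using a crude 2-sigma rule.
import numpy as np

values = np.array([4.9, 5.1, 5.0, 5.2, 4.8, 12.7, 5.0, 5.1])
mu, sigma = values.mean(), values.std()
noisy_mask = np.abs(values - mu) > 2 * sigma      # crude, illustrative threshold

print("suspected noisy values:", values[noisy_mask])   # [12.7]
# a correction-based process could then replace them, e.g. by the mean of
# their neighbours or of their bin
```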
Data reduction comprises the set of techniques that, in one way or another, obtain a
reduced representation of the original data. For us, data preparation techniques are those that are needed to appropriately suit the data as input of a DM task. As we have mentioned before, this means that if data preparation is not properly
conducted, the DM algorithms will not be run or will surely report wrong results
after running. In the case of data reduction, the data produced usually maintains
the essential structure and integrity of the original data, but the amount of data is
downsized. So, can the original data be used, without applying a data reduction
process, as input of a DM process? The answer is yes, but other major issues must be
taken into account, being just as crucial as the issues addressed by data preparation.
Hence, at a glance, data reduction can be considered an optional step. However, this statement may be misleading. Although the integrity of the data is maintained, it is well known that any algorithm has a certain time complexity that depends on several parameters. In DM, one of these parameters is somehow directly proportional to the size of the input database. If the size exceeds a certain limit, which is very dependent on the type of DM algorithm, running the algorithm can be prohibitive, and then the data reduction task is as crucial as data preparation. Regarding other
factors, such as decreasing the complexity and improving the quality of the models yielded, the role of data reduction is again decisive.
As mentioned before, what are the basic issues that must be resolved in data
reduction? Again, we provide a series of questions associated with the correct answer
related to each type of task that belongs to the data reduction techniques:
• How do I reduce the dimensionality of data?—Feature Selection (FS).
• How do I remove redundant and/or conflictive examples?—Instance Selection
(IS).
• How do I simplify the domain of an attribute?—Discretization.
• How do I fill in gaps in data?—Feature Extraction and/or Instance Generation.
In the following, we provide a concise explanation of the four techniques enumerated before. Figure 1.4 shows an illustrative picture that reflects the forms of data
reduction. All of them will be extended, studied and analyzed throughout the various
chapters of the book.
Feature selection achieves the reduction of the data set by removing irrelevant or redundant features (or dimensions). The goal of FS is to find a minimum set of attributes such that the resulting probability distribution of the output attribute (or class) is as close as possible to the original distribution obtained using all attributes. It facilitates the understanding of the extracted patterns and increases the speed of the learning stage.
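The following sketch shows the filter idea in its simplest form: attributes are ranked by a relevance score (absolute correlation with a binary class, an illustrative choice) and only the top-ranked ones are kept. The synthetic data also shows a limitation of univariate filters: they cannot detect that the second attribute is redundant with the first.

```python
# Simple filter-style feature selection on synthetic data: rank attributes by
# absolute correlation with the class and keep the top-ranked ones.
import numpy as np

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)                     # binary class
relevant = y + 0.3 * rng.normal(size=n)            # correlated with the class
redundant = relevant * 2.0 + 0.01 * rng.normal(size=n)
irrelevant = rng.normal(size=n)
X = np.column_stack([relevant, redundant, irrelevant])

scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
ranking = np.argsort(scores)[::-1]
selected = ranking[:2]                             # keep the two best attributes
print("scores:", np.round(scores, 3), "selected columns:", selected)
# note: this univariate filter keeps the redundant attribute as well, which is
# why multivariate FS methods are often preferred
```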
Instance selection consists of choosing a subset of the total available data to achieve the original purpose of the DM application as if all the data had been used. It constitutes the family of methods that perform, in a somewhat intelligent way, the choice of the best possible subset of examples from the original data by using some rules and/or heuristics. The random selection of examples is usually known as sampling, and it is present in a very large number of DM models for conducting internal validation and for avoiding over-fitting.
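A sketch in the spirit of Hart's condensed nearest neighbour rule, one simple IS heuristic: only the examples needed for a 1-NN rule to classify the rest of the training set correctly are kept. The data is synthetic and the implementation is a simplified illustration, not the reference version of the algorithm.

```python
# Condensed-nearest-neighbour-style instance selection (simplified sketch).
import numpy as np

def condense(X, y):
    keep = [0]                                     # start with one stored example
    changed = True
    while changed:
        changed = False
        for i in range(len(X)):
            if i in keep:
                continue
            # classify example i with 1-NN over the currently stored subset
            d = np.linalg.norm(X[keep] - X[i], axis=1)
            pred = y[keep][np.argmin(d)]
            if pred != y[i]:                       # misclassified: store it
                keep.append(i)
                changed = True
    return sorted(keep)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
               rng.normal(loc=3.0, scale=0.5, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)
subset = condense(X, y)
print("kept %d of %d examples" % (len(subset), len(X)))
```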
Discretization transforms quantitative data into qualitative data, that is, numerical attributes into discrete or nominal attributes with a finite number of intervals, obtaining a non-overlapping partition of a continuous domain. An association between each interval and a discrete numerical value is then established. Once the discretization is performed, the data can be treated as nominal data during any DM process.
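A minimal sketch of the two unsupervised discretizers usually taken as baselines, equal-width and equal-frequency binning; the supervised methods analysed in Chap. 9 instead derive cut points from the class labels. Data and the number of intervals are illustrative.

```python
# Equal-width and equal-frequency discretization of one numeric attribute.
import numpy as np

values = np.array([1.2, 1.9, 2.3, 2.8, 3.1, 4.7, 5.5, 9.8])
k = 3   # desired number of intervals

# equal-width: k intervals of the same length over the attribute range
width_edges = np.linspace(values.min(), values.max(), k + 1)
equal_width = np.digitize(values, width_edges[1:-1])

# equal-frequency: cut points at the 1/k quantiles, so bins hold ~equal counts
freq_edges = np.quantile(values, [i / k for i in range(1, k)])
equal_freq = np.digitize(values, freq_edges)

print("equal-width labels:    ", equal_width)
print("equal-frequency labels:", equal_freq)
```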
It is noteworthy that discretization is actually a hybrid data preprocessing technique involving both data preparation and data reduction tasks. Some sources include discretization in the data transformation category, whereas others consider it a data reduction process. In practice, discretization can be viewed as a data reduction method since it maps data from a huge spectrum of numeric values to a greatly reduced subset of discrete values. Our decision is to mostly include it in data reduction, although we also agree with the other view. The motivation behind this is that recent discretization schemes try to reduce the number of discrete intervals as much as possible while maintaining the performance of the subsequent DM process. In other words, it is often very easy to perform a basic discretization that simply maps continuous values to categorical ones so that the data becomes suitable for a certain algorithm. However, the real difficulty is to achieve a good reduction without compromising the quality of the data, and much of the effort expended by researchers follows this tendency.
Feature extraction and instance generation extend FS and IS, respectively, by allowing the modification of the internal values that represent each example or attribute. In feature extraction, apart from the removal of attributes, subsets of attributes can be merged or can contribute to the creation of artificial substitute attributes. Instance generation is the analogous process for examples: it allows the creation or adjustment of artificial substitute examples that may better represent the decision boundaries in supervised learning.
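The sketch below illustrates both ideas with simple stand-ins: feature extraction through a principal component projection (new artificial attributes built as combinations of the original ones) and instance generation that replaces each class by its centroid. Both are illustrative baselines, not the methods studied later in the book.

```python
# Feature extraction via PCA (SVD) and instance generation via class centroids.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
X[:, 3] = X[:, 0] + 0.05 * rng.normal(size=100)    # nearly redundant attribute
y = (X[:, 0] > 0).astype(int)

# feature extraction: project onto the two leading principal components
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
X_extracted = Xc @ Vt[:2].T                        # 100 x 2 artificial attributes

# instance generation: one prototype (centroid) per class
prototypes = np.array([X[y == c].mean(axis=0) for c in np.unique(y)])

print(X_extracted.shape)   # (100, 2)
print(prototypes.shape)    # (2, 4)
```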
References
1. Adamo, J.M.: Data Mining for Association Rules and Sequential Patterns: Sequential and
Parallel Algorithms. Springer, New York (2001)
2. Aggarwal, C., Reddy, C.: Data Clustering: Recent Advances and Applications. Data Mining
and Knowledge Discovery Series. Chapman and Hall/CRC, Taylor & Francis Group, Boca
Raton (2013)
3. Aha, D.W., Kibler, D., Albert, M.K.: Instance-based learning algorithms. Mach. Learn. 6(1),
37–66 (1991)
4. Alpaydin, E.: Introduction to Machine Learning, 2nd edn. MIT Press, Cambridge (2010)
5. Amores, J.: Multiple instance classification: review, taxonomy and comparative study. Artif.
Intell. 201, 81–105 (2013)
6. Barber, D.: Bayesian Reasoning and Machine Learning. Cambridge University Press, New
York (2012)
7. Bishop, C.M.: Neural Networks for Pattern Recognition. Oxford University Press, New York
(1995)
8. Boutell, M.R., Luo, J., Shen, X., Brown, C.M.: Learning multi-label scene classification. Pattern
Recogn. 37(9), 1757–1771 (2004)
9. Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection: a survey. ACM Comput. Surv.
41(3), 15:1–15:58 (2009)
10. Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P.: From data mining to knowledge discovery: an overview. In: Advances in Knowledge Discovery and Data Mining. American Association for Artificial Intelligence, San Francisco (1996)
11. Friedman, J.H.: Data mining and statistics: What is the connection? The Data Administration Newsletter (1997)
12. Fürnkranz, J., Gamberger, D., Lavrac, N.: Foundations of Rule Learning. Springer, New York
(2012)
13. Gama, J.: Knowledge Discovery from Data Streams, 1st edn. Chapman & Hall/CRC, Taylor
and Francis, Boca Raton (2010)
14. García, S., Derrac, J., Cano, J.R., Herrera, F.: Prototype selection for nearest neighbor classifi-
cation: Taxonomy and empirical study. IEEE Trans. Pattern Anal. Mach. Intell. 34(3), 417–435
(2012)
15. García, S., Luengo, J., Sáez, J.A., López, V., Herrera, F.: A survey of discretization techniques:
taxonomy and empirical analysis in supervised learning. IEEE Trans. Knowl. Data Eng. 25(4),
734–750 (2013)
16. Han, J.: Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco (2011)
17. Herrera, F., Carmona, C.J., González, P., del Jesus, M.J.: An overview on subgroup discovery:
foundations and applications. Knowl. Inf. Syst. 29(3), 495–525 (2011)
18. Liu, H., Motoda, H.: Feature Extraction, Construction and Selection: A Data Mining Perspec-
tive. Kluwer International Series in Engineering and Computer Science. Kluwer Academic,
Boston (1998)
19. Liu, H., Motoda, H.: Feature Selection for Knowledge Discovery and Data Mining. Kluwer
International Series in Engineering and Computer Science. Kluwer Academic, Boston (1998)
20. Liu, H., Motoda, H.: Instance Selection and Construction for Data Mining. Kluwer Academic,
Norwell (2001)
21. Liu, H., Motoda, H.: Computational Methods of Feature Selection. Data Mining and Knowledge
Discovery Series. Chapman & Hall/CRC, Taylor and Francis, Boca Raton (2007)
22. López, V., Fernández, A., García, S., Palade, V., Herrera, F.: An insight into classification with
imbalanced data: Empirical results and current trends on using data intrinsic characteristics.
Inf. Sci. 250, 113–141 (2013)
23. Luengo, J., García, S., Herrera, F.: On the choice of the best imputation methods for missing
values considering three groups of classification methods. Knowl. Inf. Syst. 32(1), 77–108
(2012)
24. Nisbet, R., Elder, J., Miner, G.: Handbook of Statistical Analysis and Data Mining Applications.
Academic Press, Boston (2009)
25. Ong, K.: Frequent Pattern Mining. VDM Publishing, Germany (2010)
26. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22(10),
1345–1359 (2010)
27. Pyle, D.: Data Preparation for Data Mining. Morgan Kaufmann, San Francisco (1999)
28. Rokach, L.: Data Mining with Decision Trees: Theory and Applications. Series in Machine
Perception and Artificial Intelligence. World Scientific, Singapore (2007)
29. Sáez, J.A., Luengo, J., Herrera, F.: Predicting noise filtering efficacy with data complexity
measures for nearest neighbor classification. Pattern Recogn. 46(1), 355–364 (2013)
30. Schölkopf, B., Smola, A.J.: Learning with Kernels: Support Vector Machines, Regulariza-
tion, Optimization, and Beyond. Adaptive Computation and Machine Learning. MIT Press,
Cambridge (2002)
31. Triguero, I., Derrac, J., García, S., Herrera, F.: A taxonomy and experimental study on prototype
generation for nearest neighbor classification. IEEE Trans. Syst. Man Cybern. Part C 42(1),
86–100 (2012)
32. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques. Mor-
gan Kaufmann Series in Data Management Systems, 2nd edn. Morgan Kaufmann, San Fran-
cisco (2005)
33. Zhu, X., Goldberg, A.B., Brachman, R.: Introduction to Semi-Supervised Learning. Morgan
and Claypool, California (2009)
Chapter 2
Data Sets and Proper Statistical Analysis of Data Mining Techniques
Abstract Presenting a Data Mining technique and analyzing it often involves using a data set related to the domain. Fortunately, in research many well-known data sets are available and widely used to check the performance of the technique being considered. Many of the subsequent sections of this book include a practical experimental comparison of the techniques described in each one as an exemplification of this process. Such comparisons require a clear test bed in order to enable the reader to replicate and understand the analysis and the conclusions obtained. First we provide an insight into the data sets used to study the algorithms presented as representative in each section, in Sect. 2.1. In this section we elaborate on the data sets used in the rest of the book, indicating their characteristics, sources and availability. We also delve into the partitioning procedure and how it is expected to alleviate the problems associated with the validation of any supervised method, as well as the details of the performance measures that will be used in the rest of the book. Section 2.2 takes a tour of the most common statistical techniques required in the literature to provide meaningful and correct conclusions. The steps to follow to correctly use and interpret the outcome of the statistical tests are also given.
The ultimate goal of any DM process is to be applied to real life problems. As testing
a technique in every problem is unfeasible, the common procedure is to evaluate such
a technique in a set of standard DM problems (or data sets) publicly available. In this
book we will mainly use the KEEL DM tool, which is supported by the KEEL-Dataset repository,1 where data sets from different well-known sources such as UCI [2]
and others have been converted to KEEL ARFF format and partitioned. This enables
the user to replicate all the experiments presented in this book with ease.
As this book focuses on supervised learning, we will provide a list with the data
sets enclosed in this paradigm. The representative data sets that will be used in
classification are shown in Table 2.1. The table includes the most relevant information
about the data set:
1 http://keel.es/datasets.php.
Table 2.1 Classification data sets to be used in the remainder of this book
Acronym Data set #Attributes (R/I/N) #Examples #Classes Miss val.
ABA Abalone 8 (7/0/1) 4174 28 No
APP Appendicitis 7 (7/0/0) 106 2 No
AUD Audiology 71 (0/0/71) 226 24 Yes
AUS Australian 14 (3/5/6) 690 2 No
AUT Autos 25 (15/0/10) 205 6 Yes
BAL Balance 4 (4/0/0) 625 3 No
BAN Banana 2 (2/0/0) 5300 2 No
BND Bands 19 (13/6/0) 539 2 Yes
BRE Breast 9 (0/0/9) 286 2 Yes
BUP Bupa 6 (1/5/0) 345 2 No
CAR Car 6 (0/0/6) 1728 4 No
CLE Cleveland 13 (13/0/0) 303 5 Yes
CTR Contraceptive 9 (0/9/0) 1473 3 No
CRX Crx 15 (3/3/9) 690 2 Yes
DER Dermatology 34 (0/34/0) 366 6 Yes
ECO Ecoli 7 (7/0/0) 336 8 No
FLA Flare 11 (0/0/11) 1066 6 No
GER German 20 (0/7/13) 1000 2 No
GLA Glass 9 (9/0/0) 214 7 No
HAB Haberman 3 (0/3/0) 306 2 No
HAY Hayes 4 (0/4/0) 160 3 No
HEA Heart 13 (1/12/0) 270 2 No
HEP Hepatitis 19 (2/17/0) 155 2 Yes
HOC Horse colic 23 (7/1/15) 368 2 Yes
HOU Housevotes 16 (0/0/16) 435 2 Yes
IRI Iris 4 (4/0/0) 150 3 No
LED Led7digit 7 (7/0/0) 500 10 No
LUN Lung cancer 57 (0/0/57) 32 3 Yes
LYM Lymphography 18 (0/3/15) 148 4 No
MAM Mammographic 5 (0/5/0) 961 2 Yes
MON Monk-2 6 (0/6/0) 432 2 No
MOV Movement 90 (90/0/0) 360 15 No
NTH Newthyroid 5 (4/1/0) 215 3 No
PAG Pageblocks 10 (4/6/0) 5472 5 No
PEN Penbased 16 (0/16/0) 10992 10 No
PHO Phoneme 5 (5/0/0) 5404 2 No
PIM Pima 8 (8/0/0) 768 2 No
PRT Primary tumor 18 (0/0/18) 339 21 Yes
SAH Saheart 9 (5/3/1) 462 2 No
(continued)
• The name of the data set and the abbreviation that will be used as future reference.
• #Attributes is the number of attributes/features and their type. R stands for real
valued attributes, I means integer attributes and N indicates the number of nominal
attributes.
• #Examples is the number of examples/instances contained in the data set.
• #Classes is the quantity of different class labels the problem presents.
• Whether the data set contains MVs or not.
The benchmark data sets presented are used with one goal: to evaluate the perfor-
mance of a given model over a set of well-known standard problems. Thus the results
can be replicated by other users and compared to new proposals. However the data
must be correctly used in order to avoid bias in the results.
If the whole data set is used both to build and to validate the model generated by a
ML algorithm, we have no clue about how the model will behave with new, unseen
cases. Two main problems may arise by using the same data to train and evaluate the
model:
• Underfitting is the easiest problem to understand. It happens when the model
is poorly adjusted to the data, suffering from high error both in training and test
(unseen) data.
Fig. 2.1 Typical evolution of % error when adjusting a supervised model. Underfitting is noticeable
in the left side of the figure
• Overfitting happens when the model is too tightly adjusted to data offering high
precision to known cases but behaving poorly with unseen data.
By using the whole data we may become aware of underfitting problems due to a low performance of the model. Adjusting such a model to better fit the data may lead to overfitting, but the lack of unseen cases makes it impossible to notice this situation. Please also note that taking this procedure to an extreme may lead to overfitting, as represented in Fig. 2.1. According to Occam's Razor, given two models with similar generalization errors, one should prefer the simpler model over the more complex one.
Overfitting may also appear due to other reasons, such as noise, as it may force the model to be wrongly adjusted to false regions of the problem space. The lack of data will also cause underfitting, as the inner measures followed by the ML algorithm can only take into account the known examples and their distribution in the space.
In order to control the model's performance, avoid overfitting and obtain a generalizable estimation of the quality of the model, several partitioning schemes have been introduced in the literature. The most common one is k-Fold Cross Validation (k-FCV) [17]:
1. In k-FCV, the original data set is randomly partitioned into k equal size folds or
partitions.
2. From the k partitions, one is retained as the validation data for testing the model,
and the remaining k − 1 subsamples are used to build the model.
3. As we have k partitions, the process is repeated k times, with each of the k subsamples used exactly once as the validation data.
Finally the k results obtained from each one of the test partitions must be combined,
usually by averaging them, to produce a single value as depicted in Fig. 2.2. This
procedure is widely used as it has been proved that these schemes asymptotically
converge to a stable value, which allows realistic comparisons between classifiers
[6, 26]. The value of k may vary, 5 and 10 being the most common choices. Such a value needs to be adjusted to avoid generating a small test partition poorly populated with examples, which may bias the performance measures used. For big data sets 10-FCV is usually utilized, while for smaller data sets 5-FCV is more frequent.
Simple k-FCV may also disarrange the proportion of examples from each class in the test partition. The most commonly employed method in the literature to avoid this problem is stratified k-FCV. It places samples of each class in each partition so that the class distribution is kept (approximately) equal in all partitions.
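A from-scratch sketch of stratified k-FCV index generation may clarify the procedure: the indices of each class are dealt out to the k folds so that every fold keeps roughly the class distribution of the whole data set. Library implementations (e.g. scikit-learn's StratifiedKFold) behave similarly; the labels below are invented.

```python
# Stratified k-fold index generation (simplified sketch).
import numpy as np

def stratified_kfold_indices(y, k, seed=0):
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(k)]
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)               # deal class members round-robin
    return [np.array(sorted(f)) for f in folds]

y = np.array([0] * 90 + [1] * 10)                  # imbalanced toy labels
for f, test_idx in enumerate(stratified_kfold_indices(y, k=5)):
    train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
    # build the model on train_idx, evaluate on test_idx, then average results
    print("fold %d: %d test examples, %d of class 1" %
          (f, len(test_idx), int(y[test_idx].sum())))
```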
Other popular validation schemes are:
• In 5 × 2 CV [22] the whole data set is randomly partitioned into two subsets A and B. The model is first built using A and validated with B, and then the process is reversed, with the model built on B and tested on A. This partitioning process is repeated as desired (five times in 5 × 2 CV), aggregating the performance measure obtained in each step (a code sketch of this scheme follows this list).
Figure 2.3 illustrates the process. Stratified 5 × 2 cross-validation is the variation
most commonly used in this scheme.
• Leave one out is an extreme case of k-FCV, where k equals the number of examples
in the data set. In each step only one instance is used to test the model whereas the
rest of instances are used to learn it.
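The sketch below, referenced from the 5 × 2 CV item above, implements that scheme with an abstract evaluation function and a trivial nearest-centroid classifier used only to make the example runnable; both are illustrative assumptions.

```python
# 5 x 2 cross-validation: five random halvings, each used in both directions.
import numpy as np

def five_by_two_cv(X, y, fit_and_score, seed=0):
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(5):
        perm = rng.permutation(len(y))
        A, B = perm[: len(y) // 2], perm[len(y) // 2:]
        scores.append(fit_and_score(X[A], y[A], X[B], y[B]))  # train A, test B
        scores.append(fit_and_score(X[B], y[B], X[A], y[A]))  # train B, test A
    return float(np.mean(scores))                             # average of 10 scores

def nearest_centroid_score(Xtr, ytr, Xte, yte):
    # toy classifier: predict the class whose centroid is closest
    cents = {c: Xtr[ytr == c].mean(axis=0) for c in np.unique(ytr)}
    preds = [min(cents, key=lambda c: np.linalg.norm(x - cents[c])) for x in Xte]
    return float(np.mean(np.array(preds) == yte))

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(3, 1, (60, 2))])
y = np.array([0] * 60 + [1] * 60)
print("5x2 CV accuracy: %.3f" % five_by_two_cv(X, y, nearest_centroid_score))
```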
How to partition the data is a key issue, as it will largely influence the performance of the methods and the conclusions extracted from that point on. A bad partitioning will surely lead to incomplete and/or biased information about the behavior of the model being evaluated. This issue is being actively investigated nowadays, with special attention being paid to data set shift [21] as a decisive factor that imposes large k values in k-FCV to reach performance stability in the model being evaluated.
Most of the preprocessing stages aim to improve the quality of the data. Such improvement is later measured by analyzing the model constructed over the data, and it depends on the type of DM process carried out afterwards. Predictive processes like classification and regression rely on a measure of how well the model fits the data, resulting in a series of measures that work over the predictions made.
In the classification literature we can observe that most of the performance measures are designed for binary-class problems [25]. Well-known accuracy measures for binary-class problems are: classification rate, precision, sensitivity, specificity, G-mean [3], F-score, AUC [14], Youden's index γ [31] and Cohen's kappa [4].
Some of the two-class accuracy measures have been adapted for multi-class problems. For example, in a recent paper [18], the authors propose an approximation to multi-class ROC analysis, which is theoretically possible but whose computation is still restrictive. Only two measures are widely used because of their simplicity and successful application when the number of classes is large enough: classification rate and Cohen's kappa, which are explained in the following.
• Classification rate (also known as accuracy): is the number of successful hits
relative to the total number of classifications. It has been by far the most commonly
used metric for assessing the performance of classifiers for years [1, 19, 28].
• Cohen’s kappa: is an alternative to classification rate, a method, known for decades,
that compensates for random hits (Cohen 1960). Its original purpose was to mea-
sure the degree of agreement or disagreement between two people observing the
2.1 Data Sets and Partitions 25
where xii is the cell count in the main diagonal, n is the number of examples in the
data set, C is the number of class labels and xi· , x·i are the rows and columns total
counts respectively.
Cohen’s kappa ranges from −1 (total disagreement) through 0 (random classifi-
cation) to 1 (perfect agreement). Being a scalar, it is less expressive than ROC curves
when applied to binary-classification. However, for multi-class problems, kappa is
a very useful, yet simple, meter for measuring the accuracy of the classifier while
compensating for random successes.
The main difference between classification rate and Cohen’s kappa is the scoring
of the correct classifications. Classification rate scores all the successes over all
classes, whereas Cohen’s kappa scores the successes independently for each class
and aggregates them. The second way of scoring is less sensitive to randomness
caused by different number of examples in each class, which causes a bias in the
learner towards the obtention of data-dependent models.
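The sketch below computes both measures from an invented confusion matrix, following the notation given above (x_ii diagonal counts, x_i· and x_·i marginal totals, n examples, C classes).

```python
# Classification rate and Cohen's kappa from a confusion matrix (illustrative).
import numpy as np

conf = np.array([[50,  5,  0],
                 [ 4, 30,  6],
                 [ 1,  2, 12]])

n = conf.sum()
accuracy = np.trace(conf) / n                       # classification rate

row_tot = conf.sum(axis=1)                          # x_i.
col_tot = conf.sum(axis=0)                          # x_.i
kappa = (n * np.trace(conf) - np.dot(row_tot, col_tot)) / \
        (n ** 2 - np.dot(row_tot, col_tot))

print("accuracy: %.3f  kappa: %.3f" % (accuracy, kappa))
```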
Using raw performance measures to compare different ML methods and to establish a ranking is discouraged. It has recently been shown [5, 10] that other tools of a statistical nature must be utilized in order to obtain meaningful and durable conclusions.
In recent years, there has been a growing interest in the experimental analysis carried out in the field of DM. This is noticeable in the numerous papers that analyze and propose different methodological issues, such as the basis for experimental comparisons of algorithms, proposals of different comparison methodologies, or proposals for the use of different statistical techniques in the comparison of algorithms.
The “no free lunch” theorem [29] demonstrates that it is not possible to find one algorithm that behaves better for every problem. On the other hand, we know that we can work with different degrees of knowledge about the problem we expect to solve, and that it is not the same to work without knowledge of the problem (the hypothesis of the “no free lunch” theorem) as to work with partial knowledge about the problem, knowledge that allows us to design algorithms with specific characteristics which can make them more suitable for solving the problem.
In [24] the distinction between parametric and non-parametric tests is based on the level of measurement represented by the data to be analyzed. That is, a parametric test usually uses data composed of real values.
However, this does not imply that whenever we have this type of data we should use a parametric test. Other initial assumptions must be fulfilled for the safe usage of parametric tests. The non-fulfillment of these conditions might cause a statistical analysis to lose credibility.
The following conditions are needed in order to safely carry out parametric tests
[24, 32]:
• Independence: In statistics, two events are independent when the fact that one
occurs does not modify the probability of the other one occurring.
• Normality: An observation is normal when its behaviour follows a normal or Gaussian distribution with a certain mean μ and variance σ. A normality test applied over a sample can indicate the presence or absence of this condition in the observed data. Three normality tests are usually used in order to check whether normality is present or not (a code sketch applying them follows this list):
– Kolmogorov–Smirnov: compares the cumulative distribution of the observed data with the cumulative distribution expected from a Gaussian distribution, obtaining the p-value based on the discrepancy between both.
– Shapiro–Wilk: analyzes the observed data to compute the level of symmetry and
kurtosis (shape of the curve) in order to compute the difference with respect to
a Gaussian distribution afterwards, obtaining the p-value from the sum of the
squares of the discrepancies.
– D’Agostino–Pearson: first computes the skewness and kurtosis to quantify how
far from Gaussian the distribution is in terms of asymmetry and shape. It then
calculates how far each of these values differs from the value expected with
a Gaussian distribution, and computes a single p-value from the sum of the
discrepancies.
• Heteroscedasticity: This property indicates the existence of a violation of the
hypothesis of equality of variances. Levene’s test is used for checking whether or
not k samples present this homogeneity of variances (homoscedasticity). When
observed data does not fulfill the normality condition, this test’s result is more
reliable than Bartlett’s test [32], which checks the same property.
With respect to the independence condition, Demšar suggests in [5] that independence is not truly verified in k-FCV and 5 × 2 CV (a portion of the samples is used for both training and testing in different partitions). Hold-out partitions can be safely taken as independent, since training and test partitions do not overlap.
The independence of the events in terms of getting results is usually obvious, given
that they are independent runs of the algorithm with randomly generated initial seeds.
In the following, we show a normality analysis using the Kolmogorov–Smirnov, Shapiro–Wilk and D'Agostino–Pearson tests, together with a heteroscedasticity analysis using Levene's test, in order to show the reader how to check these properties. Let us consider a small case study in which we take into account a stochastic algorithm that needs a seed to generate its model. A classic example of this type of algorithm is the MLP. Using a small set of 6 well-known classification problems, we aim to analyze whether the conditions required to safely perform a parametric statistical analysis hold. We have used a 10-FCV validation scheme in which the MLP is run 5 times per fold, thus obtaining 50 results per data set. Please note that using k-FCV means that independence does not hold, but it is the most common validation scheme used in classification, so this case study is still relevant.
First of all, we want to check whether our samples follow a normal distribution. Table 2.2 shows the p-values obtained for the normality tests described in the previous section. As we can observe, in many cases the normality assumption does not hold (indicated by an “a” in the table).
In addition to this general study, we show the sample distribution in three cases,
with the objective of illustrating representative cases in which the normality tests
obtain different results.
From Figs. 2.4 to 2.6, different examples of graphical representations of histograms and Q-Q graphics are shown. A histogram represents a statistical variable by using bars, so that the area of each bar is proportional to the frequency of the represented values. A Q-Q graphic represents a comparison between the quantiles of the observed data and those of the normal distribution.
In Fig. 2.4 we can observe a general case in which non-normality is clearly present. On the contrary, Fig. 2.5 illustrates a sample whose distribution follows a normal shape, and the three normality tests employed verify this fact. Finally, Fig. 2.6 shows a special case where the similarity between both distributions, the sample of results and the normal distribution, is not confirmed by all normality tests. In this case, one normality test may work better than another, depending on the type of data, the number of ties or the number of results collected. Due to this fact, we have employed three well-known normality tests for studying the normality condition. The choice of the most appropriate normality test depending on the problem is outside the scope of this book.
Fig. 2.4 Example of non-normal distribution: glass data set for MLP: Histogram and Q-Q graphic
Fig. 2.5 Example of normal distribution: pima data set for MLP: Histogram and Q-Q graphic
Fig. 2.6 Example of a special case: cleveland data set for MLP: Histogram and Q-Q graphic
2.2 Using Statistical Tests to Compare Methods 29
Researchers are usually familiar with parametric tests for pairwise comparisons. ML approaches have traditionally been compared through parametric tests by means of paired t tests.
In some cases, the t test is accompanied by the non-parametric Wilcoxon test applied over multiple data sets. The use of these types of tests is correct when we are interested in finding the differences between two methods, but they must not be used when we are interested in comparisons that include several methods. When repeating pairwise comparisons, there is an associated error that grows as the number of comparisons increases, called the family-wise error rate (FWER), defined as the probability of at least one error in the family of hypotheses. To solve this problem, some authors use the Bonferroni correction for applying paired t tests in their works [27], although this is not recommended.
In this section our interest lies in presenting a methodology for analyzing the results offered by a pair of algorithms in a certain study, by using non-parametric tests in a multiple data set analysis. Furthermore, we want to comment on the possibility of comparison with other deterministic ML algorithms. Non-parametric tests can be applied to small samples of data and their effectiveness has been proven in complex experiments. They are preferable to an adjustment of the data with transformations or to discarding certain extreme observations (outliers) [16].
This section is devoted to describing a non-parametric statistical procedure for performing pairwise comparisons between two algorithms, the Wilcoxon signed-rank test (Sect. 2.2.3.1), and to showing the operation of this test in the presented case study (Sect. 2.2.3.2). Being non-parametric, the Wilcoxon signed-rank test allows us to apply it over the means obtained by the algorithms in each data set, without any assumptions about the sample of results obtained.
In the following, we will perform the statistical analysis by means of pairwise comparisons using the results of the performance measures obtained by the algorithms taken as reference for this section: MLP, RBFN, SONN and LVQ. A similar yet more detailed study can be found in [20].
In order to compare the results between two algorithms and to stipulate which one is the best, we can perform a Wilcoxon signed-rank test for detecting differences in both means. This statement must be enclosed by a probability of error, that is, the complement of the probability of reporting that two systems are the same, called the p-value [32]. The computation of the p-value in the Wilcoxon distribution can be carried out by computing a normal approximation [24]. This test is well known and is usually included in standard statistics packages (such as SPSS, R, SAS, etc.) as well as in open source implementations such as KEEL [9] (see Chap. 10).
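A minimal sketch of this pairwise comparison with SciPy is given below; the per-data-set performance values of the two algorithms are hypothetical placeholders.

```python
# Wilcoxon signed-rank test over the mean results per data set of two algorithms.
from scipy import stats

# one value per data set for each algorithm (e.g., mean accuracy); hypothetical
mlp  = [0.78, 0.81, 0.69, 0.93, 0.74, 0.85]
sonn = [0.80, 0.84, 0.71, 0.94, 0.78, 0.86]

stat, p_value = stats.wilcoxon(mlp, sonn)   # paired, non-parametric
print("W =", stat, "p-value =", p_value)
# A p-value below the chosen alpha (e.g., 0.05) indicates a significant
# difference between the two algorithms over these data sets.
```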
Table 2.4 shows the ranks obtained by each method, and Table 2.5 shows the results obtained in all possible comparisons of the algorithms considered in the example study. We mark with bullets the winning algorithm in each row/column when the associated p-value is below 0.1 and/or 0.05.
The comparisons performed in this study are independent, so they should never be considered as a whole. If we try to extract a conclusion which involves more than one comparison from the previous tables, we will lose control of the FWER. For instance, the statement: “SONN obtains a classification rate better than the RBFN, LVQ and MLP algorithms with a p-value lower than 0.05” is incorrect, since we do not prove the control of the FWER. What holds is that the SONN algorithm outperforms the MLP, RBFN and LVQ algorithms in terms of classification rate in independent comparisons.
The true statistical significance of combining pairwise comparisons is given by Eq. 2.2.
One of the most frequent situations where the use of statistical procedures is requested
is in the joint analysis of the results achieved by various algorithms. The groups of
differences between these methods (also called blocks) are usually associated with
the problems met in the experimental study. For example, in a multiple problem comparison, each block corresponds to the results offered for a specific problem. When referring to multiple comparison tests, a block is composed of three or more subjects or results, each one corresponding to the performance evaluation of one algorithm for the problem.
Please remember that in pairwise analysis, if we try to extract a conclusion involving more than one pairwise comparison, we will obtain an accumulated error coming from their combination, losing control of the Family-Wise Error Rate (FWER) (see Eq. 2.2).
This section is devoted to describing the use of several procedures for multiple
comparisons considering a control method. In this sense, a control method can be
defined as the most interesting algorithm for the researcher of the experimental study
(usually its new proposal). Therefore, its performance will be contrasted against the
rest of the algorithms of the study.
The best-known procedure for testing the differences between more than two
related samples, the Friedman test, will be introduced in the following.
Friedman test
Iman–Davenport test
Iman and Davenport [15] proposed a derivation of the Friedman statistic, since the latter often produces an undesirably conservative effect. The proposed statistic is
F_{ID} = \frac{(n-1)\chi_F^2}{n(k-1) - \chi_F^2},    (2.4)
which is distributed according to the F-distribution with k − 1 and (k − 1)(n − 1) degrees of freedom.
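A minimal sketch of both statistics in Python is given below; the per-data-set results of the four algorithms are hypothetical placeholders, and SciPy (rather than the KEEL suite used in the book) is assumed as tooling.

```python
# Friedman test plus the Iman-Davenport correction of Eq. (2.4).
import scipy.stats as st

# rows = data sets (n), columns = algorithms (k); hypothetical values
mlp  = [0.78, 0.81, 0.69, 0.93, 0.74, 0.85]
rbfn = [0.75, 0.79, 0.66, 0.90, 0.70, 0.82]
sonn = [0.80, 0.84, 0.71, 0.94, 0.78, 0.86]
lvq  = [0.72, 0.77, 0.65, 0.89, 0.69, 0.80]

n, k = len(mlp), 4
chi_f, p_friedman = st.friedmanchisquare(mlp, rbfn, sonn, lvq)

# Iman-Davenport statistic, distributed as F with (k-1, (k-1)(n-1)) d.o.f.
f_id = (n - 1) * chi_f / (n * (k - 1) - chi_f)
p_id = st.f.sf(f_id, k - 1, (k - 1) * (n - 1))
print(chi_f, p_friedman, f_id, p_id)
```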
Post-hoc procedures
The rejection of the null hypothesis in both tests described above does not imply the detection of the existing differences among the algorithms compared. They only inform us of the presence of differences among all samples of results compared. In order to conduct pairwise comparisons within the framework of multiple comparisons, we can proceed with a post-hoc procedure. In this case, a control algorithm (maybe a proposal to be compared) is usually chosen. Then, the post-hoc procedures proceed to compare the control algorithm with the remaining k − 1 algorithms. In the following, we describe three post-hoc procedures:
• Bonferroni-Dunn’s procedure [32]: it is similar to Dunnet’s test for ANOVA
designs. The performance of two algorithms is significantly different if the corre-
sponding average of rankings is at least as great as its critical difference (CD).
k(k + 1)
C D = qα (2.5)
6N
The value of z is used for finding the corresponding probability from the table
of the normal distribution ( p-value), which is compared with the corresponding
value of α. Holm’s method is more powerful than Bonferroni-Dunn’s and it does
no additional assumptions about the hypotheses checked.
• Hochberg's procedure [11]: it is a step-up procedure that works in the opposite direction to Holm's method, comparing the largest p-value with α, the next largest with α/2 and so forth until it encounters a hypothesis it can reject. All hypotheses with smaller p-values are then rejected as well. Hochberg's method is more powerful than Holm's [23].
The post-hoc procedures described above allow us to know whether or not a
hypothesis of comparison of means could be rejected at a specified level of sig-
nificance α. However, it is very interesting to compute the p-value associated to
each comparison, which represents the lowest level of significance of a hypothesis
that results in a rejection. In this manner, we can know whether two algorithms are
significantly different and also get a metric of how different they are.
In the following, we will describe the method for computing these exact p-values
for each test procedure, which are called “adjusted p-values” [30].
• The adjusted p-value for Bonferroni–Dunn's test (also known as the Bonferroni correction) is calculated by p_{Bonf} = (k − 1)p_i.
• The adjusted p-value for Holm's procedure is computed by p_{Holm} = (k − i)p_i. Once all of them have been computed for all hypotheses, it is not possible to find an adjusted p-value for hypothesis i lower than that for hypothesis j, j < i. In this case, the adjusted p-value for hypothesis i is set equal to the p-value associated with hypothesis j.
• The adjusted p-value for Hochberg's method is computed with the same formula as Holm's, and the same restriction is applied in the process, but in the opposite direction, so that it is not possible to find an adjusted p-value for hypothesis i lower than that for hypothesis j, j > i. A sketch of these three computations is given after this list.
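The following minimal sketch implements the three adjusted p-value computations described above for k − 1 hypotheses against a control algorithm; the unadjusted p-values used in the example call are hypothetical.

```python
# Bonferroni-Dunn, Holm and Hochberg adjusted p-values from unadjusted ones.
def adjusted_p_values(p_unadjusted):
    k = len(p_unadjusted) + 1                  # number of algorithms
    p_sorted = sorted(p_unadjusted)            # ordered hypotheses, smallest first

    bonferroni = [min(1.0, (k - 1) * p) for p in p_sorted]

    # Holm: (k-i)*p_i, enforcing monotonicity from the smallest p-value upwards
    holm, running_max = [], 0.0
    for i, p in enumerate(p_sorted, start=1):
        running_max = max(running_max, (k - i) * p)
        holm.append(min(1.0, running_max))

    # Hochberg: same formula, monotonicity enforced from the largest p-value down
    hochberg, running_min = [0.0] * len(p_sorted), 1.0
    for i in range(len(p_sorted), 0, -1):
        running_min = min(running_min, (k - i) * p_sorted[i - 1])
        hochberg[i - 1] = running_min
    return p_sorted, bonferroni, holm, hochberg

print(adjusted_p_values([0.0002, 0.004, 0.03]))  # hypothetical unadjusted p-values
```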
In this section we carry out a toy example of the analysis of a multiple comparison using the same ML algorithms as in Sect. 2.2.3.2: MLP, RBFN, SONN and LVQ.
Table 2.6 shows the ranks obtained by each algorithm for the Friedman test. From this table we can observe that SONN is the algorithm with the lowest rank and hence will act as the control algorithm.
Computing the Friedman and Iman–Davenport statistics as in Eqs. (2.3) and (2.4), the respective values are:
• Friedman statistic (distributed according to chi-squared with 3 degrees of freedom): 12.2. p-value computed by the Friedman test: 0.006729.
• Iman–Davenport statistic (distributed according to the F-distribution with 3 and 15 degrees of freedom): 10.517241. p-value computed by the Iman–Davenport test: 0.000561296469.
In our case, both Friedman's and Iman–Davenport's tests indicate that significant differences in the results are found in this study. Due to these results, a post-hoc statistical analysis is required. In this analysis, we choose the best performing method, SONN, as the control method for comparison with the rest of the algorithms.
Post-hoc comparison
We will first present the results obtained for Bonferroni–Dunn's, Holm's and Hochberg's post-hoc tests with no adjustment of the p-values. Table 2.7 summarizes the unadjusted p-values for each algorithm when compared to SONN.
By computing Bonferroni–Dunn's CD as in Eq. (2.5), those hypotheses that have an unadjusted p-value ≤ 0.016667 are rejected. By using the z value indicated for Holm's and Hochberg's procedures, we can observe that they reject those hypotheses that have an unadjusted p-value ≤ 0.05. The reader may notice that Bonferroni–Dunn's procedure is not able to reject the null hypothesis for SONN versus MLP, while Holm's and Hochberg's are able to reject all null hypotheses due to their higher statistical power.
The reader may usually refer directly to the adjusted p-values for the post-hoc methods, as they make searching for critical differences unnecessary and improve readability. Table 2.8 contains the adjusted p-values for Bonferroni–Dunn's, Holm's and Hochberg's tests, derived from the unadjusted values.
Taking a significance level of α = 0.05, we can observe that the conclusions obtained from Table 2.8 are the same as those obtained from Table 2.7, without the need to establish the unadjusted p-value that acts as a threshold for the null-hypothesis rejection. The user only needs to observe those adjusted p-values that fall under the desired α significance level.
References
1. Alpaydin, E.: Introduction to Machine Learning, 2nd edn. MIT Press, Cambridge (2010)
2. Bache, K., Lichman, M.: UCI machine learning repository (2013). http://archive.ics.uci.edu/
ml
3. Barandela, R., Sánchez, J.S., García, V., Rangel, E.: Strategies for learning in class imbalance
problems. Pattern Recognit. 36(3), 849–851 (2003)
4. Ben-David, A.: A lot of randomness is hiding in accuracy. Eng. Appl. Artif. Intell. 20(7),
875–885 (2007)
5. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res.
7, 1–30 (2006)
6. Efron, B., Gong, G.: A leisurely look at the bootstrap, the jackknife, and cross-validation. Am.
Stat. 37(1), 36–48 (1983)
7. Friedman, M.: The use of ranks to avoid the assumption of normality implicit in the analysis
of variance. J. Am. Stat. Assoc. 32(200), 675–701 (1937)
8. Friedman, M.: A comparison of alternative tests of significance for the problem of m rankings.
Ann. Math. Stat. 11(1), 86–92 (1940)
9. García, S., Fernández, A., Luengo, J., Herrera, F.: Advanced nonparametric tests for multi-
ple comparisons in the design of experiments in computational intelligence and data mining:
Experimental analysis of power. Inf. Sci. 180(10), 2044–2064 (2010)
10. García, S., Herrera, F.: An extension on “statistical comparisons of classifiers over multiple
data sets” for all pairwise comparisons. J. Mach. Learn. Res. 9, 2677–2694 (2008)
11. Hochberg, Y.: A sharper bonferroni procedure for multiple tests of significance. Biometrika
75(4), 800–802 (1988)
12. Hodges, J., Lehmann, E.: Rank methods for combination of independent experiments in analysis
of variance. Ann. Math. Statist 33, 482–497 (1962)
13. Holm, S.: A simple sequentially rejective multiple test procedure. Scand. J. Stat. 6, 65–70
(1979)
14. Huang, J., Ling, C.X.: Using AUC and accuracy in evaluating learning algorithms. IEEE Trans.
Knowl. Data Eng. 17(3), 299–310 (2005)
15. Iman, R., Davenport, J.: Approximations of the critical region of the Friedman statistic. Com-
mun. Stat. 9, 571–595 (1980)
38 2 Data Sets and Proper Statistical Analysis of Data Mining Techniques
16. Koch, G.: The use of non-parametric methods in the statistical analysis of a complex split plot
experiment. Biometrics 26(1), 105–128 (1970)
17. Kohavi, R.: A study of cross-validation and bootstrap for accuracy estimation and model
selection. Proceedings of the 14th international joint conference on Artificial intelligence.
IJCAI’95, vol. 2, pp. 1137–1143. Morgan Kaufmann Publishers Inc., San Francisco, CA (1995)
18. Landgrebe, T.C., Duin, R.P.: Efficient multiclass ROC approximation by decomposition via
confusion matrix perturbation analysis. IEEE Trans. Pattern Anal. Mach. Intell. 30(5), 810–822
(2008)
19. Lim, T.S., Loh, W.Y., Shih, Y.S.: A comparison of prediction accuracy, complexity, and training
time of thirty-three old and new classification algorithms. Mach. Learn. 40(3), 203–228 (2000)
20. Luengo, J., García, S., Herrera, F.: A study on the use of statistical tests for experimentation
with neural networks: Analysis of parametric test conditions and non-parametric tests. Expert
Syst. Appl. 36(4), 7798–7808 (2009)
21. Moreno-Torres, J.G., Sáez, J.A., Herrera, F.: Study on the impact of partition-induced dataset
shift on k -fold cross-validation. IEEE Trans. Neural Netw. Learn. Syst. 23(8), 1304–1312
(2012)
22. Salzberg, S.L.: On comparing classifiers: Pitfalls to avoid and a recommended approach. Data
Min. Knowl. Discov. 1(3), 317–328 (1997)
23. Shaffer, J.P.: Multiple hypothesis testing. Annu. Rev. Psychol. 46(1), 561–584 (1995)
24. Sheskin, D.J.: Handbook of Parametric and Nonparametric Statistical Procedures. Chapman
& Hall/CRC, Boca Raton (2007)
25. Sokolova, M., Japkowicz, N., Szpakowicz, S.: Beyond accuracy, f-score and roc: A family of
discriminant measures for performance evaluation. In: A. Sattar, B.H. Kang (eds.) Australian
Conference on Artificial Intelligence, Lecture Notes in Computer Science, vol. 4304, pp. 1015–
1021. Springer (2006).
26. Stone, M.: Asymptotics for and against cross-validation. Biometrika 64(1), 29–35 (1977)
27. Tan, K.C., Yu, Q., Ang, J.H.: A coevolutionary algorithm for rules discovery in data mining.
Int. J. Syst. Sci. 37(12), 835–864 (2006)
28. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, Sec-
ond Edition (Morgan Kaufmann Series in Data Management Systems). Morgan Kaufmann
Publishers Inc., San Francisco (2005)
29. Wolpert, D.H., Macready, W.G.: No free lunch theorems for optimization. Trans. Evol. Comp.
1(1), 67–82 (1997)
30. Wright, S.P.: Adjusted P-values for simultaneous inference. Biometrics 48(4), 1005–1013
(1992)
31. Youden, W.J.: Index for rating diagnostic tests. Cancer 3(1), 32–35 (1950)
32. Zar, J.: Biostatistical Analysis, 4th edn. Prentice Hall, Upper Saddle River (1999)
Chapter 3
Data Preparation Basic Models
Abstract The basic preprocessing steps carried out in Data Mining convert real-world data to a computer readable format. An overall overview related to this topic is given in Sect. 3.1. When there are several or heterogeneous sources of data, an integration of the data needs to be performed. This task is discussed in Sect. 3.2. After the data is computer readable and constitutes a unique source, it usually goes through a cleaning phase where data inaccuracies are corrected. Section 3.3 focuses on the latter task. Finally, some Data Mining applications involve particular constraints like ranges for the data features, which may imply the normalization of the features (Sect. 3.4) or the transformation of the features of the data distribution (Sect. 3.5).
3.1 Overview
Data gathered in data sets can present multiple forms and come from many different sources. Data directly extracted from relational databases or obtained from the real world is completely raw: it has not been transformed, cleansed or changed at all. Therefore, it may contain errors due to wrong data entry procedures or missing data, or inconsistencies due to ill-handled data merging processes.
Three elements define data quality [15]: accuracy, completeness and consistency.
Unfortunately real-world data sets often present the opposite conditions, and the
reasons may vary as mentioned above. Many preprocessing techniques have been
devised to overcome the problems present in such real-world data sets and to obtain
a final, reliable and accurate data set to later apply a DM technique [35].
Gathering all the data elements together is not an easy task when the examples come from different sources and have to be merged into a single data set. Integrating data from different databases is usually called data integration. Different attribute names or table schemes will produce uneven examples that need to be consolidated. Moreover, attribute values may represent the same concept but with different names, creating inconsistencies in the instances obtained. If some attributes are calculated from the others, the data set will present a large size but the information contained will not scale accordingly: detecting and eliminating redundant attributes is needed.
Having a uniform data set without measurable inconsistencies does not mean that the data is clean. Errors like MVs or uncontrolled noise may still be present. A data cleaning step is usually needed to filter or correct wrong data. Otherwise, the knowledge extracted by a DM algorithm will be barely accurate, or DM algorithms will not be able to handle the data.
Ending up with a consistent and almost error-free data set does not mean that the data is in the best form for a DM algorithm. Some algorithms in DM, such as ANNs or distance-based methods, work much better with normalized attribute values. Others are not able to work with nominal valued attributes, or benefit from subtle transformations of the data. Data normalization and data transformation techniques have been devised to adapt a data set to the needs of the DM algorithm that will be applied afterwards. Note that eliminating redundant attributes and inconsistencies may still yield a large data set that will slow down the DM process. The use of data reduction techniques to transform the data set is quite useful, as they can reduce the number of attributes or instances while keeping the information as intact as possible.
To sum up, real-world data is usually incomplete, dirty and inconsistent. Therefore, data preprocessing techniques are needed to improve the accuracy and efficiency of the subsequent DM technique used. The rest of the chapter further describes the basic techniques used to perform the preparation of the data set, while pointing the reader to the chapters where advanced techniques are described and presented in more depth.
3.2 Data Integration
One hard problem in DM is to collect a single data set with information coming from varied and different sources. If the integration process is not properly performed, redundancies and inconsistencies will soon appear, resulting in a decrease of the accuracy and speed of the subsequent DM processes. Matching the schema from different sources presents a notorious problem that does not usually come alone: inconsistent and duplicated tuples as well as redundant and correlated attributes are problems that the data set creation process will probably show sooner or later.
An essential part of the integration process is to build a data map that establishes how each instance should be arranged in a common structure to represent a real-world example. When data is obtained from relational databases, it is usually flattened, i.e., gathered together into one single record. Some database frameworks enable the user to provide a map to directly traverse the database through in-database access utilities. While using these in-database mining tools has the advantage of not having to extract and create an external file for the data, it is not the best option for treatment with preprocessing techniques. In this case, extracting the data is usually the best option. Preprocessing takes time, and when the data is kept in the database, preprocessing has to be applied repeatedly. If the data is extracted to an external file instead, processing and modeling it can be faster, provided the preprocessed data completely fits in memory.
Automatic approaches used to integrate the data can be found in the literature,
from techniques that match and find the schemas of the data [7, 8], to automatic
procedures that reconcile different schemas [6].
Suppose that two nominal attributes, A and B, contain c and r distinct values each, namely a_1, ..., a_c and b_1, ..., b_r. We can check the correlation between them using the χ² test. In order to do so, a contingency table with the joint events (A_i, B_j), in which attribute A takes the value a_i and attribute B takes the value b_j, is created. Every possible joint event (A_i, B_j) has its own entry in the table. The χ² value (or Pearson χ² statistic) is computed as:
\chi^2 = \sum_{i=1}^{c}\sum_{j=1}^{r}\frac{(o_{ij} - e_{ij})^2}{e_{ij}},    (3.1)
where o_{ij} is the observed frequency of the joint event (A_i, B_j), and e_{ij} is the expected frequency of (A_i, B_j), computed as:
e_{ij} = \frac{count(A = a_i) \times count(B = b_j)}{m},    (3.2)
where m is the number of instances in the data set, count (A = ai ) is the number
of instances with the value ai for attribute A and count (B = b j ) is the number of
instances having the value b j for attribute B.
The χ² test checks the hypothesis that A and B are independent, with (r − 1) × (c − 1) degrees of freedom. The χ² statistic obtained in Eq. (3.1) is compared against a χ² table using the suitable degrees of freedom, or against the value provided by any available software able to compute it. If the associated significance level is below the established one (or the statistic value computed is above the critical one in the table), the null hypothesis is rejected and therefore A and B are statistically correlated.
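A minimal sketch of this check using SciPy is shown below; the contingency table of joint frequencies is a hypothetical example.

```python
# Chi-squared correlation check between two nominal attributes.
from scipy.stats import chi2_contingency

# contingency table of joint frequencies o_ij for attributes A (rows) and B (columns)
observed = [[250, 200],
            [ 50, 1000]]

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print("chi2 =", chi2, "dof =", dof, "p-value =", p_value)
# A p-value below the chosen significance level rejects independence,
# i.e., A and B are statistically correlated.
```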
When we have two numerical attributes, checking whether or not they are highly correlated is useful to determine if they are redundant. The most well-known correlation coefficient is Pearson's product moment coefficient, given by:
r_{A,B} = \frac{\sum_{i=1}^{m}(a_i - \bar{A})(b_i - \bar{B})}{m\sigma_A\sigma_B} = \frac{\sum_{i=1}^{m}(a_ib_i) - m\bar{A}\bar{B}}{m\sigma_A\sigma_B},    (3.3)
where m is the number of instances, a_i and b_i are the values of attributes A and B in instance i, \bar{A} and \bar{B} are the mean values of A and B respectively, and σ_A and σ_B are the standard deviations of A and B.
Please note that −1 ≤ r_{A,B} ≤ +1. When r_{A,B} > 0, the two attributes are positively correlated: when the values of A increase, the values of B increase too. The higher the coefficient, the stronger the correlation between them. A high value of r_{A,B} may also indicate that one of the two attributes can be removed.
When r_{A,B} = 0, the attributes A and B are uncorrelated and no linear relationship can be found between them. If r_{A,B} < 0, then attributes A and B are negatively correlated: when the values of one attribute increase, the values of the other attribute decrease. Scatter plots can be useful to examine how correlated the data is and to visually check the results obtained.
Similarly to correlation, covariance is a useful and widely used measure in statistics to check how much two variables change together. Considering that the mean values are the expected values of attributes A and B, namely E(A) = \bar{A} and E(B) = \bar{B}, the covariance between both is defined as
Cov(A, B) = E((A - \bar{A})(B - \bar{B})) = \frac{\sum_{i=1}^{m}(a_i - \bar{A})(b_i - \bar{B})}{m}.    (3.4)
It is easy to see the relation between the covariance and the correlation coefficient r_{A,B} given in Eq. (3.3), expressed as
r_{A,B} = \frac{Cov(A, B)}{\sigma_A\sigma_B}.    (3.5)
If two attributes vary similarly, when A > \bar{A} then probably B > \bar{B}, and thus the covariance is positive. On the other hand, when one attribute tends to be above its expected value whereas the other is below its expected value, the covariance is negative. When the two variables are independent, it is satisfied that E(A · B) = E(A) · E(B), and thus the covariance verifies
Cov(A, B) = E(A \cdot B) - \bar{A}\bar{B} = 0.    (3.6)
The reader must be cautious, as having Cov(A, B) = 0 does not imply that the two attributes are independent, since some random variables may present a covariance of 0 but still be dependent. Additional assumptions (e.g., that the data follows a multivariate normal distribution) are necessary when the covariance is 0 in order to determine whether the two attributes are independent.
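A minimal sketch of Eqs. (3.3)–(3.5) with NumPy follows; the two attribute vectors are hypothetical.

```python
# Pearson correlation coefficient and covariance between two numerical attributes.
import numpy as np

a = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
b = np.array([1.1, 2.2, 2.9, 4.1, 5.0])

m = len(a)
cov_ab = ((a - a.mean()) * (b - b.mean())).sum() / m   # Eq. (3.4)
r_ab = cov_ab / (a.std() * b.std())                    # Eq. (3.5), population std
print("Cov(A,B) =", cov_ab, "r_AB =", r_ab)

# the same coefficient obtained directly from NumPy
print(np.corrcoef(a, b)[0, 1])
```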
It is interesting to check, once the tuples have been obtained, that there are no duplicated tuples. One source of duplication is the use of denormalized tables, sometimes used to speed up processes involving join operations.
Having duplicate tuples is troublesome not only because they waste space and computing time for the DM algorithm, but also because they can be a source of inconsistency. Due to errors in the entry process, differences in some attribute values (for example the identifier value) may produce identical repeated instances that are nevertheless considered different. These samples are harder to detect than by simply scanning the data set for duplicate instances.
Please note that sometimes the duplicity is subtle. For example, if the information comes from different sources, the systems of measurement may be different as well, resulting in some instances being actually the same but not identified as such. Their values can be represented using the metric system in one source and the imperial system in another, resulting in a not-so-obvious duplication. The instances may also be inconsistent if attribute values are out of the established range (usually indicated in the associated metadata for the data set), but this is an easy-to-check condition.
One of the most common sources of mismatches between instances is the nominal attributes [9]. Analyzing the similarity between nominal values is not trivial, as distance functions cannot be applied in a straightforward way and several alternatives exist. Several character-based distance measures for nominal values can be found in the literature. These can be helpful to determine whether two nominal values are similar (even in the presence of entry errors) or different [9]; a sketch of two of them is given after the following list:
• The edit distance [23] between two strings σ1 and σ2 is the minimum number of string operations (or edit operations) needed to convert one string into the other. Three types of edit operations are usually considered: inserting a character, replacing a character or deleting a character. The number of operations can be computed using dynamic programming. Modern versions of this distance measure assign different costs to each edit operation, depending on the characters involved [31].
• The edit distance does not work well if the string values have been shortened. The affine gap distance adds two edit operations: opening a gap and extending a gap. With these operations the truncated values can be matched with their original full string counterparts.
• Jaro introduced in [17] a string comparison algorithm mainly aimed at comparing first and last names in census data. It uses the string lengths |σ1| and |σ2|, the number c of common characters between σ1 and σ2, and the number t of transpositions, obtained by comparing the ordering of the common characters in both strings and counting each mismatch in the order as a transposition. The Jaro distance value is computed as
as
1 c c c − t/2
J ar o(σ1 , σ2 ) = + + . (3.7)
3 |σ1 | |σ2 | c
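A minimal Python sketch of the edit distance and of the Jaro value of Eq. (3.7) is given below; it is an illustrative implementation, not the one used in the cited record-linkage systems.

```python
def edit_distance(s1, s2):
    """Levenshtein distance via dynamic programming (insert/delete/replace)."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # replacement
        prev = curr
    return prev[-1]


def jaro(s1, s2):
    """Jaro similarity following Eq. (3.7)."""
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    match1, match2, c = [False] * len(s1), [False] * len(s2), 0
    for i, ch in enumerate(s1):                   # find common characters
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not match2[j] and s2[j] == ch:
                match1[i] = match2[j] = True
                c += 1
                break
    if c == 0:
        return 0.0
    m1 = [ch for ch, m in zip(s1, match1) if m]
    m2 = [ch for ch, m in zip(s2, match2) if m]
    t = sum(x != y for x, y in zip(m1, m2))       # mismatches in order
    return (c / len(s1) + c / len(s2) + (c - t / 2) / c) / 3


print(edit_distance("gumbo", "gambol"))   # 2
print(jaro("MARTHA", "MARHTA"))           # ~0.944
```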
3.3 Data Cleaning
After the data is correctly integrated into a single data set, this does not mean that the data is free from errors. The integration may result in an inordinate proportion of the data being dirty [20]. Broadly, dirty data includes missing data, wrong data and non-standard representations of the same data. If a high proportion of the data is dirty, applying a DM process will surely result in an unreliable model. Dirty data has varying degrees of impact depending on the DM algorithm, but it is difficult to quantify such an impact.
Before applying any DM technique, the data must be cleaned to remove or repair dirty data. The sources of dirty data include data entry errors, data update errors, data transmission errors and even bugs in the data processing system. As a result, dirty data is usually present in two forms: missing data and wrong (noisy) data. The authors in [20] also include inconsistent instances under this categorization, but we assume that such erroneous instances have already been addressed as indicated in Sect. 3.2.2.
The presence of a high proportion of dirty data in the training data set and/or the test data set will likely produce a less reliable model. The impact of dirty data also depends on the particular DM algorithm applied. Decision trees are known to be susceptible to noise (especially if the trees are of higher order than two) [2]. ANNs and distance-based algorithms (like the KNN algorithm) are also known to be susceptible to noise. The use of distance measures is heavily dependent on the values of the data,
and if the data used is dirty the results will be faulty. However, the detection of noisy data in the data set is not a trivial task [16], and a wrong detection will result in damage to correct data.
The ways of handling MVs and noisy data are quite different:
• MVs are treated by routines prior to the application of the DM algorithm. The instances containing MVs can be ignored, or filled in manually or with a constant. Elaborate strategies that use estimations over the data are recommended in order to obtain reliable and more general results. This task is studied in more depth in Chap. 4.
• The presence of noise in data is often defined as a random error in a measured variable that changes its value. Basic statistical and descriptive techniques such as scatter plots can be used to identify outliers. Multiple linear regression can be used to estimate the tendency of the attribute values if they are numerical. However, the most recommended approach in the literature is noise detection and treatment, usually by filtering. Chapter 5 is completely devoted to noise identification and filtering.
3.4 Data Normalization
The data collected in a data set may not be useful enough for a DM algorithm. Sometimes the attributes selected are raw attributes that have a meaning in the original domain from which they were obtained, or are designed to work with the operational system in which they are currently used. Usually these original attributes are not good enough to obtain accurate predictive models. Therefore, it is common to perform a series of manipulation steps to transform the original attributes or to generate new attributes with better properties that will help the predictive power of the model. The new attributes are usually named modeling variables or analytic variables.
In this section we will focus on the transformations that do not generate new attributes, but transform the distribution of the original values into a new set of values with the desired properties.
The min-max normalization aims to scale all the values v of a numerical attribute A to a specified range denoted by [new_min_A, new_max_A]. A transformed value v' is obtained by applying the following expression to v:
v' = \frac{v - min_A}{max_A - min_A}(new_max_A - new_min_A) + new_min_A,    (3.8)
where max A and min A are the original maximum and minimum attribute values
respectively.
In the literature “normalization” usually refers to a particular case of the min-max normalization in which the final interval is [0, 1], that is, new_min_A = 0 and new_max_A = 1. The interval [−1, 1] is also typical when normalizing the data.
This type of normalization is very common in data sets being prepared for learning methods based on distances. Using a normalization to re-scale all the data to the same range of values avoids attributes with a large max_A − min_A difference dominating the other ones in the distance calculation, which would mislead the learning process by giving more importance to the former attributes. This normalization is also known to speed up the learning process in ANNs, helping the weights to converge faster.
An alternative but equivalent formulation of the min-max normalization uses a base value new_min_A and the desired new range R into which the values will be mapped after the transformation. Some well-known software packages such as SAS or Weka [14] use this type of formulation for the min-max transformation:
v' = new_min_A + R\,\frac{v - min_A}{max_A - min_A}.    (3.9)
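A minimal NumPy sketch of the min-max normalization of Eq. (3.8), applied to a hypothetical attribute and mapping it to [0, 1], is shown below.

```python
# Min-max normalization of a numerical attribute to [new_min, new_max].
import numpy as np

v = np.array([12.0, 45.0, 3.0, 27.0, 30.0])
new_min, new_max = 0.0, 1.0

v_scaled = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min
print(v_scaled)
```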
In some cases, the min-max normalization is not useful or cannot be applied. When
the minimum or maximum values of attribute A are not known, the min-max normal-
ization is infeasible. Even when the minimum and maximum values are available,
the presence of outliers can bias the min-max normalization by grouping the values
and limiting the digital precision available to represent the values.
If \bar{A} is the mean of the values of attribute A and σ_A is their standard deviation, an original value v of A is normalized to v' using
v' = \frac{v - \bar{A}}{\sigma_A}.    (3.10)
By applying this transformation the attribute values present a mean equal to 0 and a standard deviation of 1.
If the mean and standard deviation associated to the probability distribution are
not available, it is usual to use instead the sample mean and standard deviation:
\bar{A} = \frac{1}{n}\sum_{i=1}^{n} v_i,    (3.11)
and
\sigma_A = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(v_i - \bar{A})^2}.    (3.12)
A variation of the z-score normalization, described in [15], uses the mean absolute deviation s_A of A instead of the standard deviation. It is computed as
s_A = \frac{1}{n}\sum_{i=1}^{n}|v_i - \bar{A}|,    (3.13)
and the normalized value is then obtained as
v' = \frac{v - \bar{A}}{s_A}.    (3.14)
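A minimal NumPy sketch of the z-score normalization (Eq. 3.10) and its mean absolute deviation variant (Eq. 3.14), on a hypothetical attribute, is shown below.

```python
# z-score normalization and the mean-absolute-deviation variant.
import numpy as np

v = np.array([12.0, 45.0, 3.0, 27.0, 30.0])

z = (v - v.mean()) / v.std()          # Eq. (3.10), population standard deviation
print(z, z.mean(), z.std())           # mean ~0, standard deviation ~1

s_a = np.abs(v - v.mean()).mean()     # mean absolute deviation, Eq. (3.13)
print((v - v.mean()) / s_a)           # Eq. (3.14)
```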
3.5 Data Transformation
In the previous Sect. 3.4 we have shown some basic transformation techniques to adapt the ranges of the attributes or their distributions to the needs of a DM algorithm. In this section we aim to present the process of creating new attributes, often called transforming the attributes or the attribute set. Data transformation usually combines the original raw attributes using mathematical formulas originating in business models or in purely mathematical derivations.
In the area of scientific discovery and machine control, normalizations may not be enough to adapt the data to improve the generated model. In these cases, aggregating the information contained in various attributes might be beneficial. A family of simple methods that can be used for this purpose are linear transformations. They are based on simple algebraic operations such as sums, averages, rotations, translations and so on. Let A = {A_1, A_2, ..., A_n} be a set of attributes and let B = {B_1, B_2, ..., B_m} be a subset of the complete set of attributes A. A new attribute Z can be built by applying the following expression:
Z = r_1B_1 + r_2B_2 + \cdots + r_mB_m,    (3.16)
where each r_i is a real number. These kinds of transformations have been thoroughly studied and can help to transform data to make it separable.
Table 3.1 shows an example of how quadratic transformations can help us to reveal knowledge that is not explicitly present in the initial attributes of the data set. Let us consider a set of conic sections described by the coefficients of the following algebraic expression:
A_1x^2 + A_2xy + A_3y^2 + A_4x + A_5y + A_6 = 0.    (3.19)
From Eq. (3.19) there is no obvious interpretation that can be used to label the tuples with a type of conic section. However, computing the quadratic transformation known as the discriminant, given by
Z = A_2^2 - 4A_1A_3,    (3.20)
the sign of Z provides enough information to correctly label the tuples. Without the new derived attribute Z we would not be able to classify them.
50 3 Data Preparation Basic Models
Table 3.1 Applying quadratic transformations to identify the implicit conic figure
A1    A2    A3    A4    A5    A6    Z       Sign   Conic
1     0     42    2     34    −33   −168    −      Ellipse
−5    0     0     −64   29    −68   0       0      Parabola
88    −6    −17   −79   97    −62   6,020   +      Hyperbola
30    0     0     −53   84    −14   0       0      Parabola
1     19    −57   99    38    51    589     +      Hyperbola
15    −39   35    98    −52   −40   −579    −      Ellipse
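A minimal sketch of the discriminant transformation of Eq. (3.20), applied to the first two rows of Table 3.1, is shown below (degenerate conics are ignored for simplicity).

```python
# Labeling a conic section from the sign of the discriminant Z (Eq. 3.20).
def conic_label(a1, a2, a3, a4, a5, a6):
    z = a2 ** 2 - 4 * a1 * a3        # Eq. (3.20); only a1, a2, a3 matter for Z
    if z < 0:
        return z, "Ellipse"
    if z == 0:
        return z, "Parabola"
    return z, "Hyperbola"

print(conic_label(1, 0, 42, 2, 34, -33))    # (-168, 'Ellipse')
print(conic_label(-5, 0, 0, -64, 29, -68))  # (0, 'Parabola')
```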
Although the example given above indicates that transformations are necessary for knowledge discovery in certain scenarios, we usually do not have any clue about how such a transformation can be found or when it should be applied. As [24] indicates, the best source for obtaining the correct transformation is usually expert knowledge. Sometimes the best transformation can be discovered by brute force (see Sect. 3.5.4).
Sometimes polynomial transformations, including the linear and quadratic ones, are not enough to create new attributes able to facilitate the KDD task. In other words, each problem requires its own set of transformations, and such transformations can adopt any form. For instance, let us consider several triangles in the plane described by the (X, Y) coordinates of their vertices, as shown in Table 3.2. Considering only these attributes does not provide us with any information about the relationship between the triangles, but by computing the lengths of the segments given by
A = \sqrt{(X_1 - X_2)^2 + (Y_1 - Y_2)^2}    (3.21)
B = \sqrt{(X_2 - X_3)^2 + (Y_2 - Y_3)^2}    (3.22)
C = \sqrt{(X_1 - X_3)^2 + (Y_1 - Y_3)^2}    (3.23)
we can observe that all the segments are of the same length. Obtaining this conclusion from the original attributes was impossible, and this example is useful to illustrate that some specific, non-polynomial attribute transformations are needed, but they are also highly dependent on the problem domain. Selecting the appropriate transformation is not easy and expert knowledge is usually the best alternative for doing so.
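A minimal sketch of the segment-length transformation of Eqs. (3.21)–(3.23) for one hypothetical triangle (not a row of Table 3.2) is given below.

```python
# Deriving the segment lengths A, B, C from the vertex coordinates of a triangle.
import math

x1, y1, x2, y2, x3, y3 = 0.0, 0.0, 3.0, 4.0, -4.0, 3.0   # hypothetical vertices
a = math.hypot(x1 - x2, y1 - y2)   # Eq. (3.21)
b = math.hypot(x2 - x3, y2 - y3)   # Eq. (3.22)
c = math.hypot(x1 - x3, y1 - y3)   # Eq. (3.23)
print(a, b, c)                     # 5.0, ~7.07, 5.0
```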
In the last two sections we have observed that specific transformations may be needed to extract knowledge from a data set. However, help from an expert is not always available to dictate the form of the transformation and the attributes to use. In [24] the author shows that, when no knowledge is available, a transformation f can be approximated via a polynomial transformation using a brute-force search, one degree at a time.
Given a set of attributes X_1, X_2, ..., X_n, we want to compute a derived attribute Y from such a set of already existing features. We set the attribute Y as a function of the original attributes:
Y = f(X_1, X_2, \ldots, X_n).    (3.24)
Please note that f can be any kind of function. Since the number of instances is finite, f can be expressed as a polynomial approximation. Each tuple X_i = (X_1, X_2, ..., X_n) can be considered as a point in Euclidean space. Using the Weierstrass approximation theorem, there is a polynomial function f that takes the value Y_i for each instance X_i.
There are as many polynomials verifying Y = f(X) as we want, differing in their expression but with identical output for the points (i.e., the instances) given in the data set. As the number of instances in the data set increases, the approximations will be better. We can consider two different cases for our approximations: the case in which the intrinsic transformation is not a polynomial and the case in which it is intrinsically polynomial.
For the case where the intrinsic transformation is not polynomial, let us take as an example the first two columns of Table 3.2, representing two vertices of a triangle, as the input tuples X, and the segment length between them as the attribute Y that we want to model. Adding more examples of two points and their distance, a polynomial approximation obtained using MATLAB is:
Here r_i denotes the rank of observation i and Φ the cumulative normal distribution function. This transformation is useful to obtain a new variable that is very likely to behave like a normally distributed one. However, this transformation cannot be applied separately to the training and test partitions [30]. Therefore, this transformation is only recommended when the test and training data are the same.
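The exact rank-to-normal mapping used in [30] is not reproduced in this excerpt; the sketch below uses one common variant (van der Waerden scores, an assumption on our part), which maps the rank r_i of each observation to the normal quantile Φ⁻¹(r_i / (n + 1)).

```python
# Rank-based normal scores (van der Waerden variant) for a hypothetical attribute.
import numpy as np
from scipy import stats

x = np.array([3.1, 150.0, 7.4, 0.2, 12.9, 55.0])
ranks = stats.rankdata(x)                            # r_i, ties get average ranks
normal_scores = stats.norm.ppf(ranks / (len(x) + 1)) # approximately normal values
print(normal_scores)
```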
A big drawback when selecting the optimal transformation for an attribute is that we do not know in advance which transformation will best improve the model performance. The Box-Cox transformation aims to transform a continuous variable into an almost normal distribution. As [30] indicates, this can be achieved by mapping the values using the following set of transformations:
y = \begin{cases} (x^\lambda - 1)/\lambda, & \lambda \neq 0 \\ \log(x), & \lambda = 0 \end{cases}    (3.29)
All linear, inverse, quadratic and similar transformations are special cases of the
Box-Cox transformations. Please note that all the values of variable x in Eq. (3.29)
must be positive. If we have negative values in the attribute we must add a parameter
c to offset such negative values:
y = \begin{cases} ((x + c)^\lambda - 1)/(g\lambda), & \lambda \neq 0 \\ \log(x + c)/g, & \lambda = 0 \end{cases}    (3.30)
The parameter g is used to scale the resulting values, and it is often considered as the
geometric mean of the data. The value of λ is iteratively found by testing different
values in the range from −3.0 to 3.0 in small steps until the resulting attribute is as
close as possible to the normal distribution.
In [30] a likelihood function to be maximized depending on the value of λ is
defined based on the work of Johnson and Wichern [19]. This function is computed
as:
L(\lambda) = -\frac{m}{2}\ln\left[\frac{1}{m}\sum_{j=1}^{m}(y_j - \bar{y})^2\right] + (\lambda - 1)\sum_{j=1}^{m}\ln x_j,    (3.31)
where y_j is the transformation of the value x_j using Eq. (3.29), and \bar{y} is the mean of the transformed values.
Fig. 3.1 Example of the histogram spreading made by a Box-Cox transformation: (a) before the transformation and (b) after the transformation
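A minimal sketch of the Box-Cox transformation using SciPy is given below; SciPy selects λ by maximizing a log-likelihood of the same form as Eq. (3.31), and the skewed input data is hypothetical.

```python
# Box-Cox transformation of a positive, right-skewed attribute (Eq. 3.29).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.lognormal(mean=0.0, sigma=0.8, size=500)   # positive, right-skewed values

y, best_lambda = stats.boxcox(x)                   # transformed values and lambda
print("lambda =", best_lambda)
print("skewness before:", stats.skew(x), "after:", stats.skew(y))
```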
The presence of nominal attributes in the data set can be problematic, especially if the DM algorithm used cannot handle them correctly. This is the case for SVMs and ANNs. The first option is to transform the nominal variable into a numeric one, in which each nominal value is encoded by an integer, typically starting from 0 or 1 onwards. Although simple, this approach has two big drawbacks that discourage it:
• With this transformation we assume an ordering of the attribute values, as the integer values are ranked, whereas the original nominal values did not present any ranking among them.
• The integer values can be used in operations as numbers, whereas the nominal values cannot. This is even worse than the first point, as with this nominal-to-integer transformation we are establishing unequal differences between pairs of nominal values, which is not correct.
In the previous sections we have analyzed the processes used to transform or create new attributes from the existing ones. However, when the data set is very large, performing complex analyses and DM can take a long computing time. Data reduction techniques are applied in these domains to reduce the size of the data set while trying to maintain the integrity and the information of the original data set as much as possible. In this way, mining the reduced data set will be much more efficient and will also resemble the results that would have been obtained using the original data set.
The main strategies to perform data reduction are Dimensionality Reduction (DR) techniques. They aim to reduce the number of attributes or instances available in the data set. Well-known attribute reduction techniques are wavelet transforms or Principal Component Analysis (PCA); Chap. 7 is devoted to attribute DR. Many techniques can also be found for reducing the number of instances, like the use of clustering techniques, parametric methods and so on; the reader will find a complete survey of Instance Selection (IS) techniques in Chap. 8. The use of binning and discretization techniques is also useful to reduce the dimensionality and complexity of the data set. They convert numerical attributes into nominal ones, thus drastically reducing the cardinality of the attributes involved. Chapter 9 presents a thorough presentation of these discretization techniques.
References
1. Agrawal, R., Srikant, R.: Searching with numbers. IEEE Trans. Knowl. Data Eng. 15(4), 855–
870 (2003)
2. Berry, M.J., Linoff, G.: Data Mining Techniques: For Marketing, Sales, and Customer Support.
Wiley, New York (1997)
56 3 Data Preparation Basic Models
3. Cochinwala, M., Kurien, V., Lalk, G., Shasha, D.: Efficient data reconciliation. Inf. Sci. 137(1–
4), 1–15 (2001)
4. Cohen, W.W.: Integration of heterogeneous databases without common domains using queries
based on textual similarity. In: Proceedings of the 1998 ACM SIGMOD International Confer-
ence on Management of Data. SIGMOD ’98, pp. 201–212. New York (1998)
5. Dey, D., Sarkar, S., De, P.: Entity matching in heterogeneous databases: A distance based deci-
sion model. In: 31st Annual Hawaii International Conference on System Sciences (HICSS’98),
pp. 305–313 (1998)
6. Do, H.H., Rahm, E.: Matching large schemas: approaches and evaluation. Inf. Syst. 32(6),
857–885 (2007)
7. Doan, A., Domingos, P., Halevy, A.Y.: Reconciling schemas of disparate data sources: A
machine-learning approach. In: Proceedings of the 2001 ACM SIGMOD International Con-
ference on Management of Data, SIGMOD ’01, pp. 509–520 (2001)
8. Doan, A., Domingos, P., Halevy, A.: Learning to match the schemas of data sources: a multi-
strategy approach. Mach. Learn. 50, 279–301 (2003)
9. Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: a survey. IEEE
Trans. Knowl. Data Eng. 19(1), 1–16 (2007)
10. Fellegi, I.P., Sunter, A.B.: A theory for record linkage. J. Am. Stat. Assoc. 64, 1183–1210
(1969)
11. Gill, L.E.: OX-LINK: The Oxford medical record linkage system. In: Proceedings of the Inter-
national Record Linkage Workshop and Exposition, pp. 15–33 (1997)
12. Gravano, L., Ipeirotis, P.G., Jagadish, H.V., Koudas, N., Muthukrishnan, S., Pietarinen, L.,
Srivastava, D.: Using q-grams in a DBMS for approximate string processing. IEEE Data Engi-
neering Bull. 24(4), 28–34 (2001)
13. Guha, S., Koudas, N., Marathe, A., Srivastava, D.: Merging the results of approximate match
operations. In: Nascimento, M.A., Zsu, M.T., Kossmann, D., Miller, R.J., Blakeley, J.A.,
Schiefer, K.B. (eds.) VLDB. Morgan Kaufmann, San Francisco (2004)
14. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data
mining software: an update. SIGKDD Explor. Newsl. 11(1), 10–18 (2009)
15. Han, J., Kamber, M., Pei, J.: Data mining: concepts and techniques. The Morgan Kaufmann
Series in Data Management Systems, 2nd edn. Morgan Kaufmann, San Francisco (2006)
16. Hulse, J., Khoshgoftaar, T., Huang, H.: The pairwise attribute noise detection algorithm. Knowl.
Inf. Syst. 11(2), 171–190 (2007)
17. Jaro, M.A.: Unimatch: A record linkage system: User’s manual. Technical report (1976)
18. Joachims, T.: Advances in kernel methods. In: Making Large-scale Support Vector Machine
Learning Practical, pp. 169–184. MIT Press, Cambridge (1999)
19. Johnson, R.A., Wichern, D.W.: Applied Multivariate Statistical Analysis. Prentice-Hall, Engle-
wood Cliffs (2001)
20. Kim, W., Choi, B.J., Hong, E.K., Kim, S.K., Lee, D.: A taxonomy of dirty data. Data Min.
Knowl. Disc. 7(1), 81–99 (2003)
21. Koudas, N., Marathe, A., Srivastava, D.: Flexible string matching against large databases in
practice. In: Proceedings of the Thirtieth International Conference on Very Large Data Bases,
VLDB ’04, vol. 30, pp. 1078–1086. (2004)
22. Kukich, K.: Techniques for automatically correcting words in text. ACM Comput. Surv. 24(4),
377–439 (1992)
23. Levenshtein, V.: Binary codes capable of correcting deletions, insertions and reversals. Sov. Phys.
Doklady 163, 845–848 (1965)
24. Lin, T.Y.: Attribute transformations for data mining I: theoretical explorations. Int. J. Intell.
Syst. 17(2), 213–222 (2002)
25. McCallum, A., Wellner, B.: Conditional models of identity uncertainty with application to
noun coreference. Advances in Neural Information Processing Systems 17, pp. 905–912. MIT
Press, Cambridge (2005)
26. Monge, A.E., Elkan, C.: The field matching problem: algorithms and applications. In: Simoudis,
E., Han, J., Fayyad, U.M. (eds.) Proceedings of the Second International Conference on Knowl-
edge Discovery and Data Mining (KDD-96), pp. 267–270. KDD, Portland, Oregon, USA (1996)
27. Philips, L.: Hanging on the metaphone. Comput. Lang. Mag. 7(12), 39–44 (1990)
28. Pyle, D.: Data Preparation for Data Mining. Morgan Kaufmann, San Francisco (1999)
29. Ravikumar, P., Cohen, W.W.: A hierarchical graphical model for record linkage. In: Proceedings
of the 20th Conference on Uncertainty in Artificial Intelligence, UAI ’04, pp. 454–461 (2004)
30. Refaat, M.: Data Preparation for Data Mining Using SAS. Morgan Kaufmann, San Francisco
(2007)
31. Ristad, E.S., Yianilos, P.N.: Learning string edit distance. IEEE Trans. Pattern Anal. Mach.
Intell. 20(5), 522–532 (1998)
32. Singla, P., Domingos, P.: Multi-relational record linkage. In: KDD-2004 Workshop on Multi-
Relational Data Mining, pp. 31–48 (2004)
33. Verykios, V.S., Elmagarmid, A.K., Houstis, E.N.: Automating the approximate record-
matching process. Inf. Sci. 126(1–4), 83–98 (2000)
34. Winkler, W.E.: Improved decision rules in the Fellegi-Sunter model of record linkage. Technical
report, Statistical Research Division, U.S. Census Bureau, Washington, DC (1993)
35. Zhang, S., Zhang, C., Yang, Q.: Data preparation for data mining. Appl. Artif. Intell. 17(5–6),
375–381 (2003)
Chapter 4
Dealing with Missing Values
Abstract In this chapter the reader is introduced to the approaches used in the literature to tackle the presence of Missing Values (MVs). In real-life data, information is frequently lost in data mining due to the presence of missing values in attributes. Several schemes have been studied to overcome the drawbacks produced by missing values in data mining tasks; one of the most well known is based on preprocessing, formally known as imputation. After the introduction in Sect. 4.1, the chapter begins with the theoretical background, which analyzes the underlying distribution of the missingness, in Sect. 4.2. From this point on, the successive sections go from the simplest approaches in Sect. 4.3 to the most advanced proposals, focusing on the imputation of the MVs. The scope of such advanced methods includes the classic maximum likelihood procedures, like Expectation-Maximization or Multiple Imputation (Sect. 4.4), and the latest Machine Learning based approaches, which use algorithms for classification or regression in order to accomplish the imputation (Sect. 4.5). Finally, a comparative experimental study is carried out in Sect. 4.6.
4.1 Introduction
Many existing industrial and research data sets contain MVs in their attribute values. Intuitively, a MV is just a value for an attribute that was not introduced or was lost in the recording process. There are various reasons for their existence, such as manual data entry procedures, equipment errors and incorrect measurements. The presence of such imperfections usually requires a preprocessing stage in which the data is prepared and cleaned [71], in order to be useful to and sufficiently clear for the knowledge extraction process. The simplest way of dealing with MVs is to discard the examples that contain them. However, this method is practical only when the data contains a relatively small number of examples with MVs and when analysis of the complete examples will not lead to serious bias during the inference [54].
MVs make performing data analysis difficult. The presence of MVs can also pose
serious problems for researchers. In fact, inappropriate handling of the MVs in the
analysis may introduce bias and can result in misleading conclusions being drawn
from a research study, and can also limit the generalizability of the research findings
[96]. Three types of problems are usually associated with MVs in DM [5]:
1. loss of efficiency;
2. complications in handling and analyzing the data; and
3. bias resulting from differences between missing and complete data.
Recently some authors have tried to estimate how many MVs are needed to noticeably
harm the prediction accuracy in classification [45].
Usually the treatment of MVs in DM can be handled in three different ways [27]:
• The first approach is to discard the examples with MVs in their attributes. Therefore
deleting attributes with elevated levels of MVs is included in this category too.
• Another approach is the use of maximum likelihood procedures, where the para-
meters of a model for the complete portion of the data are estimated, and later used
for imputation by means of sampling.
• Finally, the imputation of MVs is a class of procedures that aims to fill in the MVs with estimated values. In most cases, a data set's attributes are not independent from each other. Thus, through the identification of relationships among attributes, MVs can be determined.
We will focus our attention on the use of imputation methods. A fundamental advantage of this approach is that the MV treatment is independent of the learning algorithm used. For this reason, the user can select the most appropriate method for each situation faced. There is a broad family of imputation methods, from simple imputation techniques like mean substitution, KNN, etc., to those which analyze the relationships between attributes, such as SVM-based, clustering-based and logistic regression approaches, maximum likelihood procedures and multiple imputation [6, 26].
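A minimal sketch of two of the simple imputation strategies mentioned above (mean substitution and KNN-based imputation), using scikit-learn on hypothetical data with missing values encoded as NaN, is shown below.

```python
# Mean substitution and KNN imputation of missing values.
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [5.0, 4.0, 9.0],
              [7.0, 8.0, 12.0]])

print(SimpleImputer(strategy="mean").fit_transform(X))   # mean substitution
print(KNNImputer(n_neighbors=2).fit_transform(X))        # KNN-based imputation
```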
The use of imputation methods for MVs is a task with a well-established background. It is possible to trace the first formal studies back several decades. The work of [54] laid the foundation of further work in this topic, especially in statistics. From this work, imputation techniques based on sampling from estimated data distributions followed, distinguishing between single imputation procedures (like Expectation-Maximization (EM) procedures [81]) and multiple imputation ones [82], the latter being more reliable and powerful but more difficult and restrictive to apply.
These imputation procedures became very popular for quantitative data, and therefore they were easily adopted in other fields of knowledge, like bioinformatics [49, 62, 93], climatic science [85], medicine [94], etc. The imputation methods proposed in each field are adapted to the common characteristics of the data analyzed in it. With the popularization of the DM field, many studies on the treatment of MVs arose in this topic, particularly in the classification task. Some of the existing imputation procedures from other fields have been adapted to be used in classification, for example by adapting them to deal with qualitative data, while many specific approaches have also been proposed.
4.2 Assumptions and Missing Data Mechanisms
$$P(X \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta), \qquad (4.1)$$
where f is the probability function for a single case and θ represents the parameters
of the model that yield such a particular instance of data. The main problem is that the particular parameter values θ for the given data are very rarely known. For this reason, authors usually overcome this problem by considering distributions that are commonly found in nature and whose properties are well known. The three distributions that stand out among these are:
1. the multivariate normal distribution in the case of only real valued parameters;
2. the multinomial model for cross-classified categorical data (including loglinear
models) when the data consists of nominal features; and
3. mixed models for combined normal and categorical features in the data [50, 55].
If we call Xobs the observed part of X and we denote the missing part as Xmis so
that X = (Xobs , Xmis ), we can provide a first intuitive definition of what missing at
random (MAR) means. Informally speaking, when the probability that an observation is missing may depend on Xobs but not on Xmis, we can state that the missing data is missing at random.
In the case of MAR missing data mechanism, given a particular value or val-
ues for a set of features belonging to Xobs , the distribution of the rest of features
is the same among the observed cases as it is among the missing cases. Follow-
ing Schafer's example based on [79], let us suppose that we have an n × p matrix B of indicator variables whose values are 1 or 0 when the corresponding elements of X are observed or missing, respectively. The distribution of B should be related to X and to some unknown parameters ζ, so we have a probability model for B described by P(B|X, ζ). Having a MAR assumption means that this distribution does not depend on Xmis:
$$P(B \mid X_{obs}, X_{mis}, \zeta) = P(B \mid X_{obs}, \zeta). \qquad (4.2)$$
Please be aware that MAR does not imply that the missing data values constitute just another possible sample from the probability distribution. This condition is known as missing completely at random (MCAR). Actually MCAR is a special case of MAR in which the distribution of an example having a MV for an attribute does not depend on either the observed or the unobserved data. Following the previous notation, under MCAR we can say that
$$P(B \mid X_{obs}, X_{mis}, \zeta) = P(B \mid \zeta).$$
Although there will generally be some loss of information, comparable results can be
obtained with missing data by carrying out the same analysis that would have been
performed with no MVs. In practice this means that, under MCAR, the analysis of
only those units with complete data gives valid inferences.
Please note that MCAR is more restrictive than MAR. MAR requires only that
the MVs behave like a random sample of all values in some particular subclasses
defined by observed data. In such a way, MAR allows the probability of a missing
datum to depend on the datum itself, but only indirectly through the observed values.
Recently a software package has been published in which the MCAR condition can
be tested [43].
A third case arises when MAR does not apply, as the MV depends on both the rest of the observed values and the missing value itself. That is, the probability model P(B|Xobs, Xmis, ζ) cannot be simplified any further and is the actual probability estimation. This model is usually called not missing at
random (NMAR) or missing not at random (MNAR) in the literature. This model of
missingness is a challenge for the user as the only way to obtain an unbiased estimate
is to model the missingness as well. This is a very complex task in which we should
create a model accounting for the missing data that should be later incorporated to
a more complex model used to estimate the MVs. However, even when we cannot
account for the missingness model, the introduced bias may be still small enough.
In [23] the reader can find an example of how to perform this.
In this section we introduce the most simplistic methods used to deal with MVs. As
they are very simple, they usually do not take into account the missingness mechanism
and they blindly perform the operation.
The simplest approach is do not impute (DNI). As its name indicates, all the MVs are left untouched, so the DM algorithm must rely on its own default MV strategy, if it has one. Often the objective is to verify whether imputation methods allow the classification methods to perform better than when using the original data sets. As a guideline, in [37] a previous study of imputation methods is presented. As an alternative, for those learning methods that cannot deal with an explicit MV notation (a special value, for instance), another approach is to convert the MVs to a new value (encoding them as a new numerical value), but such a simplistic method has been shown to lead to serious inference problems [82].
A very common approach in the specialized literature, even nowadays, is to apply
case deletion or ignore missing (IM). Using this method, all instances with at least
one MV are discarded from the data set. Although IM often results in a substantial
decrease in the sample size available for the analysis, it does have important advan-
tages. In particular, under the assumption that data is MCAR, it leads to unbiased
parameter estimates. Unfortunately, even when the data are MCAR there is a loss in
power using this approach, especially if we have to rule out a large number of sub-
jects. And when the data is not MCAR, it biases the results. For example when low
income individuals are less likely to report their income level, the resulting mean
is biased in favor of higher incomes. The alternative approaches discussed below
should be considered as a replacement for IM.
Often seen as a good choice, the substitution of the MVs with the global most common attribute value for nominal attributes, and the global average value for numerical attributes (MC) [36], is widely used, especially when many instances in the data set contain MVs and applying case deletion (IM) would result in a very reduced and unrepresentative preprocessed data set. This method is very simple: for nominal attributes, the MV is replaced with the most common attribute value, and numerical values are replaced with the average of all values of the corresponding attribute.
A variant of MC is the concept most common attribute value for nominal attributes,
and concept average value for numerical attributes (CMC) [36]. As in MC, the MV is replaced by the most repeated value if nominal or by the mean value if numerical, but considering only the instances with the same class as the reference instance.
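To make the two strategies concrete, the following minimal sketch (ours, not from the book) implements MC and CMC for a pandas DataFrame; the column name "class" and the helper function name are our own assumptions.

```python
# Minimal sketch of MC (global mode/mean) and CMC (per-class mode/mean) imputation.
# Assumes a pandas DataFrame with the class label in a column named "class".
import pandas as pd


def impute_mc(df, class_col="class", use_class=False):
    """Replace NaNs with the mode (nominal) or mean (numeric) of each column.

    With use_class=True the statistics are computed per class (CMC);
    otherwise they are computed globally over the whole data set (MC).
    """
    out = df.copy()
    groups = out.groupby(class_col) if use_class else [(None, out)]
    for _, part in groups:
        for col in out.columns:
            if col == class_col:
                continue
            if pd.api.types.is_numeric_dtype(out[col]):
                fill = part[col].mean()
            else:
                modes = part[col].mode()
                if modes.empty:
                    continue
                fill = modes.iloc[0]
            out.loc[part.index, col] = out.loc[part.index, col].fillna(fill)
    return out
```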
Older and rarely used DM approaches may also be put under this category. For example, hot deck imputation goes back over 50 years and was used quite successfully by the Census Bureau and others. It is referred to from time to time [84] and thus it is interesting to describe it here, partly for historical reasons and partly because it represents an early approach to replacing missing data.
Hot deck has its origins in the surveys carried out in the USA in the 1940s and 1950s, when most people felt compelled to take part in them. As a consequence little data was missing, and when any records were effectively missing, a random complete
case from the same survey was used to substitute the MVs. This process can be
simulated nowadays by clustering over the complete data, and associating the instance
with a cluster. Any complete example from the cluster can be used to fill in the
MVs [6]. Cold deck is similar to hot deck, but the cases or instances used to fill in the MVs come from a different source. Traditionally this meant that the case used to fill the data was obtained from a different survey. Some authors have recently assessed the limitations imposed on the donors (the instances used to substitute the MVs) [44].
At the same time as Rubin et al. formalized the concept of missing data introduction mechanisms described in Sect. 4.2, they advised against the use of case deletion (IM) as a methodology to deal with MVs. However, using the MC or CMC techniques is not much better than replacing MVs with fixed values, as they completely ignore the mechanisms that yield the data values. In an ideal and rare case where the parameters θ of the data distribution were known, a sample from such a distribution, conditioned on the other attributes' values or not depending on whether MCAR, MAR or NMAR applies, would be a suitable imputed value for the missing one. The problem is that the parameters θ are rarely known and also very hard to estimate [38].
In a simple case such as flipping a coin, P(heads) = θ and P(tails) = 1 − θ. Depending on whether the coin is rigged or not, the value of θ can vary and thus its value is unknown. Our only choice is to flip the coin several times, say n, obtaining h heads and n − h tails. An estimation of θ would then be θ̂ = h/n.
More formally, the likelihood of θ is obtained from a binomial distribution,
$$P(\theta) = \binom{n}{h}\,\theta^{h}(1-\theta)^{n-h}.$$
Our θ̂ = h/n can be proven to be the maximum likelihood estimate of θ.
So the next question arises: to solve a maximum likelihood type problem, can we
analytically maximize the likelihood function? We have shown it can work with one
dimensional Bernoulli problems like the coin toss, and that it also works with one
dimensional Gaussian by finding the μ and σ parameters. To illustrate the latter case
let us assume that we have the samples 1, 4, 7, 9 obtained from a normal distribution
and we want to estimate the population mean for the sake of simplicity, that is, in
this simplistic case θ = μ. The maximum likelihood problem here is to choose a
specific value of μ and compute p(1) · p(4) · p(7) · p(9). Intuitively one can say
that this probability would be very small if we fix μ = 10 and would be higher for
μ = 4 or μ = 5. The value of μ that produces the maximum product of combined probabilities is what we call the maximum likelihood estimate of μ = θ. Again, in our case the maximum likelihood estimate is the sample mean μ = 5.25, and if the variance is added to the problem, it can again be solved using the sample variance as the best estimator.
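As a quick numerical check of this argument (our own illustration, not from the book), the following snippet evaluates the Gaussian log-likelihood of the samples 1, 4, 7, 9 over a grid of candidate means with the variance fixed to 1, and confirms that it peaks at the sample mean 5.25.

```python
import numpy as np

samples = np.array([1.0, 4.0, 7.0, 9.0])
candidates = np.linspace(0.0, 10.0, 1001)           # candidate values of mu
# Gaussian log-likelihood with sigma fixed to 1, up to an additive constant
loglik = [-0.5 * np.sum((samples - mu) ** 2) for mu in candidates]
best_mu = candidates[int(np.argmax(loglik))]
print(best_mu, samples.mean())                      # both print 5.25
```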
In real world data things are not that easy. We can have distributions that are not well behaved, or that have too many parameters, making the actual solution computationally too complex. A likelihood function made of a mixture of 100 100-dimensional Gaussians would yield 10,000 parameters, so direct trial-and-error maximization is not feasible. The way to deal with such complexity is to introduce hidden variables in order to simplify the likelihood function and, in our case as well, to account for MVs. The observed variables are those that can be directly measured from the data, while hidden variables influence the data but are not trivial to measure. An example of an observed variable would be whether it is sunny today, whereas a hidden variable could be the underlying probability P(sunny today|sunny yesterday).
Even simplifying with hidden variables does not allow us to reach the solution in
a single step. The most common approach in these cases would be to use an iterative
approach in which we obtain some parameter estimates, we use a regression technique
to impute the values and repeat. However as the imputed values will depend on the
estimated parameters θ , they will not add any useful information to the process
and can be ignored. There are several techniques to obtain maximum likelihood
estimators. The most well known and simplistic is the EM algorithm presented in
the next section.
The EM algorithm consists of two steps that are repeated until convergence: the expectation (E-step) and the maximization (M-step) steps.
The E-step computes the expectation of log P_θ′(b, a):
$$Q(\theta', \theta) = \sum_{b} P_{\theta}(b \mid a)\,\log P_{\theta'}(b, a), \qquad (4.5)$$
where θ′ are the new distribution parameters. Please note that we are using the log. The reason for this is that we need to multiply the probabilities of each observed value for a specific set of parameters. But multiplying several probabilities will soon yield a very small number and thus produce a loss of precision in a computer due to its limited numerical accuracy. A typical solution is then to use the log of these probabilities and to look for the maximum log likelihood. As the logs will be negative, we are looking for the set of parameters whose log likelihood is as close to 0 as possible. In the M-step we try to find the θ′ that maximizes Q.
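Before continuing with the M-step, the following small snippet (ours, purely illustrative) shows the numerical underflow that motivates working with log probabilities.

```python
import numpy as np

probs = np.full(1000, 1e-4)      # 1000 observations, each with probability 1e-4
print(np.prod(probs))            # 0.0 -- the product underflows in floating point
print(np.sum(np.log(probs)))     # about -9210.34, perfectly representable
```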
How can we find the θ′ that maximizes Q? To resolve the M-step, let us review conditional expectation, where A and B are random variables drawn from distributions P(a) and P(b). The conditional distribution is given by $P(b \mid a) = P(b, a)/P(a)$, and then $E[B] = \sum_b P(b)\,b$. For a function h(B) depending on B, the expectation is trivially obtained by $E[h(B)] = \sum_b P(b)\,h(b)$. For a particular value of A (A = a), the conditional expectation is $E[h(B) \mid a] = \sum_b P(b \mid a)\,h(b)$.
Remember that we want to pick a θ′ that maximizes the log likelihood of the observed (a) and the unobserved (b) variables, given the observed variable a and the previous parameters θ. The conditional expectation of log P_θ′(b, a) given a and θ is
$$E[\log P(b, a \mid \theta') \mid a, \theta] = \sum_{b} P(b \mid a, \theta)\,\log P(b, a \mid \theta') \qquad (4.6)$$
$$= \sum_{b} P_{\theta}(b \mid a)\,\log P_{\theta'}(b, a). \qquad (4.7)$$
The key point is that if $\sum_b P_{\theta}(b \mid a)\log P_{\theta'}(b, a) > \sum_b P_{\theta}(b \mid a)\log P_{\theta}(b, a)$, then $P_{\theta'}(a) > P_{\theta}(a)$. If we can improve the expectation of the log likelihood, EM is improving the model of the observed variable a.
In any real world problem, we do not have a single point but a series of data points $x_1, \ldots, x_n$. Assuming they are i.i.d., we can sum over all points to compute the expectation:
$$Q(\theta', \theta) = \sum_{i=1}^{n}\sum_{b} P_{\theta}(b \mid x_i)\,\log P_{\theta'}(b, x_i). \qquad (4.8)$$
The EM algorithm is not perfect: it can get stuck in local maxima and it also depends on the initial θ value. The latter issue is usually resolved by using a bootstrap process in order to choose a correct initial θ. The reader may also have noticed that we have not talked about any imputation yet. The reason is that EM is a meta-algorithm that must be adapted to a particular application.
To use EM for imputation, first we need to choose a plausible set of parameters, that is, we need to assume that the data follows a probability distribution, which is usually seen as a drawback of these kinds of methods. The EM algorithm works better with probability distributions that are easy to maximize, such as Gaussian mixture models. In [85] an EM approach using a multivariate Gaussian is proposed, since multivariate Gaussian data can be parameterized by the mean and the covariance matrix.
In each iteration of the EM algorithm for imputation the estimates of the mean μ
and the covariance Σ are represented by a matrix and revised in three phases. These
parameters are used to apply a regression over the MVs by using the complete data.
In the first phase, for each instance with MVs the regression parameters B for the MVs are calculated from the current estimates of the mean and covariance matrix and the available complete data. Next, the MVs are imputed with their conditional expectation values given the available complete ones and the estimated regression coefficients,
$$\hat{x}_{mis} = \hat{\mu}_{mis} + (x_{obs} - \hat{\mu}_{obs})\,\hat{B} + e, \qquad (4.9)$$
where the instance x of n attributes is separated into the observed values $x_{obs}$ and the missing ones $x_{mis}$. The mean and covariance matrix are also separated in such a way. The residual $e \in \mathbb{R}^{1 \times n_{mis}}$ is assumed to be a random vector with mean zero and unknown covariance matrix. These two phases complete the E-step. Please
note that for the iteration of the algorithm the imputation is not strictly needed as
only the estimates of the mean and covariance matrix are, as well as the regression
parameters. But our ultimate goal is to have our data set filled, so we use the latest
regression parameters to create the best imputed values so far.
In the third phase, the M-step is completed by re-estimating the mean and covariance matrix. The mean is taken as the sample mean of the completed data set, and the covariance is estimated from the sample covariance matrix of the completed data set together with the covariance matrices of the imputation errors, as shown in [54]. That is:
$$\hat{B} = \hat{\Sigma}_{obs,obs}^{-1}\,\hat{\Sigma}_{obs,mis}, \qquad (4.10)$$
$$\hat{C} = \hat{\Sigma}_{mis,mis} - \hat{\Sigma}_{mis,obs}\,\hat{\Sigma}_{obs,obs}^{-1}\,\hat{\Sigma}_{obs,mis}, \qquad (4.11)$$
$$\hat{\mu}^{(t+1)} = \frac{1}{n}\sum_{i=1}^{n} X_i \qquad (4.12)$$
and
$$\hat{\Sigma}^{(t+1)} = \frac{1}{\tilde{n}}\sum_{i=1}^{n}\left(\hat{S}_i^{(t)} - \hat{\mu}^{(t+1)T}\hat{\mu}^{(t+1)}\right), \qquad (4.13)$$
where, for each instance $x = X_i$, the conditional expectation $\hat{S}_i^{(t)}$ of the cross-products is composed of three parts. Two of them are
$$E(x_{obs}^{T} x_{obs} \mid x_{obs};\ \hat{\mu}^{(t)}, \hat{\Sigma}^{(t)}) = x_{obs}^{T} x_{obs} \qquad (4.14)$$
and
$$E(x_{mis}^{T} x_{mis} \mid x_{obs};\ \hat{\mu}^{(t)}, \hat{\Sigma}^{(t)}) = \hat{x}_{mis}^{T}\hat{x}_{mis} + \hat{C}, \qquad (4.15)$$
where the latter is the sum of the cross-product of the imputed values and the residual covariance matrix $\hat{C} = \mathrm{Cov}(x_{mis}, x_{mis} \mid x_{obs};\ \hat{\mu}^{(t)}, \hat{\Sigma}^{(t)})$, the conditional covariance matrix of the imputation error. The normalization constant $\tilde{n}$ of the covariance matrix estimate [Eq. (4.13)] is the number of degrees of freedom of the sample covariance matrix of the completed data set.
The first estimation of the mean and covariance matrix needs to rely on a com-
pletely observed data set. One solution in [85] is to fill the data set with the initial
estimates of the mean and covariance matrices. The process ends when the estimates
of the mean and covariance matrix do not change over a predefined threshold. Please
note that this EM approach is only well suited for numeric data sets, constituting a
limitation for the application of EM, although an extension for mixed numerical and
nominal attributes can be found in [82].
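A minimal, hedged sketch of this regression-based EM imputation for a purely numeric array follows. It is our own illustration of the general scheme described above (conditional-mean imputation from the current μ and Σ, then re-estimation of μ and Σ); it omits the residual-covariance correction of Eq. (4.15) and the regularization used in [85].

```python
import numpy as np


def em_impute(X, n_iter=50, tol=1e-6):
    """Iteratively impute np.nan entries of a numeric 2-D array X."""
    X = np.array(X, dtype=float)
    miss = np.isnan(X)
    filled = np.where(miss, np.nanmean(X, axis=0), X)   # initial fill: column means
    for _ in range(n_iter):
        mu = filled.mean(axis=0)
        sigma = np.cov(filled, rowvar=False) + 1e-6 * np.eye(X.shape[1])  # small ridge
        new = filled.copy()
        for i in range(X.shape[0]):
            mi = miss[i]
            if not mi.any():
                continue
            oi = ~mi
            # Regression coefficients B = Sigma_oo^-1 Sigma_om, as in Eq. (4.10)
            B = np.linalg.solve(sigma[np.ix_(oi, oi)], sigma[np.ix_(oi, mi)])
            # Conditional-mean imputation, Eq. (4.9) with the residual e set to zero
            new[i, mi] = mu[mi] + (filled[i, oi] - mu[oi]) @ B
        if np.max(np.abs(new - filled)) < tol:           # stop when imputations stabilize
            filled = new
            break
        filled = new
    return filled
```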
The EM algorithm is still in use nowadays, but it is usually part of a larger system in which it helps to fit some distributions, as with the GTM neural networks in [95]. Research is still being carried out on EM algorithms, both to alleviate their limitations and to apply them to new fields such as semi-supervised learning [97]. The best known version of EM for real-valued data sets is the one introduced in [85], where the basic EM algorithm presented here is extended with a regularization parameter.
One big problem of the maximum likelihood methods like EM is that they tend
to underestimate the inherent errors produced by the estimation process, formally
standard errors. The Multiple Imputation (MI) approach was designed to take this into
account to be a less biased imputation method, at the cost of being computationally
expensive. MI is a Monte Carlo approach, described very well by [80], in which we generate multiple imputed values from the observed data in a very similar way to the EM algorithm: it fills in the incomplete data by repeatedly solving the observed-data problem. But there is a significant difference between the two methods: while EM generates a single imputation in each step from the parameters estimated at that step, MI performs several imputations that yield several complete data sets.
This repeated imputation can be done thanks to the use of Markov Chain Monte
Carlo methods, as the several imputations are obtained by introducing a random
component, usually from a standard normal distribution. In a more advanced fashion,
MI also considers that the parameters estimates are in fact sample estimates. Thus,
the parameters are not directly estimated from the available data but, as the process
continues, they are drawn from their Bayesian posterior distributions given the data at
hand. These assumptions mean that MI should be applied only when the MCAR or MAR missingness mechanisms hold.
As a result Eq. (4.9) can be applied with slight changes as the e term is now a
sample from a standard normal distribution and is applied more than once to obtain
several imputed values for a single MV. As indicated in the previous paragraph,
MI has a Bayesian nature that forces the user to specify a prior distribution for the parameters θ of the model from which the e term is drawn. In practice, [83] stresses that the results depend more on the choice of the distribution for the data than on the distribution for θ. Unlike the single imputation performed by EM where only one
the distribution for θ . Unlike the single imputation performed by EM where only one
imputed value for each MV is created (and thus only one value of e is drawn), MI will
create several versions of the data set, where the observed data Xobs is essentially the
same, but the imputed values for Xmis will be different in each data set created. This
process is usually known as data augmentation (DA) [91] as depicted in Fig. 4.2.
Surprisingly not many imputation steps are needed. Rubin claims in [80] that only
3–5 steps are usually needed. He states that the efficiency of the final estimation built
upon m imputations is approximately
$$\left(1 + \frac{\gamma}{m}\right)^{-1},$$
where γ is the fraction of missing data in the data set. With 30 % of MVs in each data set, which is quite a high amount, 5 different final data sets achieve an efficiency of 94 %. Increasing the number to m = 10 only raises the efficiency to 97 %, a low gain for doubling the computational effort.
Fig. 4.2 Multiple imputation process by data augmentation. Every MV denoted by a '?' is replaced by several imputed and different values that will be used to continue the process later
To start we need an estimation of the mean and covariance matrices. A good
approach is to take them from a solution provided from an EM algorithm once
their values have stabilized at the end of its execution [83]. Then the DA process
starts by alternately filling the MVs and then making inferences about the unknown
parameters in a stochastic fashion. First, DA creates an imputation of the MVs using the available values of the parameters, and then it draws new parameter values from the Bayesian posterior distribution using the observed and the imputed data. Concatenating this process of simulating the MVs and the parameters is what creates a Markov chain that will converge at some point. The distribution of the parameters θ will stabilize to the posterior distribution averaged over the MVs, and the distribution of the MVs will stabilize to a predictive distribution: the proper distribution needed to draw values for the MIs.
Large rates of MVs in the data sets will cause the convergence to be slow. However,
the meaning of convergence is different to that used in EM. In EM the parameter
estimates have converged when they no longer change from one iteration to the
following by more than a threshold. In DA the distribution of the parameters does not change across iterations, but the random parameter values themselves continue changing, which makes the convergence of DA more difficult to assess than for EM. In [83] the authors propose to reinterpret convergence in DA in terms of lack of serial dependence: DA can be said to have converged by k cycles if the value of any parameter at iteration t ∈ 1, 2, . . . is statistically independent of its value at iteration t + k. As the authors show in [83], the DA algorithm usually converges under these terms in as many or fewer cycles than EM.
The value k is interesting, because it establishes when we should stop performing
the Markov chain in order to have MI that are independent draws from the missing
data predictive distribution. A typical process is to perform m runs, each one of length
k. That is, for each imputation from 1 to m we perform the DA process during k cycles.
It is a good idea not to be too conservative with the k value, as after convergence the
process remains stationary, whereas with low k values the m imputed data sets will
not be truly independent. Remember that we do not need a high m value, so k acts
as the true computational effort measure.
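The control flow of this DA loop can be sketched as follows. This is a heavily simplified illustration of ours: the Bayesian posterior draw of the parameters is only approximated by perturbing the sample mean, so it should not be read as a faithful implementation of DA, just as a picture of the m-runs-of-k-cycles structure.

```python
import numpy as np


def da_multiple_impute(X, m=5, k=20, seed=0):
    """Very simplified DA-style multiple imputation: m chains of k cycles each."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    n, p = X.shape
    imputed_sets = []
    for _ in range(m):                                   # m imputed data sets
        filled = np.where(miss, np.nanmean(X, axis=0), X)
        for _ in range(k):                               # k cycles of the chain
            mu = filled.mean(axis=0)
            sigma = np.cov(filled, rowvar=False) + 1e-6 * np.eye(p)
            # "P-step" (crude approximation): perturb the current mean estimate
            mu_draw = rng.multivariate_normal(mu, sigma / n)
            # "I-step": stochastic imputation given the drawn parameters
            for i in range(n):
                mi, oi = miss[i], ~miss[i]
                if not mi.any():
                    continue
                B = np.linalg.solve(sigma[np.ix_(oi, oi)], sigma[np.ix_(oi, mi)])
                cond_mu = mu_draw[mi] + (filled[i, oi] - mu_draw[oi]) @ B
                cond_cov = sigma[np.ix_(mi, mi)] - sigma[np.ix_(mi, oi)] @ B
                filled[i, mi] = rng.multivariate_normal(cond_mu, cond_cov)
        imputed_sets.append(filled.copy())
    return imputed_sets
```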
Once the m MI data sets have been created, they can be analyzed by any standard
complete-data methods. For example, we can use a linear or logistic regression,
a classifier or any other technique applied to each one of the m data sets, and the
variability of the m results obtained will reflect the uncertainty of MVs. It is common
to combine the results following the rules provided by Rubin [80] that act as measures
of ordinary sample variation to obtain a single inferential statement of the parameters
of interest.
Rubin's rules to obtain an overall set of estimated coefficients and standard errors proceed as follows. Let $\hat{R}$ denote the estimate of interest and $\hat{U}$ its estimated variance, $\hat{R}$ being either an estimated regression coefficient or a kernel parameter of an SVM, whatever applies. Once the MIs have been obtained, we will have the estimates $\hat{R}_1, \hat{R}_2, \ldots, \hat{R}_m$ and their respective variances $\hat{U}_1, \hat{U}_2, \ldots, \hat{U}_m$. The overall estimate, occasionally called the MI estimate, is given by
$$\bar{R} = \frac{1}{m}\sum_{i=1}^{m}\hat{R}_i. \qquad (4.16)$$
The variance of the estimate has two components: the variability within each data set and across data sets. The within-imputation variance is simply the average of the estimated variances:
$$\bar{U} = \frac{1}{m}\sum_{i=1}^{m}\hat{U}_i, \qquad (4.17)$$
whereas the between-imputation variance is the sample variance of the estimates:
$$B = \frac{1}{m-1}\sum_{i=1}^{m}(\hat{R}_i - \bar{R})^2. \qquad (4.18)$$
The total variance T is the corrected sum of these two components, with a factor that accounts for the simulation error in $\bar{R}$:
$$T = \bar{U} + \left(1 + \frac{1}{m}\right)B. \qquad (4.19)$$
The square root of T is the overall standard error associated with $\bar{R}$. In the case of no MVs being present in the original data set, all $\hat{R}_1, \hat{R}_2, \ldots, \hat{R}_m$ would be the same, so B = 0 and $T = \bar{U}$. The magnitude of B with respect to $\bar{U}$ indicates how much information is contained in the missing portion of the data set relative to the observed part.
In [83] the authors elaborate further on the confidence intervals extracted from $\bar{R}$ and on how to test the null hypothesis $R = 0$ by comparing the ratio $\bar{R}/\sqrt{T}$ with a Student's t-distribution with degrees of freedom
$$df = (m - 1)\left(1 + \frac{m\bar{U}}{(m + 1)B}\right)^{2}, \qquad (4.20)$$
in case the reader would like to learn more about how to use this test to check whether the number of imputations m was large enough.
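These pooling rules translate directly into a small helper. The sketch below follows Eqs. (4.16)–(4.20) as reconstructed here; it is our own code, not taken from the book.

```python
import numpy as np


def pool_rubin(estimates, variances):
    """Pool m per-imputation estimates R_i and variances U_i with Rubin's rules."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    R = estimates.mean()                      # overall (MI) estimate, Eq. (4.16)
    U = variances.mean()                      # within-imputation variance, Eq. (4.17)
    B = estimates.var(ddof=1)                 # between-imputation variance, Eq. (4.18)
    T = U + (1.0 + 1.0 / m) * B               # total variance, Eq. (4.19)
    df = (m - 1) * (1.0 + m * U / ((m + 1) * B)) ** 2   # Eq. (4.20); B must be > 0
    return R, T, df
```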
The MI algorithm has been widely used in many research fields. Focusing on DM, methods to increase the robustness of MI [19], to alleviate the parameter selection process [35] and to improve Rubin's rules for aggregating models [86] have been proposed. Extensions to new problems such as one-class classification [48] can be found, as well as hybridizations with innovative techniques such as Gray System Theory [92]. Implementing MI is not trivial, and reputable implementations can be found in statistical packages such as R [9] (see Chap. 10) and Stata [78].
4.4.3.1 PC Regression
For the time being, we consider a situation where there is no MV. PCA represents the variation of D-dimensional example vectors y as a linear combination of principal axis vectors $w_l$ ($1 \le l \le K$) whose number is relatively small ($K < D$):
$$y = \sum_{l=1}^{K} x_l w_l + \epsilon. \qquad (4.21)$$
The linear coefficients $x_l$ ($1 \le l \le K$) are called factor scores and $\epsilon$ denotes the residual error. Using a specifically determined number K, PCA obtains $x_l$ and $w_l$ such that the sum of squared error $\|\epsilon\|^2$ over the whole data set Y is minimized.
When there is no MV, $x_l$ and $w_l$ are calculated as follows. A covariance matrix S for the example vectors $y_i$ ($1 \le i \le N$) is given by
$$S = \frac{1}{N}\sum_{i=1}^{N}(y_i - \mu)(y_i - \mu)^T, \qquad (4.22)$$
where μ is the mean vector of y: $\mu = (1/N)\sum_{i=1}^{N} y_i$. T denotes the transpose of a vector or a matrix. For convenience of description, Y is assumed to be row-wise normalized by a preprocessing step, so that μ = 0 holds. With this normalization, the result of PCA is identical to that of SVD.
Let $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_D$ and $u_1, u_2, \ldots, u_D$ denote the eigenvalues and the corresponding eigenvectors, respectively, of S. We also define the lth principal axis vector by $w_l = \sqrt{\lambda_l}\,u_l$. With these notations, the lth factor score for an example vector
This is a well-known regression problem, and the least square solution is given by
$$p(x) = \mathcal{N}_K(x \mid 0, I_K), \qquad p(\epsilon) = \mathcal{N}_D(\epsilon \mid 0, (1/\tau)I_D),$$
p(θ ) is called a prior distribution, which denotes a priori preference for parameter θ .
The prior distribution is a part of the model and must be defined before estimation.
We assume conjugate priors for τ and μ, and a hierarchical prior for W , namely, the
prior for W , p(W |τ, α), is parameterized by a hyperparameter α ∈ RK .
$$p(\theta \mid \alpha) \equiv p(\mu, W, \tau \mid \alpha) = p(\mu \mid \tau)\,p(\tau)\prod_{j=1}^{K} p(w_j \mid \tau, \alpha_j),$$
$$\mathcal{G}(\tau \mid \bar{\tau}, \gamma_\tau) \equiv \frac{(\gamma_\tau \bar{\tau}^{-1})^{\gamma_\tau}}{\Gamma(\gamma_\tau)}\,\exp\left[-\gamma_\tau \bar{\tau}^{-1}\tau + (\gamma_\tau - 1)\ln\tau\right]$$
If we know the true parameter $\theta_{true}$, the posterior of the MVs is given by
$$q(Y_{miss}) = p(Y_{miss} \mid Y_{obs}, \theta_{true}),$$
which produces an estimation equivalent to the PC regression. Here, $p(Y_{miss} \mid Y_{obs}, \theta_{true})$ is obtained by marginalizing the likelihood (4.24) with respect to the observed variables $Y_{obs}$. If we have the parameter posterior q(θ) instead of the true parameter, the posterior of the MVs is given by
$$q(Y_{miss}) = \int d\theta\, q(\theta)\,p(Y_{miss} \mid Y_{obs}, \theta),$$
which corresponds to the Bayesian PC regression. Since we do not know the true
parameter naturally, we conduct the BPCA. Although the parameter posterior q(θ )
can be easily obtained by the Bayesian estimation when a complete data set Y is
available, we assume that only a part of Y , Yobs , is observed and the rest Ymiss is
missing. In that situation, it is required to obtain q(θ ) and q(Ymiss ) simultaneously.
We use a variational Bayes (VB) algorithm, in order to execute Bayesian esti-
mation for both model parameter θ and MVs Ymiss . Although the VB algorithm
resembles the EM algorithm that obtains maximum likelihood estimators for θ and
Ymiss , it obtains the posterior distributions for θ and Ymiss , q(θ ) and q(Ymiss ), by a
repetitive algorithm.
The VB algorithm is implemented as follows: (a) the posterior distribution of
MVs, q(Ymiss ), is initialized by imputing each of the MVs to instance-wise average;
(b) the posterior distribution of the parameter θ , q(θ ), is estimated using the observed
data Yobs and the current posterior distribution of MVs, q(Ymiss ); (c) the posterior
distribution of the MVs, q(Ymiss ), is estimated using the current q(θ ); (d) the hyperpa-
rameter α is updated using both of the current q(θ ) and the current q(Ymiss ); (e) repeat
(b)–(d) until convergence.
The VB algorithm has been proved to converge to a locally optimal solution.
Although the convergence to the global optimum is not guaranteed, the VB algorithm
for BPCA almost always converges to a single solution. This is probably because
the objective function of BPCA has a simple landscape. As a consequence of the VB
algorithm, therefore, q(θ ) and q(Ymiss ) are expected to approach the global optimal
posteriors.
Then, the MVs in the expression matrix are imputed to the expectation with respect
to the estimated posterior distribution:
$$\hat{Y}_{miss} = \int y_{miss}\, q(Y_{miss})\,dY_{miss}. \qquad (4.25)$$
For nominal values, the most common value among all neighbors is taken, and for numerical values the average value is used. Therefore, a proximity measure between instances needs to be defined. The Euclidean distance (a particular case of an Lp norm distance) is the most commonly used in the literature.
In order to estimate a MV yih in the ith example vector yi by KNNI [6], we first
select K examples whose attribute values are similar to yi . Next, the MV is estimated
as the average of the corresponding entries in the selected K expression vectors.
When there are other MVs in yi and/or yj , their treatment requires some heuristics.
The missing entry $y_{ih}$ is estimated as the average
$$\hat{y}_{ih} = \frac{\sum_{j \in I_K^{ih}} y_{jh}}{|I_K^{ih}|}, \qquad (4.26)$$
where $I_K^{ih}$ is now the index set of the KNN examples of the ith example, and if $y_{jh}$ is missing the jth example is excluded from $I_K^{ih}$. Note that KNNI has no theoretical criterion for selecting the best K value, so K has to be determined empirically.
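A minimal sketch of KNNI for numeric data follows (our own illustration, not from the book): distances are computed over the attributes observed in both instances, and K is a user parameter.

```python
import numpy as np


def knn_impute(X, k=5):
    """Impute np.nan entries of a numeric array with the mean of the k nearest rows."""
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    out = X.copy()
    for i in np.where(miss.any(axis=1))[0]:
        shared = ~miss[i] & ~miss            # attributes observed in both instances
        # Euclidean distance restricted to the shared attributes
        diff = np.where(shared, X[i] - np.nan_to_num(X), np.nan)
        dist = np.sqrt(np.nansum(diff ** 2, axis=1))
        dist[i] = np.inf                     # never use the instance itself
        neighbours = np.argsort(dist)[:k]
        for h in np.where(miss[i])[0]:
            donors = [X[j, h] for j in neighbours if not miss[j, h]]
            if donors:
                out[i, h] = np.mean(donors)
    return out
```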
The weighted KNN method [93] selects the instances with values similar (in terms of distance) to those of the incomplete instance, so it can impute as KNNI does. However, the estimated value now takes into account the different distances to the neighbors, using a weighted mean or the most repeated value according to a similarity measure. The similarity measure $s_i(y_j)$ between two examples $y_i$ and $y_j$ is defined from the Euclidean distance calculated over the attributes observed in both, as follows:
$$1/s_i(y_j) = \sum_{h \in O_i \cap O_j} (y_{ih} - y_{jh})^2, \qquad (4.27)$$
where $O_i$ and $O_j$ denote the sets of attributes observed in $y_i$ and $y_j$, respectively.
In fuzzy clustering, each data object has a membership function which describes the
degree to which this data object belongs to a certain cluster. Now we want to extend
the original K-means clustering method to a fuzzy version to impute missing data
[1, 53]. The reason for applying the fuzzy approach is that fuzzy clustering provides
a better description tool when the clusters are not well-separated, as is the case in
missing data imputation. Moreover, the original K-means clustering may be trapped
in a local minimum status if the initial points are not selected properly. However,
continuous membership values in fuzzy clustering make the resulting algorithms less
susceptible to get stuck in a local minimum situation.
In fuzzy clustering, each data object xi has a membership function which describes
the degree to which this data object belongs to certain cluster vk . The membership
function is defined in the next equation
$$U(v_k, x_i) = \frac{d(v_k, x_i)^{-2/(m-1)}}{\sum_{j=1}^{K} d(v_j, x_i)^{-2/(m-1)}}, \qquad (4.29)$$
where m > 1 is the fuzzifier and $\sum_{j=1}^{K} U(v_j, x_i) = 1$ for any data object
xi (1 ≤ i ≤ N). Now we can not simply compute the cluster centroids by the mean
values. Instead, we need to consider the membership degree of each data object.
Equation (4.30) provides the formula for cluster centroid computation:
$$v_k = \frac{\sum_{i=1}^{N} U(v_k, x_i) \times x_i}{\sum_{i=1}^{N} U(v_k, x_i)}. \qquad (4.30)$$
Since there are unavailable data in incomplete objects, we use only reference
attributes to compute the cluster centroids.
The algorithm for missing data imputation with fuzzy K-means clustering method
also has three processes. Note that in the initialization process, we pick K centroids
which are evenly distributed to avoid local minimum situation. In the second process,
we iteratively update membership functions and centroids until the overall distance
meets the user-specified distance threshold ε. In this process, we cannot assign the
data object to a concrete cluster represented by a cluster centroid (as is done in the basic K-means clustering algorithm), because each data object belongs to all K clusters
with different membership degrees. Finally, we impute non-reference attributes for
each incomplete object. We replace non-reference attributes for each incomplete
data object xi based on the information about membership degrees and the values of
cluster centroids, as shown in the next equation:
$$x_{i,j} = \sum_{k=1}^{K} U(x_i, v_k) \times v_{k,j}, \quad \text{for any non-reference attribute } j \notin R. \qquad (4.31)$$
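The following sketch (ours, not from the book) implements the fuzzy K-means imputation of Eqs. (4.29)–(4.31) for numeric data with np.nan marking the MVs; distances and centroids are computed from observed values only, and K and the fuzzifier m are user parameters.

```python
import numpy as np


def fuzzy_kmeans_impute(X, K=3, m=2.0, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    n, p = X.shape
    lo, hi = np.nanmin(X, axis=0), np.nanmax(X, axis=0)
    centroids = lo + rng.random((K, p)) * (hi - lo)      # spread the initial centroids
    U = np.full((K, n), 1.0 / K)
    for _ in range(n_iter):
        dist = np.empty((K, n))
        for i in range(n):                               # distances on observed attributes
            o = ~miss[i]
            dist[:, i] = np.sqrt(((centroids[:, o] - X[i, o]) ** 2).sum(axis=1)) + 1e-12
        U = dist ** (-2.0 / (m - 1.0))                   # Eq. (4.29), unnormalized
        U /= U.sum(axis=0, keepdims=True)
        Xz = np.where(miss, 0.0, X)
        W = U @ (~miss).astype(float)                    # membership mass per attribute
        centroids = (U @ Xz) / np.maximum(W, 1e-12)      # Eq. (4.30) on observed values
    out = X.copy()
    for i in range(n):
        for j in np.where(miss[i])[0]:
            out[i, j] = (U[:, i] * centroids[:, j]).sum()    # Eq. (4.31)
    return out
```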
attributes (input attributes) and the condition attributes as the decision attributes, so
SVM regression can be used to predict the missing condition attribute values. SVM
regression estimation seeks to estimate functions
based on data
1
l
ε
Remp = |yi − f (xi )|ε (4.35)
l
i=1
and the model complexity term $\|w\|^2$. Here, we use the so-called ε-insensitive loss function
$$|y - f(x)|_{\epsilon} = \max\{0,\ |y - f(x)| - \epsilon\}.$$
The main insight of statistical learning theory is that in order to obtain a small risk, one needs to control both the training error and the model complexity, i.e. explain the data with a simple model. The minimization of Eq. (4.36) is equivalent to the following constrained optimization problem [17]: minimize
$$\tau(w, \xi^{(*)}) = \frac{1}{2}\|w\|^2 + C\,\frac{1}{l}\sum_{i=1}^{l}(\xi_i + \xi_i^*) \qquad (4.37)$$
subject to
$$((w \cdot x_i) + b) - y_i \le \epsilon + \xi_i, \qquad (4.38)$$
$$y_i - ((w \cdot x_i) + b) \le \epsilon + \xi_i^*, \qquad (4.39)$$
$$\xi_i^{(*)} \ge 0, \quad \epsilon \ge 0. \qquad (4.40)$$
The slack variables $\xi_i^{(*)}$ are penalized in the objective function via the regularization parameter C, chosen a priori.
In the ν-SVM the size of ε is not defined a priori but is itself a variable. Its value is traded off against model complexity and slack variables via a constant ν ∈ (0, 1]: minimize
$$\tau(w, \xi^{(*)}, \epsilon) = \frac{1}{2}\|w\|^2 + C\left(\nu\epsilon + \frac{1}{l}\sum_{i=1}^{l}(\xi_i + \xi_i^*)\right) \qquad (4.41)$$
subject to the constraints 4.38–4.40. Using Lagrange multipliers techniques, one can
show [17] that the minimization of Eq. (4.37) under the constraints 4.38–4.40 results
in a convex optimization problem with a global minimum. The same is true for the
optimization problem 4.41 under the constraints 4.38–4.40. At the optimum, the
regression estimate can be shown to take the form
$$f(x) = \sum_{i=1}^{l}(\alpha_i^* - \alpha_i)(x_i \cdot x) + b. \qquad (4.42)$$
In most cases, only a subset of the coefficients $(\alpha_i^* - \alpha_i)$ will be nonzero. The corresponding examples $x_i$ are termed support vectors (SVs). The coefficients and the SVs, as well as the offset b, are computed by the ν-SVM algorithm. In order to move from linear (as in Eq. 4.42) to nonlinear functions, the following generalization can be done: we map the input vectors $x_i$ into a high-dimensional feature space Z through some a priori chosen nonlinear mapping $\Phi: X_i \to Z_i$. We then solve the optimization problem 4.41 in the feature space Z. In this case, the inner product of the input vectors $(x_i \cdot x)$ in Eq. (4.42) is replaced by the inner product of their images in the feature space Z, $(\Phi(x_i) \cdot \Phi(x))$. The calculation of the inner product in a high-dimensional space is computationally very expensive. Nevertheless, under general conditions (see [17] and references therein) these expensive calculations can be reduced significantly by using a suitable function k such that $(\Phi(x_i) \cdot \Phi(x)) = k(x_i, x)$, leading to
$$f(x) = \sum_{i=1}^{l}(\alpha_i^* - \alpha_i)\,k(x_i, x) + b. \qquad (4.44)$$
The nonlinear function k is called a kernel [17]. We mostly use a Gaussian kernel, i.e. $k(x, x') = \exp\left(-\|x - x'\|^2/(2\sigma^2)\right)$.
We can use SVM regression [29] to predict the missing condition attribute values.
In order to do that, first we select the examples in which there are no missing attribute
values. In the next step we set one of the condition attributes (an input attribute), some of whose values are missing, as the decision attribute (output attribute), and, conversely, the decision attributes as the condition attributes. Finally, we use SVM regression to predict the decision attribute values.
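A hedged sketch of this SVM-regression imputation is shown below, using scikit-learn's SVR as the regressor; the library choice and the per-attribute looping are our own and are not prescribed by the book.

```python
import numpy as np
from sklearn.svm import SVR


def svm_impute(X):
    """Impute np.nan entries of a numeric array by fitting one SVR per attribute."""
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    out = X.copy()
    complete = ~miss.any(axis=1)             # rows without any MV are the training set
    for j in range(X.shape[1]):
        rows = np.where(miss[:, j])[0]
        if rows.size == 0:
            continue
        other = [c for c in range(X.shape[1]) if c != j]
        model = SVR(kernel="rbf").fit(X[complete][:, other], X[complete][:, j])
        for i in rows:
            feats = out[i, other]
            if np.isnan(feats).any():        # simple fallback when inputs are also missing
                feats = np.where(np.isnan(feats), np.nanmean(X[:, other], axis=0), feats)
            out[i, j] = model.predict(feats.reshape(1, -1))[0]
    return out
```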
Based on the work of Wong et al. [99], a mixed-mode probability model is approx-
imated by a discrete one. First, we discretize the continuous components using a
minimum loss of information criterion. Treating a mixed-mode feature n-tuple as
a discrete-valued one, the authors propose a new statistical approach for synthe-
sis of knowledge based on cluster analysis: (1) detect from data patterns which
indicate statistical interdependency; (2) group the given data into clusters based on
detected interdependency; and (3) interpret the underlying patterns for each of the
clusters identified. The method of synthesis is based on the authors' event-covering approach. With the developed inference method, we are able to estimate the MVs in
the data.
The cluster initiation process involves the analysis of the nearest neighbour dis-
tance distribution on a subset of samples, the selection of which is based on a mean
probability criterion. Let X = (X1 , X2 , . . . , Xn ) be a random n-tuple of related vari-
ables and x = (x1 , x2 , . . . , xn ) be its realization. Then a sample can be represented
as x. Let S be an ensemble of observed samples represented as n-tuples. The nearest-
neighbour distance of a sample xi to a set of examples S is defined as:
$$D(x_i, S) = \min_{x_j \in S,\ x_j \neq x_i} d(x_i, x_j), \qquad (4.46)$$
where d(xi , xj ) is a distance measure. Since we are using discrete values, we have
adopted the Hamming distance. Let C be a set of examples forming a simple cluster.
We define the maximum within-cluster nearest-neighbour distance as
$$D_c^* = \max_{x_i \in C} D(x_i, C). \qquad (4.47)$$
$D_c^*$ reflects an interesting characteristic of the cluster configuration: the smaller $D_c^*$ is, the denser the cluster.
Using a mean probability criterion to select a similar subset of examples, the
isolated samples can be easily detected by observing the wide gaps in the nearest-
neighbour distance space. The probability distribution from which the criterion is
derived for the samples can be estimated using a second-order probability estimation.
An estimation of P(x) known as the dependence tree product approximation can be
expressed as:
$$P(x) = \prod_{j=1}^{n} P(x_{m_j} \mid x_{m_{k(j)}}), \qquad 0 \le k(j) < j, \qquad (4.48)$$
where |S| is the number of samples in S. For more details in the probability estimation
with dependence tree product approximation, please refer to [13].
When distance is considered for cluster initiation, we can use the following criteria
in assigning a sample x to a cluster.
1. If there exists more than one cluster, say Ck |k = 1, 2, . . ., such that D(x, Ck ) ≤ D∗
for all k, then all these clusters can be merged together.
2. If exactly one cluster Ck exists, such that D(x, Ck ) ≤ D∗ , then x can be grouped
into Ck .
3. If D(x, Ck) > D∗ for all clusters Ck, then x may not belong to any cluster.
To avoid including distance calculation of outliers, we use a simple method suggested
in [99] which assigns D∗ the maximum value of all nearest-neighbor distances in L
provided there is a sample in L having a nearest-neighbor distance value of D∗ − 1
(with the distance values rounded to the nearest integer value).
After finding the initial clusters along with their membership, the regrouping
process is thus essentially an inference process for estimating the cluster label of
a sample. Let $C = \{a_{c_1}, a_{c_2}, \ldots, a_{c_q}\}$ be the set of labels of all possible clusters to which x can be assigned. For $X_k$ in X, we can form a contingency table between $X_k$ and C. Let $a_{ks}$ and $a_{cj}$ be possible outcomes of $X_k$ and C respectively, and let $obs(a_{ks})$ and $obs(a_{cj})$ be the respective marginal frequencies of their observed occurrences. The expected relative frequency of $(a_{ks}, a_{cj})$ is expressed as:
$$exp(a_{ks}, a_{cj}) = \frac{obs(a_{ks}) \times obs(a_{cj})}{|S|}. \qquad (4.50)$$
Let $obs(a_{ks}, a_{cj})$ represent the actual observed frequency of $(a_{ks}, a_{cj})$ in S. The expression
$$D = \sum_{j=1}^{q} \frac{\left(obs(a_{ks}, a_{cj}) - exp(a_{ks}, a_{cj})\right)^2}{exp(a_{ks}, a_{cj})} \qquad (4.51)$$
We call Ekc the covered event subset of Xk with respect to C. Likewise, the covered
event subset Eck of C with respect to Xk can be defined. After finding the covered
event subsets of Eck and Ekc between a variable pair (C, Xk ), information measures
can be used to detect the statistical pattern of these subsets. An interdependence
redundancy measure between Xkc and C k can be defined as
$$R(X_k^c, C^k) = \frac{I(X_k^c, C^k)}{H(X_k^c, C^k)}, \qquad (4.54)$$
where $I(X_k^c, C^k)$ is the expected MI and $H(X_k^c, C^k)$ is the Shannon entropy, defined respectively on $X_k^c$ and $C^k$ as:
$$I(X_k^c, C^k) = \sum_{a_{cu} \in E_c^k}\ \sum_{a_{ks} \in E_k^c} P(a_{cu}, a_{ks})\,\log \frac{P(a_{cu}, a_{ks})}{P(a_{cu})\,P(a_{ks})} \qquad (4.55)$$
and
$$H(X_k^c, C^k) = -\sum_{a_{cu} \in E_c^k}\ \sum_{a_{ks} \in E_k^c} P(a_{cu}, a_{ks})\,\log P(a_{cu}, a_{ks}). \qquad (4.56)$$
The variable $X_k$ is selected as interdependent with C if
$$R(X_k^c, C^k) \ge \frac{\chi^2_{df}}{2|S|\,H(X_k^c, C^k)}, \qquad (4.57)$$
where df is the corresponding number of degrees of freedom, having the value $(|E_c^k| - 1)(|E_k^c| - 1)$. A chi-squared test is then used to select the interdependent variables in X at a presumed significance level.
The cluster regrouping process uses an information measure to regroup data itera-
tively. Wong et al. have proposed an information measure called normalized surprisal
(NS) to indicate significance of joint information. Using this measure, the informa-
tion conditioned by an observed event xk is weighted according to R(Xkc , C K ), their
measure of interdependency with the cluster label variable. Therefore, the higher the
interdependency of a conditioning event, the more relevant the event is. NS measures
the joint information of a hypothesized value based on the selected set of significant
components. It is defined as
$$NS(a_{cj} \mid x'(a_{cj})) = \frac{I(a_{cj} \mid x'(a_{cj}))}{\sum_{k=1}^{m} R(X_k^c, C^k)}, \qquad (4.58)$$
where $I(a_{cj} \mid x'(a_{cj}))$ is the summation of the weighted conditional information defined on the incomplete probability distribution scheme as
$$I(a_{cj} \mid x'(a_{cj})) = \sum_{k=1}^{m} R(X_k^c, C^k)\,I(a_{cj} \mid x_k) = \sum_{k=1}^{m} R(X_k^c, C^k)\left(-\log \frac{P(a_{cj} \mid x_k)}{\sum_{a_{cu} \in E_c^k} P(a_{cu} \mid x_k)}\right). \qquad (4.59)$$
Once the clusters are stable, we take the examples with MVs. Now we use the distance $D(x_i, S) = \min_{x_j \in S,\ x_j \neq x_i} d(x_i, x_j)$ to find the nearest cluster $C_i$ to such an instance. From this cluster we compute the centroid $\bar{x}$ over all the instances $x_i$ of the cluster $C_i$. Once the centroid is obtained, the MV of the example is imputed with the value of the corresponding attribute of the centroid $\bar{x}$.
For this reason the method has the “local” connotation. There are two steps in the LLSI. The
first step is to select k instances by the L2 -norm. The second step is regression and
estimation, regardless of how the k instances are selected. A heuristic k parameter
selection method is used by the authors.
Throughout the section, we will use X ∈ Rm×n to denote a dataset with m attributes
and n instances. Since LLSI was proposed for microarrays, it is assumed that m ≫ n. In the data set X, a row $x_i^T \in \mathbb{R}^{1 \times n}$ represents the expressions of the ith instance in the n examples:
$$X = \begin{pmatrix} x_1^T \\ \vdots \\ x_m^T \end{pmatrix} \in \mathbb{R}^{m \times n}.$$
A MV in the lth position of the ith instance is denoted by $X(i, l) = x_i(l) = \alpha$. For simplicity, we first assume that there is a MV in the first position of the first instance, i.e. $X(1, 1) = x_1(1) = \alpha$. Then the k instances most similar to $x_1$,
$$x_{S_i}^T \in \mathbb{R}^{1 \times n}, \quad 1 \le i \le k,$$
are found for LLSimpute based on the L2-norm. In this process of finding the similar instances, the first component of each instance is ignored due to the fact that $x_1(1)$ is missing. The LLSimpute variant based on Pearson's correlation coefficient to select the k instances can be consulted in [49].
The elements of the vector w are the n − 1 known elements of the instance vector $x_1$, whose missing first item is deleted. After the matrix A and the vectors b and w are formed, the least squares problem is formulated as
$$\min_{z}\ \|A^T z - w\|_2, \qquad (4.63)$$
and the MV is then estimated as
$$\alpha = b^T z = b^T (A^T)^{\dagger} w, \qquad (4.64)$$
where α is the MV and $x_{S_1}^T, \ldots, x_{S_k}^T$ are the instances similar to $x_1^T$. The second to last components of each neighbour instance, $a_i^T$, $1 \le i \le k$, form the ith row vector of the matrix A. The vector w of the known elements of the target instance $x_1$ can be represented as a linear combination
$$w \simeq z_1 a_1 + z_2 a_2 + \cdots + z_k a_k,$$
where $z_i$ are the coefficients of the linear combination, found from the least squares formulation (4.63). Accordingly, the MV α in $x_1$ can be estimated as
$$\alpha = b^T z = b_1 z_1 + b_2 z_2 + \cdots + b_k z_k.$$
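For a single MV, LLSimpute as described by Eq. (4.64) can be sketched as follows (our own illustration, numeric data only; the function name and the fallback choices are ours).

```python
import numpy as np


def llsi_single(X, i, h, k=10):
    """Impute X[i, h] (assumed missing) from the complete rows of X via Eq. (4.64)."""
    X = np.asarray(X, dtype=float)
    target = X[i]
    obs = ~np.isnan(target)                  # attributes known in the target instance
    obs[h] = False                           # exclude the missing position itself
    complete = np.where(~np.isnan(X).any(axis=1))[0]
    complete = complete[complete != i]
    # k nearest complete instances, L2-norm over the observed attributes of the target
    d = np.linalg.norm(X[complete][:, obs] - target[obs], axis=1)
    nn = complete[np.argsort(d)[:k]]
    A = X[nn][:, obs]                        # k x (#observed) neighbour matrix
    b = X[nn, h]                             # neighbour values at the missing position
    w = target[obs]                          # known elements of the target
    z, *_ = np.linalg.lstsq(A.T, w, rcond=None)   # w ~ z1*a1 + ... + zk*ak
    return float(b @ z)                      # alpha = b^T z
```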
Now, we deal with the case in which there is more than one MV in an instance vector. In this case, to recover a total of q MVs in any of the locations of the instance
x1 , first, the KNN instance vectors for x1 ,
xSTi ∈ R1×n , 1 ≤ i ≤ k,
are found. In this process of finding the similar instances, the q components of
each instance at the q locations of MVs in x1 are ignored. Then, based on these
Table 4.1 Recent and most well-known imputation methods involving ML techniques

Clustering: MLP hybrid [4]; Rough fuzzy subspace clustering [89]; LLS based [47]; Fuzzy c-means with SVR and GAs [3]; Biclustering based [32]; KNN based [46]; Hierarchical clustering [30]; K2 clustering [39]; Weighted K-means [65]; Gaussian mixture clustering [63]
ANNs: RBFN based [90]; Wavelet ANNs [64]; Multi layer perceptron [88]; ANNs framework [34]; Self-organizing maps [58]; Generative Topographic Mapping [95]
Bayesian networks: Dynamic Bayesian networks [11]; Bayesian networks with weights [60]
Kernel methods: Mixture-kernel-based iterative estimator [105]
Nearest neighbors: ICkNNI [40]; Iterative KNNI [101]; CGImpute [22]; Bootstrap for maximum likelihood [72]; kDMI [75]
Ensembles: Random Forest [42]; Decision forest [76]; Group Method of Data Handling (GMDH) [104]; Bootstrap [56]
Similarity and correlation: FIMUS [77]
Parameter estimation for regression imputation: EAs for covariance matrix estimation [31]; Iterative mutual information imputation [102]; CMVE [87]; DMI (EM + decision trees) [74]; WLLSI [12]
Although we have tried to provide an extensive introduction to the most used and basic imputation methods based on ML techniques, there is a great number of journal publications showing their application and particularization to real-world problems. In Table 4.1 we give the reader a summary of the latest and most important imputation methods presented up to the date of publication, both extensions of the methods introduced here and completely novel ones.
In this section we aim to provide the reader with a general overview of the behavior
and properties of all the imputation methods presented above. However, this is not
an easy task. The main question is: what is a good imputation method?
As multiple imputation is a very resource consuming approach, we will focus on
the single imputation methods described in this chapter.
From an unsupervised data point of view, those imputation methods able to generate
values close to the true but unknown MV should be the best. This idea has been
explored in the literature by means of using complete data sets and then artificially
introducing MVs. Please note that such a mechanism will act as a MCAR MV
generator mechanism, validating the use of imputation methods. Then, imputation
methods are applied to the data and we estimate how far the imputed value is from the original (and known) value. Authors usually choose the mean squared error (MSE)
or root mean square error (RMSE) to quantify and compare the imputation methods
over a set of data sets [6, 32, 41, 77].
On the other hand, other problems arise when we do not have the original values
or the problem is supervised. In classification, for example, it is more demanding
to impute values that will constitute an easier and more generalizable problem. As
a consequence in this paradigm a good imputation method will enable the classifier
to obtain better accuracy. This is harder to measure, as we are relating two different
values: the MV itself and the class label assigned to the example. Neither MSE nor RMSE can provide us with this kind of information.
One way to measure how good the imputation is for the supervised task is to
use Wilson’s Noise Ratio. This measure proposed by [98] observes the noise in the
data set. For each instance of interest, the method looks for the KNN (using the
Euclidean distance), and uses the class labels of such neighbors in order to classify
the considered instance. If the instance is not correctly classified, then the variable
noise is increased by one unit. Therefore, the final noise ratio will be
$$\text{Wilson's Noise} = \frac{noise}{\#\,\text{instances in the data set}}.$$
After imputing a data set with different imputation methods, we can measure how
disturbing the imputation method is for the classification task. Thus by using Wilson’s
noise ratio we can observe which imputation methods reduce the impact of the MVs
as a noise, and which methods produce noise when imputing.
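Wilson's noise ratio as defined above can be computed with a short routine such as the following sketch, which uses scikit-learn's NearestNeighbors for the neighbour search (the library choice is ours).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def wilsons_noise_ratio(X, y, k=5):
    """Fraction of instances whose class disagrees with the majority of its k neighbours."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # k+1 so we can skip the point itself
    _, idx = nn.kneighbors(X)
    noise = 0
    for i, neighbours in enumerate(idx):
        votes = y[neighbours[1:]]                     # drop the instance itself
        labels, counts = np.unique(votes, return_counts=True)
        if labels[np.argmax(counts)] != y[i]:
            noise += 1
    return noise / len(y)
```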
Another approach is to use the mutual information (Mui), which is considered to be a good indicator of the relevance between two random variables [18]. Recently, the use of the Mui measure in FS has become well known and successful [51, 52, 66]. The use of the Mui measure for continuous attributes has been tackled by [51], allowing us to compute the Mui measure not only for nominal-valued data sets.
In our approach, we calculate the Mui between each input attribute and the class
attribute, obtaining a set of values, one for each input attribute. In the next step we
compute the ratio between each one of these values, considering the imputation of
the data set with one imputation method in respect to the not imputed data set. The
average of these ratios will show us if the imputation of the data set produces a gain
in information:
$$\text{Avg. Mui Ratio} = \frac{\sum_{x_i \in X} \frac{Mui_{\alpha}(x_i) + 1}{Mui(x_i) + 1}}{|X|},$$
where X is the set of input attributes, Muiα (i) represents the Mui value of the ith
attribute in the imputed data set and Mui(i) is the Mui value of the ith input attribute
in the not imputed data set. We have also applied the Laplace correction, summing
1 to both numerator and denominator, as an Mui value of zero is possible for some
input attributes.
The calculation of Mui(xi ) depends on the type of attribute xi . If the attribute xi
is nominal, the Mui between xi and the class label Y is computed as follows:
$$Mui_{nominal}(x_i) = I(x_i; Y) = \sum_{z \in x_i}\ \sum_{y \in Y} p(z, y)\,\log_2 \frac{p(z, y)}{p(z)\,p(y)}.$$
On the other hand, if the attribute $x_i$ is numeric, we have used the Parzen window density estimate as shown in [51], considering a Gaussian window function. Considering that each sample has the same probability, applying the Bayes rule and approximating $p(y \mid z)$ by the Parzen window, we get
$$\hat{H}(Y \mid x_i) = -\frac{1}{n}\sum_{j=1}^{n}\sum_{y=1}^{N} \hat{p}(y \mid z_j)\,\log_2 \hat{p}(y \mid z_j),$$
where n is the number of instances in the data set, N is the total number of class labels and $\hat{p}(y \mid z)$ is
$$\hat{p}(y \mid z) = \frac{\sum_{i \in I_y} \exp\left(-\frac{(z - z_i)^T \Sigma^{-1} (z - z_i)}{2h^2}\right)}{\sum_{k=1}^{N}\sum_{i \in I_k} \exp\left(-\frac{(z - z_i)^T \Sigma^{-1} (z - z_i)}{2h^2}\right)}.$$
In this case, $I_c$ is the set of indices of the training examples belonging to class c, and Σ is the covariance of the random variable $(z - z_i)$.
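A sketch of the average Mui ratio restricted to nominal attributes is shown below, using scikit-learn's mutual_info_score and the Laplace-style +1 correction described above. Masking the original data set's MVs when computing its Mui values is our own assumption, and the Parzen-window estimate for numeric attributes is omitted.

```python
import numpy as np
from sklearn.metrics import mutual_info_score


def avg_mui_ratio(imputed_df, original_df, class_col="class"):
    """Average per-attribute ratio of class-attribute MI after vs. before imputation.

    Both arguments are assumed to be pandas DataFrames with nominal attributes
    and a fully observed class column named `class_col`.
    """
    attrs = [c for c in original_df.columns if c != class_col]
    ratios = []
    for c in attrs:
        mask = original_df[c].notna()        # MI on the originally observed values only
        mi_orig = mutual_info_score(original_df.loc[mask, c],
                                    original_df.loc[mask, class_col])
        mi_imp = mutual_info_score(imputed_df[c], imputed_df[class_col])
        ratios.append((mi_imp + 1.0) / (mi_orig + 1.0))  # Laplace-style correction
    return float(np.mean(ratios))
```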
Let us consider all the single imputation methods presented in this chapter. For the sake of simplicity we will omit the multiple imputation approaches, as they would require us to select a probability model for every data set, which would be infeasible. In Table 4.2 we have summarized the Wilson's noise ratio values for 21 data sets with MVs from those presented in Sect. 2.1. We must point out that the results of Wilson's noise ratio and the Mui ratio are tied to a given data set. Hence, the characteristics of the data themselves appear to determine the values of these measures.
Looking at the results from Table 4.2 we can observe which imputation methods
reduce the impact of the MVs as noise, and which methods produce noise when
imputing. In addition, the Mui ratio allows us to relate the attributes to the imputation
results. A value of the Mui ratio higher than 1 will indicate that the imputation is
capable of relating more of the attributes individually to the class labels. A value lower
than 1 will indicate that the imputation method is adversely affecting the relationship
between the individual attributes and the class label.
If we consider the average Mui ratio in Table 4.2 we can observe that the average
ratios are usually close to 1; that is, the use of imputation methods appears to harm
the relationship between the class label and the input attribute little or not at all, even
improving it in some cases. However, the Mui considers only one attribute at a time and
therefore the relationships between the input attributes are ignored. The imputation
Table 4.2 Wilson's noise ratio values and average Mui values per data set

Data set  Imp. method  % Wilson's noise  Avg. Mui
CLE       MC           50.0000           0.998195
CLE       CMC          50.0000           0.998585
CLE       KNNI         50.0000           0.998755
CLE       WKNNI        50.0000           0.998795
CLE       KMI          50.0000           0.998798
CLE       FKMI         50.0000           0.998889
CLE       SVMI         50.0000           0.998365
HOV       MC           7.9208            0.961834
HOV       CMC          5.4455            1.105778
HOV       KNNI         7.4257            0.965069
HOV       WKNNI        7.4257            0.965069
HOV       KMI          7.4257            0.961525
HOV       FKMI         7.9208            0.961834
HOV       SVMI         6.9307            0.908067
HEP       MC           17.3333           0.963765
HEP       CMC          16.0000           0.990694
HEP       KNNI         20.0000           0.978564
HEP       WKNNI        20.0000           0.978343
HEP       KMI          20.0000           0.980094
HEP       FKMI         17.3333           0.963476
HEP       SVMI         17.3333           1.006819
(continued)
methods estimate the MVs using such relationships and can afford improvements in
the performance of the classifiers. Hence the highest values of the average Mui ratio could be related to those methods which can obtain better estimates for the MVs while maintaining the degree of relationship between the class labels and the isolated input attributes. It is interesting to note that, when analyzing the Mui ratio, the values do not appear to be as highly data dependent as Wilson's noise ratio, as the values for all the data sets are closer to each other.
If we count the methods with the lowest Wilson’s noise ratios in each data set
in Table 4.2, we find that the CMC method is first, with 12 times being the lowest
one, and the EC method is second with 9 times being the lowest one. If we count the
methods with the highest MI ratio in each data set, the EC method has the highest
ratio for 7 data sets and is therefore the first one. The CMC method has the highest
ratio for 5 data sets and is the second one in this case. Immediately the next question
arises: are these methods also the best for the performance of the learning methods
applied afterwards? We try to answer this question in the following.
Our aim is to use the same imputed data sets used in the previous Sect. 4.6.1 as the input for a series of well-known classifiers, in order to shed light on the question “which is the best imputation method?”. Let us consider a wide range
of classifiers grouped by their nature, as that will help us to limit the comparisons
needed to be made. We have grouped them in three sub-categories. In Table 4.3
we summarize the classification methods we have used, organized in these three
categories. The description of the former categories is as follows:
• The first group is the Rule Induction Learning category. This group refers to
algorithms which infer rules using different strategies.
• The second group represents the Black Box Methods. It includes ANNs, SVMs
and statistical learning.
• The third and last group corresponds to the Lazy Learning (LL) category. This
group incorporates methods which do not create any model, but use the training
data to perform the classification directly.
Some methods do not work with numerical attributes (CN2, AQ and Naïve-Bayes).
In order to discretize the numerical values, we have used the well-known discretizer
proposed by [28]. For the SVM methods (C-SVM, ν-SVM and SMO), we have
applied the usual preprocessing in the literature to these methods [25]. This pre-
processing consists of normalizing the numerical attributes to the [0, 1] range, and
binarizing the nominal attributes. Some of the classification methods presented in the
previous section have their own MV treatment, which is triggered when no imputation
is performed (DNI): C4.5 uses a probabilistic approach to handle MVs, and CN2 applies
the MC method by default in these cases. For ANNs, the authors of [24] proposed replacing MVs
with zero so as not to activate the neuron to which the MV would be fed.
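A minimal sketch of this preprocessing for the SVM-based methods is shown below; the column names are hypothetical (the transformer expects a pandas DataFrame), and the particular scikit-learn transformers are an assumption rather than the exact implementation used in the experiments.

```python
# Sketch only: scale numeric attributes to [0, 1] and one-hot encode nominal ones.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

numeric_cols = ["age", "blood_pressure"]          # hypothetical attribute names
nominal_cols = ["sex", "chest_pain_type"]         # hypothetical attribute names

svm_preprocessing = ColumnTransformer([
    ("scale", MinMaxScaler(feature_range=(0, 1)), numeric_cols),
    ("binarize", OneHotEncoder(handle_unknown="ignore"), nominal_cols),
])
# X_ready = svm_preprocessing.fit_transform(X_train)  # X_train: pandas DataFrame
```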
Since showing all the detailed accuracy values for each fold, data set, imputation
method and classifier would be too long, we have used Wilcoxon's signed rank test
to summarize them. For each classifier, we have compared every imputation method
against the rest in pairs. Every time the classifier obtains a better accuracy value for
one imputation method than for another and the statistical test yields a p-value < 0.1,
we count it as a win for the former imputation method; otherwise (p-value ≥ 0.1) it is
counted as a tie.
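A minimal sketch of this pairwise comparison for one classifier, assuming one accuracy value per data set for each imputation method, is given below.

```python
# Sketch only: Wilcoxon signed rank comparison of two imputation methods.
import numpy as np
from scipy.stats import wilcoxon

def compare_imputations(acc_a, acc_b, alpha=0.1):
    """acc_a, acc_b: paired accuracy arrays (one entry per data set) obtained
    by the same classifier under imputation methods A and B."""
    _, p_value = wilcoxon(acc_a, acc_b)
    if p_value >= alpha:
        return "tie"
    return "A wins" if np.mean(acc_a) > np.mean(acc_b) else "B wins"

# Example: compare_imputations([0.81, 0.77, 0.90], [0.79, 0.74, 0.88])
```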
In the case of rule induction learning, Table 4.4 shows the average ranking
of each imputation method for every classifier belonging to this group. We can
observe that, for the rule induction learning classifiers, the imputation methods FKMI,
SVMI and EC perform best. The differences in average rankings between these three
methods are low. Thus we can consider these three imputation methods
the most suitable for this kind of classifier.
Table 4.4 Average ranks for the Rule Induction Learning methods

Method  C4.5  Ripper  PART  Slipper  AQ    CN2   SRI   Ritio  Rules-6  Avg. rank  Position
IM      5     8.5     1     4        6.5   10    6.5   6      5        5.83       4
EC      2.5   8.5     6.5   1        6.5   5.5   6.5   6      1        4.89       3
KNNI    9     2.5     6.5   11       11    5.5   11.5  11     11       8.78       11
WKNNI   11    2.5     6.5   7        6.5   1     11.5  6      11       7.00       8
KMI     5     2.5     6.5   3        6.5   5.5   9.5   12     7.5      6.44       6
FKMI    7.5   2.5     6.5   10       2     5.5   1     2      3        4.44       1
SVMI    1     5.5     6.5   7        1     5.5   6.5   6      2        4.56       2
EM      13    12      6.5   7        12    13    3     6      4        8.50       10
SVDI    11    11      6.5   12       10    12    9.5   10     11       10.33      12
BPCA    14    13      13    7        13    14    13    13     13       12.56      14
LLSI    11    5.5     6.5   7        6.5   11    3     6      7.5      7.11       9
MC      7.5   8.5     6.5   2        6.5   5.5   3     6      7.5      5.89       5
CMC     5     8.5     12    13       3     5.5   6.5   1      7.5      6.89       7
DNI     2.5   14      14    14       14    5.5   14    14     14       11.78      13
These three methods are well separated from the other imputation methods, and we cannot
single out a best method from among them. On the other hand, BPCA and DNI are the worst methods.
In Table 4.5 we can observe the rankings associated with the methods belonging to
the black-box modelling category. As can be appreciated, for black-box modelling
the differences between imputation methods are even more evident. We can select
the EC method as the best solution, as it has a ranking difference of almost 1 with respect to
KMI, which stands as the second best. This difference increases when considering
the third best, FKMI. No other family of classifiers presents such a gap in the rankings.
Therefore, for this family of classification methods we can, with some confidence,
establish the EC method as the best choice. The DNI and IM methods are among
the worst. This means that for the black-box modelling methods the use of some
kind of MV treatment is mandatory, and the EC method is the most suitable one.
As with the RIL methods, the BPCA method is the worst choice, with the highest
ranking.
Finally the results for the last LL group are presented in Table 4.6. For the LL
models, the MC method is the best with the lowest average ranking. The CMC
method, which is relatively similar to MC, also obtains a low rank very close to
MC’s. Only the FKMI method obtains a low enough rank to be compared with
the MC and CMC methods. The rest of the imputation methods are far from these
lowest ranks with almost two points of difference in the ranking. Again, the DNI and
IM methods obtain high rankings. The DNI method is one of the worst, with only
the BPCA method performing worse. As with the black-box modelling methods,
the imputation methods produce a significant improvement in the accuracy of these
classification methods.
In this last section we have carried out an experimental comparison among the imputation
methods presented in this chapter. We have tried to identify the best imputation
choice by means of non-parametric statistical testing. The results obtained concur
with previous studies:
• The imputation methods which fill in the MVs outperform the case deletion (IM
method) and the lack of imputation (DNI method).
• There is no universal imputation method which performs best for all classifiers.
In Sect. 4.6.1 we have analyzed the influence of the imputation methods on the data
with respect to two measures: Wilson's noise ratio and the
average MI ratio. The first quantifies the noise induced by the imputation
method in the instances which contain MVs. The second examines the increment
or decrement in the relationship of the isolated input attributes with respect to the
class label. We have observed that the CMC and EC methods are the ones which
introduce the least noise and best maintain the MI.
According to the results in Sect. 4.6.2, a particular analysis of the MV treatment
methods conditioned on the groups of classification methods seems necessary.
Thus, we can highlight the recommended imputation algorithms based on
the classification method's type: the FKMI imputation method for
the Rule Induction Learning group, the EC method for the black-box modelling
group and the MC method for the Lazy Learning models. We can confirm the positive
effect of the imputation methods on the classifiers' behavior, and the existence
of imputation methods that are more suitable for some particular classifier categories than
others.
References
1. Acuna, E., Rodriguez, C.: Classification, Clustering and Data Mining Applications. Springer,
Berlin (2004)
2. Atkeson, C.G., Moore, A.W., Schaal, S.: Locally weighted learning. Artif. Intell. Rev. 11,
11–73 (1997)
3. Aydilek, I.B., Arslan, A.: A hybrid method for imputation of missing values using optimized
fuzzy c-means with support vector regression and a genetic algorithm. Inf. Sci. 233, 25–35
(2013)
4. Azim, S., Aggarwal, S.: Hybrid model for data imputation: using fuzzy c-means and multi
layer perceptron. In: Advance Computing Conference (IACC), 2014 IEEE International,
pp. 1281–1285 (2014)
5. Barnard, J., Meng, X.: Applications of multiple imputation in medical studies: from aids to
nhanes. Stat. Methods Med. Res. 8(1), 17–36 (1999)
6. Batista, G., Monard, M.: An analysis of four missing data treatment methods for supervised
learning. Appl. Artif. Intell. 17(5), 519–533 (2003)
7. Bezdek, J., Kuncheva, L.: Nearest prototype classifier designs: an experimental study. Int. J.
Intell. Syst. 16(12), 1445–1473 (2001)
8. Broomhead, D., Lowe, D.: Multivariable functional interpolation and adaptive networks.
Complex Systems 11, 321–355 (1988)
9. van Buuren, S., Groothuis-Oudshoorn, K.: MICE: multivariate imputation by chained equa-
tions in R. J. Stat. Softw. 45(3), 1–67 (2011)
10. le Cessie, S., van Houwelingen, J.: Ridge estimators in logistic regression. Appl. Stat. 41(1),
191–201 (1992)
11. Chai, L., Mohamad, M., Deris, S., Chong, C., Choon, Y., Ibrahim, Z., Omatu, S.: Inferring
gene regulatory networks from gene expression data by a dynamic bayesian network-based
model. In: Omatu, S., De Paz Santana, J.F., González, S.R., Molina, J.M., Bernardos, A.M.,
Rodríguez, J.M.C. (eds.) Distributed Computing and Artificial Intelligence, Advances in Intel-
ligent and Soft Computing, pp. 379–386. Springer, Berlin (2012)
12. Ching, W.K., Li, L., Tsing, N.K., Tai, C.W., Ng, T.W., Wong, A.S., Cheng, K.W.: A weighted
local least squares imputation method for missing value estimation in microarray gene expres-
sion data. Int. J. Data Min. Bioinform. 4(3), 331–347 (2010)
13. Chow, C., Liu, C.: Approximating discrete probability distributions with dependence trees.
IEEE Trans. Inf. Theor. 14(3), 462–467 (1968)
14. Clark, P., Niblett, T.: The CN2 induction algorithm. Machine Learning 3(4), 261–283 (1989)
15. Cohen, W., Singer, Y.: A simple, fast, and effective rule learner. In: Proceedings of the
Sixteenth National Conference on Artificial Intelligence, pp. 335–342 (1999)
16. Cohen, W.W.: Fast effective rule induction. In: Proceedings of the Twelfth International Con-
ference on Machine Learning (ICML), pp. 115–123 (1995).
17. Cortes, C., Vapnik, V.: Support vector networks. Machine Learning 20, 273–297 (1995)
18. Cover, T.M., Thomas, J.A.: Elements of Information Theory, 2 edn. Wiley, New York (1991)
19. Daniel, R.M., Kenward, M.G.: A method for increasing the robustness of multiple imputation.
Comput. Stat. Data Anal. 56(6), 1624–1643 (2012)
20. Dempster, A., Laird, N., Rubin, D.: Maximum likelihood estimation from incomplete data
via the EM algorithm (with discussion). J. Roy. Statist. Soc. Ser. B 39, 1–38 (1977)
21. Domingos, P., Pazzani, M.: On the optimality of the simple bayesian classifier under zero-one
loss. Machine Learning 29, 103–137 (1997)
22. Dorri, F., Azmi, P., Dorri, F.: Missing value imputation in dna microarrays based on conjugate
gradient method. Comp. Bio. Med. 42(2), 222–227 (2012)
23. Dunning, T., Freedman, D.: Modeling selection effects, Sage, pp. 225–231 (2008)
24. Ennett, C.M., Frize, M., Walker, C.R.: Influence of missing values on artificial neural network
performance. Stud. Health Technol. Inform. 84, 449–453 (2001)
25. Fan, R.E., Chen, P.H., Lin, C.J.: Working set selection using second order information for
training support vector machines. J. Machine Learning Res. 6, 1889–1918 (2005)
26. Farhangfar, A., Kurgan, L., Dy, J.: Impact of imputation of missing values on classification
error for discrete data. Pattern Recognit. 41(12), 3692–3705 (2008). http://dx.doi.org/10.
1016/j.patcog.2008.05.019
27. Farhangfar, A., Kurgan, L.A., Pedrycz, W.: A novel framework for imputation of missing
values in databases. IEEE Trans. Syst. Man Cybern. Part A 37(5), 692–709 (2007)
28. Fayyad, U., Irani, K.: Multi-interval discretization of continuous-valued attributes for classi-
fication learning. In: 13th International Joint Conference on Artificial Intelligence (IJCAI-93),
pp. 1022–1029 (1993)
29. Feng, H., Guoshun, C., Cheng, Y., Yang, B., Chen, Y.: A SVM regression based approach
to filling in missing values. In: Khosla, R., Howlett, R.J., Jain, L.C. (eds.) KES (3), Lecture
Notes in Computer Science, vol. 3683, pp. 581–587. Springer, Berlin (2005)
30. Feng, X., Wu, S., Liu, Y.: Imputing missing values for mixed numeric and categorical attributes
based on incomplete data hierarchical clustering. In: Proceedings of the 5th International
Conference on Knowledge Science, Engineering and Management, KSEM’11, pp. 414–424
(2011)
31. Figueroa García, J.C., Kalenatic, D., Lopez Bello, C.A.: Missing data imputation in multi-
variate data by evolutionary algorithms. Comput. Hum. Behav. 27(5), 1468–1474 (2011)
32. de França, F.O., Coelho, G.P., Zuben, F.J.V.: Predicting missing values with biclustering: a
coherence-based approach. Pattern Recognit. 46(5), 1255–1266 (2013)
33. Frank, E., Witten, I.: Generating accurate rule sets without global optimization. In: Proceed-
ings of the 15th International Conference on Machine Learning, pp. 144–151 (1998)
34. Gheyas, I.A., Smith, L.S.: A neural network-based framework for the reconstruction of incom-
plete data sets. Neurocomputing 73(16–18), 3039–3065 (2010)
35. Gibert, K.: Mixed intelligent-multivariate missing imputation. Int. J. Comput. Math. 91(1),
85–96 (2014)
36. Grzymala-Busse, J., Goodwin, L., Grzymala-Busse, W., Zheng, X.: Handling missing attribute
values in preterm birth data sets. In: 10th International Conference of Rough Sets and Fuzzy
Sets and Data Mining and Granular Computing(RSFDGrC05), pp. 342–351 (2005)
37. Grzymala-Busse, J.W., Hu, M.: A comparison of several approaches to missing attribute values
in data mining. In: Ziarko, W., Yao, Y.Y. (eds.) Rough Sets and Current Trends in Computing,
Lecture Notes in Computer Science, vol. 2005, pp. 378–385. Springer, Berlin (2000)
38. Howell, D.: The analysis of missing data. SAGE Publications Ltd, London (2007)
39. Hruschka Jr, E.R., Ebecken, N.F.F.: Missing values prediction with k2. Intell. Data Anal. 6(6),
557–566 (2002)
40. Hulse, J.V., Khoshgoftaar, T.M.: Incomplete-case nearest neighbor imputation in software
measurement data. Inf. Sci. 259, 596–610 (2014)
41. Ingsrisawang, L., Potawee, D.: Multiple imputation for missing data in repeated measurements
using MCMC and copulas, pp. 1606–1610 (2012)
42. Ishioka, T.: Imputation of missing values for unsupervised data using the proximity in random
forests. In: eLmL 2013, The 5th International Conference on Mobile, Hybrid, and On-line
Learning, pp. 30–36 (2013)
43. Jamshidian, M., Jalal, S., Jansen, C.: Missmech: an R package for testing homoscedasticity,
multivariate normality, and missing completely at random (mcar). J. Stat. Softw. 56(6), 1–31
(2014)
44. Joenssen, D.W., Bankhofer, U.: Hot deck methods for imputing missing data: the effects
of limiting donor usage. In: Proceedings of the 8th International Conference on Machine
Learning and Data Mining in Pattern Recognition, MLDM’12, pp. 63–75 (2012)
45. Juhola, M., Laurikkala, J.: Missing values: how many can they be to preserve classification
reliability? Artif. Intell. Rev. 40(3), 231–245 (2013)
46. Keerin, P., Kurutach, W., Boongoen, T.: Cluster-based knn missing value imputation for
dna microarray data. In: Systems, Man, and Cybernetics (SMC), 2012 IEEE International
Conference on, pp. 445–450. IEEE (2012)
47. Keerin, P., Kurutach, W., Boongoen, T.: An improvement of missing value imputation in
dna microarray data using cluster-based lls method. In: Communications and Information
Technologies (ISCIT), 2013 13th International Symposium on, pp. 559–564 (2013)
48. Khan, S.S., Hoey, J., Lizotte, D.J.: Bayesian multiple imputation approaches for one-class
classification. In: Kosseim, L., Inkpen, D. (eds.) Advances in Artificial Intelligence - 25th
Canadian Conference on Artificial Intelligence, Canadian AI 2012, Toronto, ON, Canada,
Proceedings, pp. 331–336. 28–30 May 2012
49. Kim, H., Golub, G.H., Park, H.: Missing value estimation for dna microarray gene expression
data: local least squares imputation. Bioinform. 21(2), 187–198 (2005)
50. Krzanowski, W.: Multiple discriminant analysis in the presence of mixed continuous and
categorical data. Comput. Math. Appl. 12(2, Part A), 179–185 (1986)
51. Kwak, N., Choi, C.H.: Input feature selection by mutual information based on parzen window.
IEEE Trans. Pattern Anal. Mach. Intell. 24(12), 1667–1671 (2002)
52. Kwak, N., Choi, C.H.: Input feature selection for classification problems. IEEE Trans. Neural
Networks 13(1), 143–159 (2002)
53. Li, D., Deogun, J., Spaulding, W., Shuart, B.: Towards missing data imputation: a study of
fuzzy k-means clustering method. In: 4th International Conference of Rough Sets and Current
Trends in Computing (RSCTC04), pp. 573–579 (2004)
54. Little, R.J.A., Rubin, D.B.: Statistical Analysis with Missing Data, 1st edn. Wiley Series in
Probability and Statistics, New York (1987)
55. Little, R.J.A., Schluchter, M.D.: Maximum likelihood estimation for mixed continuous and
categorical data with missing values. Biometrika 72, 497–512 (1985)
56. Lu, X., Si, J., Pan, L., Zhao, Y.: Imputation of missing data using ensemble algorithms. In:
Fuzzy Systems and Knowledge Discovery (FSKD), 2011 8th International Conference on,
vol. 2, pp. 1312–1315 (2011)
57. McLachlan, G.: Discriminant Analysis and Statistical Pattern Recognition. Wiley, New
York(2004)
58. Merlin, P., Sorjamaa, A., Maillet, B., Lendasse, A.: X-SOM and L-SOM: a double classifica-
tion approach for missing value imputation. Neurocomputing 73(7–9), 1103–1108 (2010)
59. Michalski, R., Mozetic, I., Lavrac, N.: The multipurpose incremental learning system AQ15
and its testing application to three medical domains. In: 5th National Conference on Artificial
Intelligence (AAAI86), pp. 1041–1045 (1986)
60. Miyakoshi, Y., Kato, S.: Missing value imputation method by using Bayesian network with
weighted learning. IEEJ Trans. Electron. Inf. Syst. 132, 299–305 (2012)
61. Moller, F.: A scaled conjugate gradient algorithm for fast supervised learning. Neural Net-
works 6, 525–533 (1990)
62. Oba, S., Sato, M., Takemasa, I., Monden, M., Matsubara, K., Ishii, S.: A bayesian
missing value estimation method for gene expression profile data. Bioinform. 19(16), 2088–
2096 (2003)
63. Ouyang, M., Welsh, W.J., Georgopoulos, P.: Gaussian mixture clustering and imputation of
microarray data. Bioinform. 20(6), 917–923 (2004)
64. Panigrahi, L., Ranjan, R., Das, K., Mishra, D.: Removal and interpolation of missing values
using wavelet neural network for heterogeneous data sets. In: Proceedings of the International
Conference on Advances in Computing, Communications and Informatics, ICACCI ’12, pp.
1004–1009 (2012)
65. Patil, B., Joshi, R., Toshniwal, D.: Missing value imputation based on k-mean clustering with
weighted distance. In: Ranka, S., Banerjee, A., Biswas, K., Dua, S., Mishra, P., Moona, R.,
Poon, S.H., Wang, C.L. (eds.) Contemporary Computing, Communications in Computer and
Information Science, vol. 94, pp. 600–609. Springer, Berlin (2010)
66. Peng, H., Long, F., Ding, C.: Feature selection based on mutual information: Criteria of max-
dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell.
27(8), pp. 1226–1238 (2005)
67. Pham, D.T., Afify, A.A.: Rules-6: a simple rule induction algorithm for supporting decision
making. In: Industrial Electronics Society, 2005. IECON 2005. 31st Annual Conference of
IEEE, pp. 2184–2189 (2005)
68. Pham, D.T., Afify, A.A.: SRI: a scalable rule induction algorithm. Proc. Inst. Mech. Eng. [C]:
J. Mech. Eng. Sci. 220, 537–552 (2006)
69. Platt, J.: A resource allocating network for function interpolation. Neural Comput. 3(2), 213–
225 (1991)
70. Platt, J.C.: Fast training of support vector machines using sequential minimal optimization. In:
Advances in Kernel Methods: Support Vector Learning, pp. 185–208. MIT Press, Cambridge
(1999)
71. Pyle, D.: Data Preparation for Data Mining. Morgan Kaufmann Publishers Inc., San Francisco
(1999)
72. Qin, Y., Zhang, S., Zhang, C.: Combining knn imputation and bootstrap calibrated empirical
likelihood for incomplete data analysis. Int. J. Data Warehouse. Min. 6(4), 61–73 (2010)
73. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San
Francisco (1993)
74. Rahman, G., Islam, Z.: A decision tree-based missing value imputation technique for data
pre-processing. In: Proceedings of the 9th Australasian Data Mining Conference - Volume
121, AusDM ’11, pp. 41–50 (2011)
75. Rahman, M., Islam, M.: KDMI: a novel method for missing values imputation using two
levels of horizontal partitioning in a data set. In: Motoda, H., Wu, Z., Cao, L., Zaiane, O.,
Yao, M., Wang, W. (eds.) Advanced Data Mining and Applications. Lecture Notes in Computer
Science, vol. 8347, pp. 250–263. Springer, Berlin (2013)
76. Rahman, M.G., Islam, M.Z.: Missing value imputation using decision trees and decision
forests by splitting and merging records: two novel techniques. Know.-Based Syst. 53, 51–65
(2013)
77. Rahman, M.G., Islam, M.Z.: Fimus: a framework for imputing missing values using co-
appearance, correlation and similarity analysis. Know.-Based Syst. 56, 311–327 (2014)
78. Royston, P., White, I.R.: Multiple imputation by chained equations (MICE): implementation
in STATA. J. Stat. Softw. 45(4), 1–20 (2011)
79. Rubin, D.B.: Inference and missing data. Biometrika 63(3), 581–592 (1976)
80. Rubin, D.B.: Multiple Imputation for Nonresponse in Surveys. Wiley, New York (1987)
81. Safarinejadian, B., Menhaj, M., Karrari, M.: A distributed EM algorithm to estimate the
parameters of a finite mixture of components. Knowl. Inf. Syst. 23(3), 267–292 (2010)
82. Schafer, J.L.: Analysis of Incomplete Multivariate Data. Chapman & Hall, London (1997)
83. Schafer, J.L., Olsen, M.K.: Multiple imputation for multivariate missing-data problems: a
data analyst’s perspective. Multivar. Behav. Res. 33(4), 545–571 (1998)
84. Scheuren, F.: Multiple imputation: how it began and continues. Am. Stat. 59, 315–319 (2005)
85. Schneider, T.: Analysis of incomplete climate data: estimation of mean values and covariance
matrices and imputation of missing values. J. Clim. 14, 853–871 (2001)
86. Schomaker, M., Heumann, C.: Model selection and model averaging after multiple imputation.
Comput. Stat. Data Anal. 71, 758–770 (2014)
87. Sehgal, M.S.B., Gondal, I., Dooley, L.: Collateral missing value imputation: a new robust
missing value estimation algorithm for microarray data. Bioinform. 21(10), 2417–2423 (2005)
88. Silva-Ramírez, E.L., Pino-Mejías, R., López-Coello, M., Cubiles-de-la Vega, M.D.: Missing
value imputation on missing completely at random data using multilayer perceptrons. Neural
Networks 24(1), 121–129 (2011)
89. Simński, K.: Rough fuzzy subspace clustering for data with missing values. Comput. Inform.
33(1), 131–153 (2014)
90. Somasundaram, R., Nedunchezhian, R.: Radial basis function network dependent exclusive
mutual interpolation for missing value imputation. J. Comput. Sci. 9(3), 327–334 (2013)
91. Tanner, M.A., Wong, W.: The calculation of posterior distributions by data augmentation. J.
Am. Stat. Assoc. 82, 528–540 (1987)
92. Ting, J., Yu, B., Yu, D., Ma, S.: Missing data analyses: a hybrid multiple imputation algorithm
using gray system theory and entropy based on clustering. Appl. Intell. 40(2), 376–388 (2014)
93. Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein,
D., Altman, R.B.: Missing value estimation methods for dna microarrays. Bioinform. 17(6),
520–525 (2001)
94. Unnebrink, K., Windeler, J.: Intention-to-treat: methods for dealing with missing values in
clinical trials of progressively deteriorating diseases. Stat. Med. 20(24), 3931–3946 (2001)
95. Vellido, A.: Missing data imputation through GTM as a mixture of t-distributions. Neural
Networks 19(10), 1624–1635 (2006)
96. Wang, H., Wang, S.: Mining incomplete survey data through classification. Knowl. Inf. Syst.
24(2), 221–233 (2010)
97. Williams, D., Liao, X., Xue, Y., Carin, L., Krishnapuram, B.: On classification with incomplete
data. IEEE Trans. Pattern Anal. Mach. Intell. 29(3), 427–436 (2007)
98. Wilson, D.: Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans.
Syst. Man Cybern. 2(3), 408–421 (1972)
99. Wong, A.K.C., Chiu, D.K.Y.: Synthesizing statistical knowledge from incomplete mixed-
mode data. IEEE Trans. Pattern Anal. Mach. Intell. 9(6), 796–805 (1987)
100. Wu, X., Urpani, D.: Induction by attribute elimination. IEEE Trans. Knowl. Data Eng. 11(5),
805–812 (1999)
101. Zhang, S.: Nearest neighbor selection for iteratively knn imputation. J. Syst. Softw. 85(11),
2541–2552 (2012)
102. Zhang, S., Wu, X., Zhu, M.: Efficient missing data imputation for supervised learning. In:
Cognitive Informatics (ICCI), 2010 9th IEEE International Conference on, pp. 672–679 (2010)
103. Zheng, Z., Webb, G.I.: Lazy learning of bayesian rules. Machine Learning 41(1), 53–84 (2000)
104. Zhu, B., He, C., Liatsis, P.: A robust missing value imputation method for noisy data. Appl.
Intell. 36(1), 61–74 (2012)
105. Zhu, X., Zhang, S., Jin, Z., Zhang, Z., Xu, Z.: Missing value estimation for mixed-attribute
data sets. IEEE Transactions on Knowl. Data Eng. 23(1), 110–121 (2011)
Chapter 5
Dealing with Noisy Data
Abstract This chapter focuses on the noise imperfections of the data. The presence
of noise in data is a common problem that produces several negative consequences
in classification problems. Noise is an unavoidable problem, which affects the data
collection and data preparation processes in Data Mining applications, where errors
commonly occur. The performance of the models built under such circumstances
will heavily depend on the quality of the training data, but also on the robustness
against the noise of the model learner itself. Hence, problems containing noise are
complex, and accurate solutions are often difficult to achieve without
specialized techniques, particularly if the learner is noise-sensitive. Identifying the noise
is a complex task that is developed in Sect. 5.1. Once the noise has been identified,
the different kinds of this imperfection are described in Sect. 5.2. From this
point on, the two main approaches found in the literature are described: on the
one hand, modifying and cleaning the data is studied in Sect. 5.3, whereas designing
noise-robust Machine Learning algorithms is tackled in Sect. 5.4. An empirical
comparison between the latest approaches in the specialized literature is made in
Sect. 5.5.
Real-world data is never perfect and often suffers from corruptions that may harm
interpretations of the data, models built and decisions made. In classification, noise
can negatively affect the system performance in terms of classification accuracy,
building time, size and interpretability of the classifier built [99, 100]. The presence
of noise in the data may affect the intrinsic characteristics of a classification prob-
lem. Noise may create small clusters of instances of a particular class in parts of the
instance space corresponding to another class, remove instances located in key areas
within a concrete class or disrupt the boundaries of the classes and increase over-
lapping among them. These alterations corrupt the knowledge that can be extracted
from the problem and spoil the classifiers built from that noisy data with respect
to the original classifiers built from the clean data that represent the most accurate
implicit knowledge of the problem [100].
Fig. 5.1 Examples of the interaction between classes: a small disjuncts and b overlapping between
classes
Fig. 5.2 The three types of examples considered in this book: safe examples (labeled as s), bor-
derline examples (labeled as b) and noisy examples (labeled as n). The continuous line shows the
decision boundary between the two classes
A large number of components determine the quality of a data set [90]. Among them,
the class labels and the attribute values directly influence the quality of a classification
data set. The quality of the class labels refers to whether the class of each example is
correctly assigned, whereas the quality of the attributes refers to their capability of
properly characterizing the examples for classification purposes. Obviously, if noise
affects attribute values, this capability of characterization, and therefore the quality
of the attributes, is reduced. Based on these two information sources, two types of
noise can be distinguished in a given data set [12, 96]:
1. Class noise (also referred to as label noise) It occurs when an example is incorrectly
labeled. Class noise can be attributed to several causes, such as subjectivity during
the labeling process, data entry errors, or inadequacy of the information used to
label each example. Two types of class noise can be distinguished:
• Contradictory examples There are duplicate examples in the data set having
different class labels [31].
• Misclassifications Examples are labeled with class labels different from their
true label [102].
2. Attribute noise It refers to corruptions in the values of one or more attributes.
Examples of attribute noise are: erroneous attribute values, missing or unknown
attribute values, and incomplete attributes or “do not care” values.
In this book, class noise refers to misclassifications, whereas attribute noise refers
to erroneous attribute values, because they are the most common in real-world data
[100]. Furthermore, erroneous attribute values, unlike other types of attribute noise,
such as MVs (which are easily detectable), have received less attention in the litera-
ture.
Treating class and attribute noise as corruptions of the class labels and attribute
values, respectively, has also been considered in other works in the literature [69,
100]. For instance, in [100], the authors reached a series of interesting conclusions,
showing that attribute noise is more harmful than class noise or that eliminating
or correcting examples in data sets with class and attribute noise, respectively, may
improve classifier performance. They also showed that attribute noise is more harmful
in those attributes highly correlated with the class labels. In [69], the authors checked
the robustness of methods from different paradigms, such as probabilistic classifiers,
decision trees, instance based learners or SVMs, studying the possible causes of their
behavior.
However, most of the works found in the literature are only focused on class
noise. In [9], the problem of multi-class classification in the presence of labeling
errors was studied. The authors proposed a generative multi-class classifier to learn
with labeling errors, which extends the multi-class quadratic normal discriminant
analysis by a model of the mislabeling process. They demonstrated the benefits
of this approach in terms of parameter recovery as well as improved classification
performance. In [32], the problems caused by labeling errors occurring far from
the decision boundaries in Multi-class Gaussian Process Classifiers were studied.
The authors proposed a Robust Multi-class Gaussian Process Classifier, introducing
binary latent variables that indicate when an example is mislabeled. Similarly, the
effect of mislabeled samples appearing in gene expression profiles was studied in
[98]. A detection method for these samples was proposed, which takes advantage of
the measuring effect of data perturbations based on the SVM regression model. They
also proposed three algorithms based on this index to detect mislabeled samples. An
important common characteristic of these works, also considered in this book, is that
the suitability of the proposals was evaluated on both real-world and synthetic or
noisy-modified real-world data sets, where the noise could be somehow quantified.
In order to model class and attribute noise, we consider four different synthetic
noise schemes found in the literature, so that we can simulate the behavior of the
classifiers in the presence of noise as presented in the next section.
Traditionally, the mechanism by which label noise is introduced has attracted less
attention than its consequences on the knowledge extracted from the data. However, as
noise treatment is increasingly embedded in classifier design, the nature of the noise
becomes more and more important. Recently, Frenay and Verleysen
[19] have adapted the statistical analysis of the MVs introduction described
Fig. 5.3 Statistical taxonomy of label noise as described in [19]. a Noisy completely at random
(NCAR), b Noisy at random (NAR), and c Noisy not at random (NNAR). X is the array of input
attributes, Y is the true class label, Ŷ is the observed class label and E indicates whether a labeling
error occurred (Y ≠ Ŷ). Arrows indicate statistical dependencies
in Sect. 4.2. That is, we distinguish between three possible statistical models
for label noise, as depicted in Fig. 5.3. In the three subfigures of Fig. 5.3 the dashed
arrow points out the implicit relation between the input features and the output that
the classifier is intended to model. In the simplest case, in which the
noise occurrence does not depend on either the true value of the class Y or the input
attribute values X, the label noise is called noisy completely at random or NCAR, as
shown in Fig. 5.3a. In [7] the observed label differs from the true class with a
probability pn = P(E = 1), which is also called the error rate or noise rate. In binary
classification problems, the labeling error in NCAR is applied symmetrically to both
class labels, and when pn = 0.5 the labels no longer provide useful information.
In multiclass problems, when a noise error occurs (i.e., E = 1) the
class label is replaced by any other available label. When the erroneous class label is
selected with a uniform probability distribution,
the noise model is known as uniform label/class noise.
Things get more complicated in the noisy at random (NAR) model. Although
the noise is independent of the inputs X, the true value of the class makes an example more
or less prone to being noisy. This asymmetric labeling error can be produced by the
different cost of extracting the true class, as for example in medical case-control
studies, financial score assets and so on. Since the wrong class label depends on the
particular true class label, the labeling probabilities can be defined as:
P(Ŷ = ŷ | Y = y) = Σ_{e ∈ {0,1}} P(Ŷ = ŷ | E = e, Y = y) P(E = e | Y = y).     (5.1)
Of course, this probability definition spans all the class labels and the possibly
erroneous classes that Ŷ could take. As shown in [70], this forms a transition
matrix γ where each entry γij gives the probability P(Ŷ = ci | Y = cj) for
the possible class labels ci and cj. Some examples are examined in detail in
[19]. The NCAR model is a special case of the NAR label noise model in which
the occurrence of an error E is independent of the true label Y, so each entry γij
is determined by the noise rate pn rather than by the particular true class cj.
Apart from the uniform class noise, the NAR label noise model has been widely studied in
the literature. An example is the pairwise label noise, in which examples of two selected
classes are mislabeled as each other with a certain probability. In this pairwise label
noise (or pairwise class noise) only two positions of the γ matrix outside the diagonal
are nonzero. Another issue arising from the NAR noise model is that it is not
trivial to decide whether the class labels are useful or not.
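As a hedged illustration, the following sketch injects this kind of pairwise label noise; the class indices c_a, c_b and the probability p are placeholders, and the commented matrix follows the γij = P(Ŷ = ci | Y = cj) convention used above.

```python
# Sketch only: pairwise (NAR) label noise from class c_a to class c_b.
import numpy as np

def pairwise_label_noise(y, c_a, c_b, p, seed=None):
    rng = np.random.default_rng(seed)
    y_noisy = y.copy()
    flip = (y == c_a) & (rng.random(len(y)) < p)   # only class c_a is affected
    y_noisy[flip] = c_b
    return y_noisy

# For 3 classes and noise only from class 0 to class 1, the transition matrix is
# (columns indexed by the true class c_j, rows by the observed class c_i):
#   gamma = [[1-p, 0, 0],
#            [ p , 1, 0],
#            [ 0 , 0, 1]]
```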
The third and last noise model is the noisy not at random (NNAR) model, in which the input
attributes somehow affect the probability of the class label being erroneous, as shown
in Fig. 5.3c. An example of this is illustrated by Klebanov [49], where evidence is given
that difficult samples are randomly labeled. It also occurs that examples similar
to existing ones are labeled by experts in a biased way, having a higher probability of
being mislabeled the more similar they are. The NNAR model is the most general case
of class noise [59], where the error E depends on both X and Y, and it is the only
model able to characterize mislabelings at the class borders or due to poor sampling
density. As shown in [19], the probability of error is much more complex than in the
two previous cases, as it has to take into account the density function of the input over
the input feature space X when the latter is continuous:
pn = P(E = 1) = Σ_{ci ∈ C} P(Y = ci) ∫_{x ∈ X} P(X = x | Y = ci) P(E = 1 | X = x, Y = ci) dx.     (5.2)
As a consequence, the perfect identification and estimation of NNAR noise is
almost impossible, and one must rely on approximating it from expert knowledge of the
problem and the domain.
In the case of attribute noise, the modelization described above can be extended
and adapted. In this case, we can distinguish three possibilities as well:
• When the noise appearance does not depend either on the rest of the input features'
values or on the class label, the NCAR noise model applies. This type of noise can
occur when distortions in the measurements appear at random, for example due to faulty
manual data insertion or network errors that do not depend on the data content itself.
• When the attribute noise depends on the true value xi but not on the rest of the input
values x1 , . . . , xi−1 , xi+1 , . . . , xn or the observed class label y, the NAR model is
applicable. An illustrative example is climatic data in which different temperatures affect
their own registration in a different way, depending on the temperature value itself.
• In the last case the noise probability depends on the value of the feature xi
but also on the rest of the input feature values x1 , . . . , xi−1 , xi+1 , . . . , xn . This is
a very complex situation in which the value is altered when the rest of the features
present a particular combination of values, as in medical diagnosis when some test
results are filled in from an expert's prediction, without conducting the test, due to its
high cost.
For the sake of brevity we will not develop the probability error equations here, as their
expressions would vary depending on the nature of the input feature, differing between
real-valued and nominal attributes. However, we must point out
that in attribute noise the probability dependencies are not the only important aspect
to be considered. The probability distribution of the noise values is also fundamental.
For numerical data, the noisy datum x̂i may be a slight variation of the true value xi
or a completely random value. The density function of the noise values is very rarely
known. A simple example of the first type of noise is a perturbation drawn from a
normal distribution with mean centred on the true value and a fixed variance.
The second type of noise is usually simulated by assigning a uniform probability
to all the possible values of the input feature's range. This procedure is also typical
with nominal data, where no value is preferred over another. Again, note that the
distribution of the noise values is not the same as the probability of noise appearance
discussed above: first the noise is introduced with a certain probability (following the
NCAR, NAR or NNAR models) and then the noisy value is drawn from the
aforementioned density functions.
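A minimal sketch of these two value distributions is shown below; which instances are corrupted (noisy_mask) is assumed to be decided beforehand by one of the NCAR/NAR/NNAR appearance models, and sigma is a free parameter.

```python
# Sketch only: two ways of generating the corrupted value of a numeric attribute.
import numpy as np

def gaussian_attribute_noise(x, noisy_mask, sigma, seed=None):
    """Perturb the true values with a normal distribution centred on them."""
    rng = np.random.default_rng(seed)
    x_noisy = x.astype(float).copy()
    x_noisy[noisy_mask] += rng.normal(0.0, sigma, noisy_mask.sum())
    return x_noisy

def uniform_attribute_noise(x, noisy_mask, seed=None):
    """Replace the true values with values drawn uniformly from the attribute range."""
    rng = np.random.default_rng(seed)
    x_noisy = x.astype(float).copy()
    x_noisy[noisy_mask] = rng.uniform(x.min(), x.max(), noisy_mask.sum())
    return x_noisy
```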
Checking the effect of noisy data on the performance of classifier learning algorithms
is necessary to improve their reliability and has motivated the study of how to generate
and introduce noise into the data. Noise generation can be characterized by three main
characteristics [100]:
1. The place where the noise is introduced Noise may affect the input attributes
or the output class, impairing the learning process and the resulting model.
2. The noise distribution The way in which the noise is present can be, for example,
uniform [84, 104] or Gaussian [100, 102].
3. The magnitude of generated noise values The extent to which the noise affects
the data set can be relative to each data value of each attribute, or relative to the
minimum, maximum and standard deviation for each attribute [100, 102, 104].
In contrast to other studies in the literature, this book aims to clearly explain
how noise is defined and generated, and also to properly justify the choice of the
noise introduction schemes. Furthermore, the noise generation software has been
incorporated into the KEEL tool (see Chap. 10) for its free usage. The two types of
noise considered in this work, class and attribute noise, have been modeled using
four different noise schemes, in such a way that the presence of these types of noise
will allow one to simulate the behavior of the classifiers in these two scenarios:
1. Class noise usually occurs on the boundaries of the classes, where the examples
may have similar characteristics, although it can occur in any other area of
the domain. In this book, class noise is introduced using a uniform class noise
scheme [84] (randomly corrupting the class labels of the examples) and a pairwise
class noise scheme [100, 102] (labeling examples of the majority class with the
second majority class). Considering these two schemes, noise affecting any class
label and noise affecting only the two majority classes is simulated respectively.
Whereas the former can be used to simulate a NCAR noise model, the latter is
useful to produce a particular NAR noise model.
2. Attribute noise can proceed from several sources, such as transmission con-
straints, faults in sensor devices, irregularities in sampling and transcription errors
[85]. The erroneous attribute values can be totally unpredictable, i.e., random,
or imply a low variation with respect to the correct value. We use the uniform
attribute noise scheme [100, 104] and the Gaussian attribute noise scheme in
order to simulate each one of the possibilities, respectively. We introduce attribute
noise in accordance with the hypothesis that interactions between attributes are
weak [100]; as a consequence, the noise introduced into each attribute has a low
correlation with the noise introduced into the rest.
Robustness is the capability of an algorithm to build models that are insensitive to
data corruptions and suffer less from the impact of noise [39]. Thus, a classification
algorithm is said to be more robust than another if it builds classifiers which
are less influenced by noise. In order to analyze the
degree of robustness of the classifiers in the presence of noise, we will compare the
performance of the classifiers learned with the original (without induced noise) data
set against the performance of the classifiers learned using the noisy data set. Therefore,
those classifiers learned from noisy data sets whose results are more similar to those
of the noise-free classifiers will be the most robust ones.
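A hedged sketch of this comparison is shown below: the same learner is trained on the clean and on the noise-injected training set and evaluated on a clean test set. The relative loss of accuracy used here is one common choice in the literature, not necessarily the exact robustness measure employed elsewhere in the book.

```python
# Sketch only: relative loss of accuracy as a simple robustness indicator.
from sklearn.base import clone
from sklearn.metrics import accuracy_score

def relative_loss_of_accuracy(learner, X_tr, y_tr, y_tr_noisy, X_te, y_te):
    acc_clean = accuracy_score(y_te, clone(learner).fit(X_tr, y_tr).predict(X_te))
    acc_noisy = accuracy_score(y_te, clone(learner).fit(X_tr, y_tr_noisy).predict(X_te))
    return (acc_clean - acc_noisy) / acc_clean   # smaller values mean a more robust learner
```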
Noise filters are preprocessing mechanisms to detect and eliminate noisy instances in
the training set. The result of noise elimination in preprocessing is a reduced training
set which is used as an input to a classification algorithm. The separation of noise
detection and learning has the advantage that noisy instances do not influence the
classifier building design [24].
Noise filters are generally oriented to detect and eliminate instances with class
noise from the training data. Elimination of such instances has been shown to be
advantageous [23]. However, the elimination of instances with attribute noise seems
counterproductive [74, 100] since instances with attribute noise still contain valuable
information in other attributes which can help to build the classifier. It is also hard
to distinguish between noisy examples and true exceptions, and hence many
techniques have been proposed to deal with noisy data sets, with different degrees of
success.
We will consider three noise filters designed to deal with mislabeled instances
as they are the most common and the most recent: the Ensemble Filter [11], the
Cross-Validated Committees Filter [89] and the Iterative-Partitioning Filter [48]. It
should be noted that these three methods are ensemble-based and vote-based filters.
A motivation for using ensembles for filtering is pointed out in [11]: when it is
assumed that some instances in the data have been mislabeled and that the label errors
are independent of the particular model being fitted to the data, collecting information
from different models will provide a better method for detecting mislabeled instances
than collecting information from a single model.
The implementations of these three noise filters can be found in KEEL (see
Chap. 10). Their descriptions can be found in the following subsections. In all descriptions
we use DT to refer to the training set, DN to refer to the noisy data identified
in the training set (initially, DN = ∅), and Γ to denote the number of folds in which the
training data is partitioned by the noise filter.
The three noise filters presented below use a voting scheme to determine which
instances to eliminate from the training set. There are two possible schemes to deter-
mine which instances to remove: consensus and majority schemes. The consensus
scheme removes an instance if it is misclassified by all the classifiers, while the
majority scheme removes an instance if it is misclassified by more than half of the
classifiers. Consensus filters are characterized by being conservative in rejecting
good data at the expense of retaining bad data. Majority filters are better at detecting
bad data at the expense of rejecting good data.
The Ensemble Filter (EF) [11] is a well-known filter in the literature. It attempts to
achieve an improvement in the quality of the training data as a preprocessing step
in classification, by detecting and eliminating mislabeled instances. It uses a set of
learning algorithms to create classifiers in several subsets of the training data that
serve as noise filters for the training set.
The identification of potentially noisy instances is carried out by performing a
Γ-FCV on the training data with μ classification algorithms, called filter algorithms.
In the developed experimentation for this book we have utilized the three filter algo-
rithms used by the authors in [11], which are C4.5, 1-NN and LDA [63]. The complete
process carried out by EF is described below:
• Split the training data set DT into Γ equal sized subsets.
• For each one of the μ filter algorithms:
– For each of these Γ parts, the filter algorithm is trained on the other Γ − 1 parts.
This results in Γ different classifiers.
– These Γ resulting classifiers are then used to tag each instance in the excluded
part as either correct or mislabeled, by comparing the training label with that
assigned by the classifier.
• At the end of the above process, each instance in the training data has been tagged
by each filter algorithm.
• Add to D N the noisy instances identified in DT using a voting scheme, taking into
account the correctness of the labels obtained in the previous step by the μ filter
algorithms. We use a consensus vote scheme in this case.
• Remove the noisy instances from the training set: DT ← DT \D N .
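A compact sketch of the EF procedure is given below, with scikit-learn stand-ins for the filter algorithms (a CART decision tree in place of C4.5, plus 1-NN and LDA); it approximates the steps above rather than the KEEL implementation, and the scheme argument switches between the consensus vote used by EF and the majority vote used later by CVCF and IPF.

```python
# Sketch only: ensemble-based noise filtering with a consensus or majority vote.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def ensemble_filter(X, y, n_folds=5, scheme="consensus"):
    filters = [DecisionTreeClassifier(), KNeighborsClassifier(n_neighbors=1),
               LinearDiscriminantAnalysis()]
    # cross_val_predict trains each filter on Gamma-1 folds and tags the held-out fold,
    # so every instance is tagged by a model that did not see it during training.
    mislabeled_votes = np.stack(
        [cross_val_predict(f, X, y, cv=n_folds) != y for f in filters])
    if scheme == "consensus":
        noisy = mislabeled_votes.all(axis=0)                     # flagged by every filter
    else:
        noisy = mislabeled_votes.sum(axis=0) > len(filters) / 2  # flagged by the majority
    return X[~noisy], y[~noisy]
```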
The Cross-Validated Committees Filter (CVCF) [89] uses ensemble methods in order
to preprocess the training set to identify and remove mislabeled instances in classi-
fication data sets. CVCF is mainly based on performing a Γ-FCV to split the full
training data and on building classifiers using decision trees in each training subset.
The authors of CVCF place special emphasis on using ensembles of decision trees
such as C4.5 because they think that this kind of algorithm works well as a filter for
noisy data.
The basic steps of CVCF are the following:
• Split the training data set DT into Γ equal sized subsets.
• For each of these Γ parts, a base learning algorithm is trained on the other Γ − 1
parts. This results in Γ different classifiers. We use C4.5 as base learning algorithm
in our experimentation as recommended by the authors.
• These Γ resulting classifiers are then used to tag each instance in the training
set DT as either correct or mislabeled, by comparing the training label with that
assigned by the classifier.
• Add to D N the noisy instances identified in DT using a voting scheme (the majority
scheme in our experimentation), taking into account the correctness of the labels
obtained in the previous step by the Γ classifiers built.
• Remove the noisy instances from the training set: DT ← DT \DN.
The Iterative-Partitioning Filter (IPF) [48] applies a similar partition-based scheme
iteratively, removing noisy instances in multiple iterations until a stopping criterion is
met. In each iteration, the training set DT is split into Γ equal sized parts and a
classifier is built on each of them; then:
• These Γ resulting classifiers are then used to tag each instance in the training
set DT as either correct or mislabeled, by comparing the training label with that
assigned by the classifier.
• Add to D N the noisy instances identified in DT using a voting scheme, taking
into account the correctness of the labels obtained in the previous step by the Γ
classifiers built. For the IPF filter we use the majority vote scheme.
• Add to DG a percentage y of the good data in DT . This step is useful when we deal
with large data sets because it helps to reduce them faster. We do not eliminate
good data with the IPF method in our experimentation (we set y = 0, so DG is
always empty), without any loss of generality.
• Remove the noisy instances and the good data from the training set: DT ←
DT \{D N ∪ DG }.
At the end of the iterative process, the filtered data is formed by the remaining
instances of DT and the good data of DG ; that is, DT ∪ DG .
A particularity of the voting schemes in IPF is the additional condition that a noisy instance
must also be misclassified by the model induced from the subset containing that instance.
Moreover, by varying the required number of filtering
iterations, the level of conservativeness of the filter can be varied under both schemes,
consensus and majority.
Apart from the three aforementioned filtering methods, we can find many more in
the specialized literature. We provide a summary of the most recent
and well-known ones in Table 5.1. For the sake of brevity, we will
not describe these methods in as much depth as in the previous sections.
A recent categorization of the different filtering procedures made by Frenay and
Verleysen [19] will be followed, as it matches our descriptions well.
Filtering the data also has one major drawback: some instances will be dropped from
the data sets, even if they are valuable. Instead of filtering the data set or modifying
the learning algorithm, we can use other approaches to diminish the effect of noise
on the learned model. In the case of labeled data, one powerful approach is to train
not a single classifier but several ones, taking advantage of their particular strengths.
In this section we provide a brief insight into classifiers that are known to be robust
to noise to a certain degree, even when the noise is not treated or cleansed. As
said in Sect. 5.1, C4.5 has been considered a paradigmatic robust learner against
noise. However, classical decision trees have also been considered sensitive to class
noise [74]. This instability has made them very suitable for ensemble methods. As a
countermeasure for this lack of stability some strategies can be used. The first one is
to carefully select an appropriate splitting criterion.
In [2] several measures are compared to minimize the impact of label noise in the
constructed trees, empirically showing that the imprecise info-gain measure is able
to improve the accuracy and to reduce the tree growth produced by the noise.
Another approach typically described as useful to deal with noise in decision trees
is the use of pruning. Pruning tries to stop the overfitting caused by overspecialization
on isolated (and usually noisy) examples. The work of [1] eventually
shows that the usage of pruning helps to reduce the effect and impact of the noise on
the built trees. C4.5 is the most famous decision tree learner; it includes this pruning
strategy by default and can easily be adapted to split under the desired criterion.
We have seen that the usage of ensembles is a good strategy to create accurate
and robust filters. One may also ask whether an ensemble of classifiers is itself robust
against noise.
Many ensemble approaches exist and their noise robustness has been tested. An
ensemble is a system where the base learners are all of the same type and are built to be as
varied as possible. The two most classic approaches, bagging and boosting, were compared
in [16], showing that bagging obtains better performance than boosting when
label noise is present. The reason given in [1] indicates that boosting (or the particular
implementation made by AdaBoost) increases the weights of noisy instances
too much, making the model construction inefficient and imprecise, whereas mislabeled
instances favour the variability of the base classifiers in bagging [19]. As
AdaBoost is not the only boosting algorithm, other implementations such as LogitBoost
and BrownBoost have been found to be more robust to class noise [64]. When the base
classifiers are different we talk of Multiple Classifier Systems (MCSs). They are thus
a generalization of the classic ensembles and they should offer better improvements
in noisy environments. They are tackled in Sect. 5.4.1.
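The following sketch reproduces the flavour of this comparison: bagging and AdaBoost are trained on the same data before and after injecting uniform class noise, and evaluated on a clean test set. The synthetic data set and the 20 % noise level are placeholders, not the experimental setup of the cited works.

```python
# Sketch only: bagging versus boosting on clean and label-noisy training data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier

X, y = make_classification(n_samples=1000, n_informative=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
flip = rng.random(len(y_tr)) < 0.20          # 20% uniform class noise (binary problem)
y_tr_noisy = y_tr.copy()
y_tr_noisy[flip] = 1 - y_tr_noisy[flip]

for name, clf in [("bagging", BaggingClassifier(DecisionTreeClassifier())),
                  ("boosting", AdaBoostClassifier())]:
    acc_clean = clf.fit(X_tr, y_tr).score(X_te, y_te)
    acc_noisy = clf.fit(X_tr, y_tr_noisy).score(X_te, y_te)
    print(f"{name}: clean={acc_clean:.3f} noisy={acc_noisy:.3f}")
```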
We can separate the labeled instances into several “bags” or groups, each one containing
only those instances belonging to the same class. This type of decomposition
is well suited to classifiers that can only work with binary classification problems,
but it has also been suggested that it can help to diminish the
effects of noise. The decomposition is expected to decrease the overlapping between
the classes and to limit the effect of noisy instances to their respective bags, simplifying
the problem and thus alleviating the effect of the noise with respect to considering the
whole data set at once.
Given a set of problems, finding the best overall classification algorithm is sometimes
difficult because some classifiers may excel in some cases and perform poorly in
others. Moreover, even though the optimal match between a learning method and
a problem is usually sought, this match is generally difficult to achieve and perfect
solutions are rarely found for complex problems [34, 36]. This is one reason for using
Multi-Classifier Systems [34, 36, 72], since it is not necessary to choose a specific
learning method. All of them might be used, taking advantage of the strengths of each
method, while avoiding its weaknesses. Furthermore, there are other motivations to
combine several classifiers [34]:
• To avoid the choice of some arbitrary but important initial conditions, e.g. those
involving the parameters of the learning method.
• To introduce some randomness to the training process in order to obtain different
alternatives that can be combined to improve the results obtained by the individual
classifiers.
• To use complementary classification methods to improve dynamic adaptation and
flexibility.
Several works have claimed that simultaneously using classifiers of different
types, complementing each other, improves classification performance on difficult
problems, such as satellite image classification [60], fingerprint recognition [68] and
foreign exchange market prediction [73]. Multiple Classifier Systems [34, 36, 72,
94] are presented as a powerful solution to these difficult classification problems,
because they build several classifiers from the same training data and therefore allow
the simultaneous usage of several feature descriptors and inference procedures. An
important issue when using MCSs is the way of creating diversity among the clas-
sifiers [54], which is necessary to create discrepancies among their decisions and
hence, to take advantage of their combination.
MCSs have been traditionally associated with the capability of working accu-
rately with problems involving noisy data [36]. The main reason supporting this
hypothesis could be the same as one of the main motivations for combining classi-
fiers: the improvement of the generalization capability (due to the complementarity
of each classifier), which is a key question in noisy environments, since it might
allow one to avoid the overfitting of the new characteristics introduced by the noisy
examples [84]. Most of the works studying MCSs and noisy data are focused on
techniques like bagging and boosting [16, 47, 56], which introduce diversity con-
sidering different samples of the set of training examples and use only one baseline
classifier. For example, in [16] the suitability of randomization, bagging and boosting
to improve the performance of C4.5 was studied. The authors reached the conclu-
sion that with a low noise level, boosting is usually more accurate than bagging and
randomization. However, bagging outperforms the other methods when the noise
level increases. Similar conclusions were obtained in the paper of Maclin and Opitz
[56]. Other works [47] compare the performance of boosting and bagging techniques
dealing with imbalanced and noisy data, also reaching the conclusion that bagging
methods generally outperform boosting ones. Nevertheless, explicit studies about
the adequacy of MCSs (different from bagging and boosting, that is, those introduc-
ing diversity using different base classifiers) to deal with noisy data have not been
carried out yet. Furthermore, most of the existing works are focused on a concrete
type of noise and on a concrete combination rule. On the other hand, when data
suffers from noise, a proper study on how the robustness of each single method
influences the robustness of the MCS is necessary, but this fact is usually overlooked
in the literature.
There are several strategies to use more than one classifier for a single classification
task [36]:
• Dynamic classifier selection This is based on the fact that one classifier may
outperform all others using a global performance measure but it may not be the
best in all parts of the domain. Therefore, these types of methods divide the input
domain into several parts and aim to select the classifier with the best performance
in that part.
• Multi-stage organization This builds the classifiers iteratively. At each iteration, a
group of classifiers operates in parallel and their decisions are then combined. A
dynamic selector decides which classifiers are to be activated at each stage based
on the classification performances of each classifier in previous stages.
• Sequential approach A classifier is used first and the other ones are used only if
the first does not yield a decision with sufficient confidence.
• Parallel approach All available classifiers are used for the same input example
in parallel. The outputs from each classifier are then combined to obtain the final
prediction.
Although the first three approaches have been explored to a certain extent, the
majority of classifier combination research focuses on the fourth approach, due to its
simplicity and the fact that it enables one to take advantage of the factors presented
in the previous section. For these reasons, this book focuses on the fourth approach.
Usually, the more classes in a problem, the more complex it is. In multi-class learn-
ing, the generated classifier must be able to separate the data into more than a pair
of classes, which increases the chances of incorrect classifications (in a two-class
balanced problem, the probability of a correct random classification is 1/2, whereas
in a multi-class problem it is 1/M). Furthermore, in problems affected by noise, the
class boundaries and the separability of the classes, and therefore the prediction
capabilities of the classifiers, may be severely hindered.
When dealing with multi-class problems, several works [6, 50] have demonstrated
that decomposing the original problem into several binary subproblems is an easy,
yet accurate way to reduce their complexity. These techniques are referred to as
binary decomposition strategies [55]. The most studied schemes in the literature are:
One-vs-One (OVO) [50], which trains a classifier to distinguish between each pair of
classes, and One-vs-All (OVA) [6], which trains a classifier to distinguish each class
from all other classes. Both strategies can be encoded within the Error Correcting
Output Codes framework [5, 17]. However, none of these works provide theoretical
or empirical results supporting the common assumption that decomposition techniques
behave better against noise than not using decomposition. Neither do they show what
type of noise is better handled by decomposition techniques.
Consequently, we can consider the usage of the OVO strategy, which generally
outperforms OVA [21, 37, 76, 83], and check its suitability with noisy training
data. It should be mentioned that, in real situations, the existence of noise in the
data sets is usually unknown; therefore, neither the type nor the quantity of noise
in the data set can be known or supposed a priori. Hence, tools which are able
to manage the presence of noise in the data sets, regardless of its type or quantity (or
its absence), are of great interest. If the OVO strategy (which is a simple yet effective
methodology when clean data sets are considered) is also able to handle the noise
properly (better than the baseline non-OVO version), its usage could be recommended
regardless of the presence of noise and without taking its type into account. Furthermore,
this strategy can be used with any of the existing classifiers which are able to deal
with two-class problems. Therefore, the problems of algorithm level modifications
and preprocessing techniques could be avoided; and, if desired, they could also be
combined.
The final output is derived from the score matrix by different aggregation models.
The most commonly used and simplest combination, also considered in the experi-
ments of this book, is the application of a voting strategy:
$$\text{Class} = \arg\max_{i=1,\dots,M} \sum_{1 \le j \ne i \le M} s_{ij} \qquad (5.4)$$

where $s_{ij}$ is 1 if $r_{ij} > r_{ji}$ and 0 otherwise. Therefore, the class with the largest number
of votes will be predicted. This strategy has proved to be competitive with different
classifiers obtaining similar results in comparison with more complex strategies [21].
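As an illustration of this aggregation, the following minimal Python sketch (our own, assuming scikit-learn and using a decision tree as a generic base learner; the function names are not part of any existing framework) trains one binary classifier per pair of classes and combines their outputs with the voting strategy of Eq. (5.4):

```python
from itertools import combinations

import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def train_ovo(X, y, base_classifier=DecisionTreeClassifier()):
    """Train one binary classifier per pair of classes (One-vs-One)."""
    classes = np.unique(y)
    pairwise = {}
    for ci, cj in combinations(classes, 2):
        mask = np.isin(y, [ci, cj])
        clf = clone(base_classifier)        # fresh copy of the base learner
        clf.fit(X[mask], y[mask])
        pairwise[(ci, cj)] = clf
    return classes, pairwise

def predict_ovo(x, classes, pairwise):
    """Aggregate the pairwise outputs with the voting strategy of Eq. (5.4)."""
    votes = {c: 0 for c in classes}
    for (ci, cj), clf in pairwise.items():
        winner = clf.predict(x.reshape(1, -1))[0]
        votes[winner] += 1                  # s_ij = 1 for the winner of the pair
    return max(votes, key=votes.get)        # class with the largest number of votes
```

In this sketch, the prediction of the classifier trained on the pair (i, j) plays the role of the comparison $r_{ij} > r_{ji}$ in Eq. (5.4).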
5.5 Empirical Analysis of Noise Filters and Robust Strategies

In this section we want to illustrate the advantages of the noise approaches described
above.
In the data sets we are going to use (taken from Chap. 2), as in most of the real-world
data sets, the initial amount and type of noise present is unknown. Therefore, no
assumptions about the base noise type and level can be made. For this reason, these
data sets are considered to be noise free, in the sense that no recognizable noise has
been introduced. In order to control the amount of noise in each data set and check
how it affects the classifiers, noise is introduced into each data set in a supervised
manner. Four different noise schemes proposed in the literature, as explained in
Sect. 5.2, are used in order to introduce a noise level x% into each data set:
1. Introduction of class noise.
• Uniform class noise [84] x% of the examples are corrupted. The class labels
of these examples are randomly replaced by another one from the M classes.
• Pairwise class noise [100, 102] Let X be the majority class and Y the second
majority class; an example with the label X has a probability of x/100 of
being incorrectly labeled as Y.
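As a minimal sketch of how these two class noise schemes can be injected in practice (the NumPy-based implementation and the function names are our own, not taken from the experimental framework used in this book):

```python
import numpy as np

def uniform_class_noise(y, noise_level, seed=0):
    """Corrupt a fraction noise_level of the labels, replacing each selected
    label with a different class chosen at random among the M classes."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y).copy()
    classes = np.unique(y)
    n_noisy = int(round(noise_level * len(y)))
    idx = rng.choice(len(y), size=n_noisy, replace=False)
    for i in idx:
        y[i] = rng.choice(classes[classes != y[i]])
    return y

def pairwise_class_noise(y, noise_level, seed=0):
    """Relabel each example of the majority class X as the second majority
    class Y with probability noise_level (x/100 in the text)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y).copy()
    classes, counts = np.unique(y, return_counts=True)
    order = np.argsort(counts)[::-1]
    major, second = classes[order[0]], classes[order[1]]
    flip = (y == major) & (rng.random(len(y)) < noise_level)
    y[flip] = second
    return y
```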
The usage of filtering is claimed to be useful in the presence of noise. This section
tries to show whether this claim is true or not and to what extent. As a simple but
representative case study, we show the results of applying noise filters based on
detecting and eliminating mislabeled training instances. We want to illustrate how
applying filters is a good strategy to obtain better results in the presence of even low
amounts of noise. As filters are mainly designed for class noise, we will focus on the
two types of class noise described in this chapter: the uniform class noise and the
pairwise class noise.
Three popular classifiers will be used to obtain the accuracy values: C4.5,
Ripper and SVM. Their selection is not made at random: SVMs are known to be
very accurate but also sensitive to noise. Ripper is a rule learning algorithm able to
perform reasonably well, but as we saw in Sect. 5.4 rule learners are also sensitive to
noise when they are not designed to cope with it. The third classifier is C4.5 with
its pruning strategy, which is known to diminish the effects of noise in the final
tree. Table 5.2 shows the average results for the three noise filters for each kind of
class noise studied. The amount of noise ranges from 5 to 20 %, enough to show the
differences between no filtering (labeled as “None”) and the noise filters. The results
shown are the average over all the data sets considered in order to ease the reading.
The evolution of the results and their tendencies can be better depicted by using a
graphical representation. Figure 5.4a shows the performance of SVM from an amount
of 0 % of controlled pairwise noise to the final 20 % introduced. The accuracy can be
seen to drop from an initial 90 % to 85 % by only corrupting 20 % of the class
labels. The degradation is even worse in the case of uniform class noise depicted
in Fig. 5.4b, as all the class labels can be affected.

Fig. 5.4 Accuracy over different amounts and types of noise. The different filters used are named
by their acronyms; “None” denotes the absence of any filtering. (a) SVM with pairwise noise, (b) SVM
with uniform noise, (c) Ripper with pairwise noise, (d) Ripper with uniform noise, (e) C4.5 with
pairwise noise, (f) C4.5 with uniform noise

The evolution of not using any noise filter, denoted by “None”, is remarkably different from the
lines that illustrate the usage of any noise filter. The IPF filter is slightly better than the others due
to its greater sophistication, but overall the use of filters is highly recommended.
Even in the case of 0 % of controlled noise, the noise already present is also cleansed,
allowing the filtering to improve even in this base case. Please note that the divergence
appears even in the 5 % case, showing that noise filtering is worth trying in low noise
frameworks.
Ripper obtains a lower overall accuracy than SVM, but the conclusions are akin:
the usage of noise filters is highly recommended as can be seen in Fig. 5.4c, d. It is
remarkable that not applying filtering for Ripper causes a fast drop in performance,
indicating that the rule base modeled is being largely affected by the noise. Thanks
to the use of the noise filter the inclusion of misleading rules is controlled, resulting
in a smoother drop in performance, even slower than that for SVM.
The last case is also very interesting. Since C4.5 is more robust against
noise than SVM and Ripper, the accuracy drop over the increment of noise is lower.
However, the use of noise filters is still recommended, as they improve both the initial
0 % case and the rest of the noise levels. The greatest differences between not filtering and
the use of any filter are found in uniform class noise (Fig. 5.4f). As we indicated
when describing the SVM case, uniform class noise is more disruptive, but the use
of filtering for C4.5 makes its performance comparable to the case of pairwise noise
(Fig. 5.4e).
Although not depicted here, the size of the C4.5 trees, the size of the Ripper rule base and the
number of support vectors of SVM are lower with the usage of noise filters when
the noise amount increases, resulting in a shorter time when evaluating examples for
classification. This is especially critical for SVM, whose evaluation times dramatically
increase with the number of selected support vectors.
In the previous Sect. 5.5.2 we have seen that the application of noise filters is
beneficial in most cases, especially when higher amounts of noise are present in
the data. However, applying a filter is not “free” in terms of computing time and
information loss. Indiscriminate application of noise filtering could be read as
the conclusion of the aforementioned example study, but it is worth studying
the noise filters’ behavior further and obtaining hints about whether filtering is
useful or not depending on the data at hand.
In an ideal case, only the examples that are completely wrong would be erased
from the data set. The truth is that both correct examples and examples containing valu-
able information may be removed, as the filters are ML techniques with their inher-
ent limitations. This fact implies that these techniques do not always provide an
improvement in performance. The success of these methods depends on several
circumstances, such as the kind and nature of the data errors, the quantity of noise
removed or the capabilities of the classifier to deal with the loss of useful information
related to the filtering. Therefore, the efficacy of noise filters, i.e., whether their use
causes an improvement in classifier performance, depends on the noise-robustness
and the generalization capabilities of the classifier used, but it also strongly depends
on the characteristics of the data.
Describing the characteristics of the data is not an easy task, as specifying what
“difficult” means is usually not straightforward or it simply does not depend on a sin-
gle factor. Data complexity measures are a recent proposal to represent characteristics
of the data that are considered difficult in classification tasks, e.g. the overlapping
among classes, their separability or the linearity of the decision boundaries.

Fig. 5.5 Using C4.5 to build a rule set to predict noise filtering efficacy

The most commonly used set of data complexity measures is the one gathered together
by Ho and Basu [35]. It consists of 12 metrics designed for binary classification
problems that numerically estimate the difficulty of 12 different aspects of the data.
Depending on the measure, either lower or higher values indicate a more difficult problem with
respect to that characteristic. Having a numeric description of the difficult aspects of the data
opens a new question: can we predict which characteristics are related to noise
and whether they will be successfully corrected by noise filters?
This prediction can help, for example, to determine an appropriate noise filter
for a concrete noisy data set (such a filter providing a significant advantage in terms
of the results) or to design new noise filters which select more or less aggressive
filtering strategies considering the characteristics of the data. Choosing a noise-
sensitive learner facilitates checking when a filter removes the appropriate
noisy examples, in contrast to a robust learner: the performance of classifiers built by
the former is more sensitive to noisy examples retained in the data set after the
filtering process.
A way to formulate rules that describe when it is appropriate to filter the data
follows the scheme depicted in Fig. 5.5. From an initial set of 34 data sets from
those described in Chap. 2, a large number of two-class data sets are obtained by
binarization, and their data complexity measures are computed. The filtering efficacy is
then compared by using 1-NN as a classifier to obtain the accuracy of filtering versus
not filtering. This comparison is carried out with a Wilcoxon signed-rank test. If
the statistical test yields differences favoring the filtering, the two-class data set is
labeled as appropriate for filtering, and as not appropriate otherwise. As a result, for
each binary data set we will have 12 data complexity measures and a label describing
whether the data set is eligible for filtering or not. A simple way to summarize this
information into a rule set is to use a decision tree (C4.5) using the 12 data complexity
values as the input features, and the appropriateness label as the class.
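A compact sketch of this meta-learning step is shown below. It assumes that, for each binary data set, the 12 complexity measures and the per-fold 1-NN accuracies with and without filtering have already been computed (the record structure and the significance level are our own choices); only wilcoxon and DecisionTreeClassifier, used here as a stand-in for C4.5, are actual library calls.

```python
import numpy as np
from scipy.stats import wilcoxon
from sklearn.tree import DecisionTreeClassifier

def build_filtering_efficacy_ruleset(meta_records, alpha=0.1):
    """meta_records: list of (measures, acc_filtered, acc_raw) tuples, where
    measures holds the 12 Ho-Basu complexity values of one binary data set and
    acc_* are per-fold 1-NN accuracies with / without the noise filter."""
    X_meta, y_meta = [], []
    for measures, acc_filtered, acc_raw in meta_records:
        # Wilcoxon signed-rank test on the paired fold accuracies
        _, p_value = wilcoxon(acc_filtered, acc_raw)
        favourable = p_value < alpha and np.mean(acc_filtered) > np.mean(acc_raw)
        X_meta.append(measures)
        y_meta.append(int(favourable))      # 1 = appropriate for filtering
    # An entropy-based decision tree plays the role of C4.5
    tree = DecisionTreeClassifier(criterion="entropy")
    tree.fit(np.array(X_meta), np.array(y_meta))
    return tree
```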
An important observation about the scheme presented in Fig. 5.5 is that, for every
label noise filter we want to consider, we will obtain a different set of rules. For
the sake of simplicity we will limit this illustrative study to the filters selected in
Sect. 5.3: EF, CVCF and IPF.
How accurate is the set of rules when predicting the suitability of label noise
filters? Using a 10-FCV over the data set obtained in the fourth step in Fig. 5.5, the
training and test accuracy of C4.5 for each filter is summarized in Table 5.3.
The test accuracy above 80 % in all cases indicates that the description obtained
by C4.5 is precise enough.

Table 5.3 C4.5 accuracy in training and test for the ruleset describing the adequacy of label noise filters

Noise filter    % Acc. training    % Acc. test
EF              0.9948             0.8176
CVCF            0.9966             0.8353
IPF             0.9973             0.8670

Using a decision tree is also interesting not only due to the generated rule set,
but also because we can check which data complexity measures (that is, the input
attributes) are selected first, and thus are considered as more important and discrim-
inant by C4.5. Averaging the rank of selection of each data complexity measure
over the 10 folds, Table 5.4 shows which complexity measures are the most dis-
criminating, and thus more interesting, for C4.5 to discern when a noise filter will
behave well or badly. Based on these rankings it is easy to observe that F2, N2, F1
and F3 are the predominant measures in the order of choice. Please remember that
behind these acronyms, the data complexity measures aim to describe one particu-
lar source of difficulty for any classification problem. Following the order from the
most important of these four outstanding measures to the least, the volume of overlap
region (F2) is key to describe the effectiveness of a class noise filter: the less the
attributes overlap, the better the filter is able to decide whether an instance is noisy.
It is complemented by N2, the ratio of the average intra-class to inter-class distance as defined by
the nearest neighbor rule: when the examples sharing the same class are closer than
the examples of other classes, the filtering is effective for 1-NN. This measure would be
expected to change if another classifier were chosen to build the classification model.
F1 and F3 are, like F2, measures of individual attribute overlapping, but they are
less important in general.
If the discriminant abilities of these complexity measures are as good as their ranks
indicate, using only these few measures we can expect to obtain a better and more
concise description of what an easy-to-filter problem is. In order to avoid studying
all the existing combinations of these metrics, the following experimentation
is mainly focused on the measures F2, N2 and F3, the most discriminative ones,
since the selection order can be considered more informative than the percentage results.
The incorporation of F1 into this set is also studied. The prediction capability of
the measure F2 alone, since it is the most discriminative one, is also shown. All these
results are presented in Table 5.5.
The use of the measure F2 alone to predict the noise filtering efficacy with good
performance can be discarded, since its results are not good enough compared with
the cases where more than one measure is considered. This fact reflects that the use
of single measures does not provide enough information to achieve a good filtering
efficacy prediction result. Therefore, it is necessary to combine several measures
which examine different aspects of the data. Adding the rest of selected measures
provides comparable results to those shown in Table 5.3 yet limits the complexity of
the rule set obtained.
The work carried out in this section is studied further in [77], showing how a
rule set obtained for one filter can be applied to other filters, how these rule sets are
validated with unseen data sets, and how the number of filters involved can be increased.
Table 5.5 Performance results of C4.5 predicting the noise filtering efficacy (measures used: F2,
N2, F3, and F1)

              F2                  F2-N2-F3-F1          F2-N2-F3
Noise filter  Training  Test      Training   Test      Training  Test
CVCF          1.0000    0.5198    0.9983     0.7943    0.9977    0.8152
EF            1.0000    0.7579    0.9991     0.8101    0.9997    0.8421
IPF           1.0000    0.7393    0.9989     0.8119    0.9985    0.7725
Mean          1.0000    0.6723    0.9988     0.8054    0.9986    0.8099
We will use three well-known classifiers to build the MCS employed in this illus-
trative section. SVM [14], C4.5 [75] and KNN [63] are chosen based on their good
performance in a large number of real-world problems. Moreover, they were selected
because these methods have a highly differentiated and well-known noise-robustness,
which is important in order to properly evaluate the performance of MCSs in the pres-
ence of noise. Considering the previous classifiers (SVM, C4.5 and KNN), a MCS
composed of three individual classifiers (SVM, C4.5 and 1-NN) is built. Therefore,
the MCS built with heterogeneous classifiers (MCS3-1) will contain a noise-robust
algorithm (C4.5), a noise-sensitive method (SVM) and a local, distance-dependent
method with a low tolerance to noise (1-NN).
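A rough scikit-learn equivalent of such a heterogeneous MCS is sketched below (an entropy-based decision tree stands in for C4.5, majority voting is used as the combination rule, and the parameter values are illustrative rather than those of the experiments reported here):

```python
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# MCS3-1: one noise-robust learner (C4.5-like tree), one noise-sensitive
# learner (SVM) and one local learner with low noise tolerance (1-NN).
mcs3_1 = VotingClassifier(
    estimators=[
        ("svm", SVC(kernel="rbf", C=1.0)),
        ("c45", DecisionTreeClassifier(criterion="entropy")),
        ("1nn", KNeighborsClassifier(n_neighbors=1)),
    ],
    voting="hard",  # majority vote over the three class predictions
)
# Usage: mcs3_1.fit(X_train, y_train); mcs3_1.predict(X_test)
```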
Table 5.6 shows the performance (top part of the table) and robustness (bottom part
of the table) results of each classification algorithm at each noise level on data sets with
class noise. Each one of these parts in the table (performance and robustness parts)
is divided into another two parts: one with the results of the uniform class noise
and another with the results of the pairwise class noise. A star ‘∗’ next to a p-value
indicates that the corresponding single algorithm obtains more ranks than the MCS
in Wilcoxon’s test comparing the individual classifier and the MCS. Note that the
robustness can only be measured if the noise level is higher than 0 %, so the robustness
results are presented from a noise level of 5 % and higher.

Table 5.6 Performance and robustness results on data sets with class noise

Uniform class noise
        Results                               p-values MCS3-1 vs.
x%      SVM      C4.5     1-NN     MCS3-1     SVM        C4.5       1-NN
0 %     83.25    82.96    81.42    85.42      5.20E-03   1.80E-03   7.10E-04
10 %    79.58    82.08    76.28    83.00      1.10E-04   3.90E-01   1.30E-07
From the raw results we can extract some interesting conclusions. If we consider
the performance results with uniform class noise we can observe that MCS3-1 is
statistically better than SVM, but in the case of C4.5 statistical differences are only
found at the lowest noise level. For the rest of the noise levels, MCS3-1 is statistically
equivalent to C4.5. Statistical differences are found between MCS3-1 and 1-NN for
all the noise levels, indicating that MCSs are especially suitable when taking noise-
sensitive classifiers into account.
In the case of pairwise class noise the conclusions are very similar. MCS3-1
statistically outperforms its individual components when the noise level is below
45 %, whereas it only performs statistically worse than SVM when the noise level
reaches 50 %. MCS3-1 obtains more ranks than C4.5 in
most of the cases; moreover, it is statistically better than C4.5 when the noise level
is below 15 %. Again MCS3-1 statistically outperforms 1-NN regardless of the level
of noise.
In uniform class noise MCS3-1 is significantly more robust than SVM up to a
noise level of 30 %. Both are equivalent from 35 % onwards, even though MCS3-1
obtains more ranks at 35–40 % and SVM at 45–50 %. The robustness of C4.5 exceeds
that of MCS3-1, as shown by the differences found. MCS3-1 is statistically
better than 1-NN. The robustness results with pairwise class noise present some
differences with respect to uniform class noise. MCS3-1 statistically outperforms SVM
up to a 20 % noise level, they are equivalent up to 45 %, and MCS3-1 is outperformed
by SVM at 50 %. C4.5 is statistically more robust than MCS3-1 (except in highly
affected data sets, 45–50 %), and the superiority of MCS3-1 against 1-NN is notable,
as it is statistically better at all noise levels.
It is remarkable that the uniform scheme is the most disruptive class noise for
the majority of the classifiers. The higher disruptiveness of the uniform class noise
in MCSs built with heterogeneous classifiers can be attributed to two main reasons:
(i) this type of noise affects all the output domain, that is, all the classes, to the
same extent, whereas the pairwise scheme only affects the two majority classes; (ii)
a noise level x% with the uniform scheme implies that exactly x% of the examples in
the data sets contain noise, whereas with the pairwise scheme, the number of noisy
examples for the same noise level x% depends on the number of examples of the
majority class $N_{maj}$; as a consequence, the global noise level in the whole data set
is usually lower. More specifically, the number of noisy examples can be computed
as $(x \cdot N_{maj})/100$.
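For instance, in a data set with 1,000 examples whose majority class covers 40 % of them, a pairwise noise level of x = 10 % corrupts (10 · 40)/100 = 4 % of the examples, that is, 40 instances, whereas the uniform scheme at the same level corrupts 100 instances.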
With the performance results in uniform class noise MCS3-1 generally outper-
forms its single classifier components. MCS3-1 is better than SVM and 1-NN,
whereas it only performs statistically better than C4.5 at the lowest noise levels.
In pairwise class noise MCS3-1 improves on SVM up to a 40 % noise level, it is better
than C4.5 at the lowest noise levels (these noise levels are lower than those of the
uniform class noise) and it also outperforms 1-NN. Therefore, the behavior of MCS3-
1 with respect to their individual components is better in the uniform scheme than
in the pairwise one.
Table 5.7 shows the performance and robustness results of each classification algo-
rithm at each noise level on data sets with attribute noise.
At first glance we can appreciate that the results on data sets with uniform attribute
noise are much worse than those on data sets with Gaussian noise for all the classifiers,
including MCSs. Hence, the most disruptive attribute noise is the uniform scheme. Under
this scheme, MCS3-1 outperforms SVM and 1-NN. However, with respect to C4.5,
MCS3-1 is significantly better only at the lowest noise levels (up to 10–15 %), and is
equivalent at the rest of the noise levels. With Gaussian attribute noise MCS3-1 is only
better than 1-NN and SVM, and better than C4.5 at the lowest noise levels (up to 25 %).
The robustness of MCS3-1 with uniform attribute noise does not outperform
that of its individual classifiers, as it is statistically equivalent to SVM and some-
times worse than C4.5. Regarding 1-NN, MCS3-1 performs better than 1-NN. When
focusing on Gaussian noise, the robustness results are better than those of the uniform
noise. The main difference in this case is that MCS3-1 is not statistically
worse than C4.5.

Table 5.7 Performance and robustness results on data sets with attribute noise

Uniform attribute noise
        Results                               p-values MCS3-1 vs.
x%      SVM      C4.5     1-NN     MCS3-1     SVM        C4.5       1-NN
0 %     83.25    82.96    81.42    85.42      5.20E-03   1.80E-03   7.10E-04
10 %    81.78    81.58    78.52    83.33      3.90E-03   8.30E-02   8.10E-06
5.5.4.3 Conclusions
The results obtained have shown that the MCSs studied do not always significantly
improve the performance of their single classification algorithms when dealing with
noisy data, although they do in the majority of cases (if the individual components
are not heavily affected by noise). The improvement depends on many factors, such
as the type and level of noise. Moreover, the performance of the MCSs built with
heterogeneous classifiers depends on the performance of their single classifiers, so
it is recommended that one studies the behavior of each single classifier before
building the MCS. Generally, the MCSs studied are more suitable for class noise
than for attribute noise. In particular, they perform better with the most disruptive
class noise scheme (the uniform one) and with the least disruptive attribute noise
scheme (the Gaussian one).
The robustness results show that the studied MCS built with heterogeneous clas-
sifiers will not be more robust than the most robust among their single classification
algorithms. In fact, the robustness can always be shown as an average of the robust-
ness of the individual methods. The higher the robustness of the individual classifiers,
the higher the robustness of the MCS.
In this section, the performance and robustness of the classification algorithms using
the OVO decomposition with respect to its baseline results when dealing with data
suffering from noise are analyzed. In order to investigate whether the decomposi-
tion is able to reduce the effect of noise or not, a large number of data sets are
created introducing different levels and types of noise, as suggested in the litera-
ture. Several well-known classification algorithms, with or without decomposition,
are trained with them in order to check when decomposition is advantageous. The
results obtained show that methods using the One-vs-One strategy lead to better per-
formances and more robust classifiers when dealing with noisy data, especially with
the most disruptive noise schemes. Section 5.5.5.1 is devoted to the study of the class
noise scheme, whereas Sect. 5.5.5.2 analyzes the attribute noise case.
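Throughout Tables 5.8 and 5.9, RLA stands for the relative loss of accuracy. Assuming the definition commonly used in the noise literature (and consistent with the values reported below), at a noise level of x% it is computed as

$$RLA_{x\%} = \frac{Acc_{0\%} - Acc_{x\%}}{Acc_{0\%}},$$

where $Acc_{0\%}$ is the test accuracy on the original data and $Acc_{x\%}$ the test accuracy at noise level x%, usually expressed as a percentage; lower RLA values therefore indicate a more robust behavior.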
Table 5.8 shows the test accuracy and RLA results for each classification algorithm
at each noise level along with the associated p-values between the OVO and the
non-OVO version from the Wilcoxon’s test. The few exceptions where the baseline
classifiers obtain more ranks than the OVO version in the Wilcoxon’s test are indicated
with a star next to the p-value.

Table 5.8 Test accuracy, RLA results and p-values on data sets with class noise. Cases where the
baseline classifiers obtain more ranks than the OVO version in the Wilcoxon’s test are indicated
with a star (*)

                 Uniform random class noise                      Pairwise class noise
                 C4.5           Ripper         5-NN              C4.5           Ripper         5-NN
x%               Base    OVO    Base    OVO    Base    OVO       Base    OVO    Base    OVO    Base    OVO

Test accuracy
0 %              81.66   82.70  77.92   82.15  82.10   83.45     81.66   82.70  77.92   82.15  82.10   83.45
10 %             80.50   81.71  71.30   79.86  81.01   82.56     80.94   81.86  75.94   80.71  81.42   82.82
20 %             78.13   80.27  66.71   77.35  79.55   81.36     79.82   81.03  74.77   79.62  79.41   81.01
30 %             75.22   78.87  62.91   74.98  77.21   79.82     78.49   79.26  73.38   78.05  75.29   76.81
40 %             71.10   76.88  58.32   72.12  73.82   76.83     76.17   76.91  71.60   76.19  69.89   71.65
50 %             64.18   73.71  53.79   67.56  68.04   72.73     63.63   63.52  67.11   65.78  64.02   65.52

p-values (test accuracy, Base vs. OVO)
0 %              -              -              -                 0.007          0.0002         0.093
10 %             0.0124         0.0001         0.0036            0.0033         0.0003         0.0137

RLA value
20 %             4.63    3.15   15.44   5.86   3.48    2.67      2.39    2.13   3.52    3.03   3.29    2.93
30 %             8.36    4.90   20.74   8.82   6.40    4.53      4.16    4.49   5.13    4.84   8.05    7.81
40 %             13.46   7.41   26.72   12.35  10.30   8.11      7.01    7.38   7.25    7.12   14.37   13.68
50 %             21.87   11.29  32.74   18.10  17.47   13.12     21.03   22.28  12.22   18.75  21.06   20.67

p-values (RLA, Base vs. OVO)
0 %              -              -              -                 -              -              -
10 %             0.5257         0.0001         0.1354            0.8721(*)      0.3317         0.3507
For random class noise the test accuracy of the methods using OVO is higher in
all the noise levels. Moreover, the low p-values show that this advantage in favor of
OVO is significant. The RLA values of the methods using OVO are lower than those
of the baseline methods at all noise levels. These differences are also statistically
significant, as reflected by the low p-values. Only at some very low noise levels (5 %
and 10 % for C4.5 and 5 % for 5-NN) are the results between the OVO and the non-OVO
versions statistically equivalent, but notice that the OVO decomposition does not
hinder the results; simply, the loss is not lower.
These results also show that OVO achieves more accurate predictions when deal-
ing with pairwise class noise; however, it is not as advantageous with C4.5 or
RIPPER as with 5-NN in terms of robustness when noise only affects one class.
For example, the behavior of RIPPER with this noise scheme can be related to the
hierarchical way in which the rules are learned: it starts learning rules of the class
with the lowest number of examples and continues learning those classes with more
examples. When introducing this type of noise, RIPPER might change its training
order, but the remaining part of the majority class can still be properly learned, since
it now has more priority. Moreover, the original second majority class, now with
noisy examples, will probably be the last one to be learned and it would depend on
how the rest of the classes have been learned. When the problem is decomposed with OVO,
a considerable number of classifiers (those involving the majority and the second majority
classes) will contain a notable quantity of noise and hence the tendency to predict the
original majority class decreases; when the noise level is high, this strongly affects
the accuracy, since the majority class has more influence on it.
In contrast with the rest of the noise schemes, with the pairwise noise scheme all the
data sets have different real percentages of noisy examples at the same noise level of
x%. This is because each data set has a different number of examples of the majority
class, and thus a noise level of x% does not affect all the data sets in the same way.
In this case, the percentage of noisy examples with a noise level of x% is computed
as $(x \cdot N_{maj})/100$, where $N_{maj}$ is the percentage of examples of the majority class.
In this section, the performance and robustness of the classification algorithms using
OVO in comparison to its non-OVO version when dealing with data with attribute
noise are analyzed. The test accuracy, RLA results and p-values of each classification
algorithm at each noise level are shown in Table 5.9.
In the case of uniform attribute noise it can be pointed out that the test accuracy
of the methods using OVO is always statistically better at all the noise levels. The
RLA values of the methods using OVO are lower than those of the baseline methods
at all noise levels, except in the case of C4.5 with a 5 % noise level. Regarding
the p-values, a clear tendency is observed: the p-value decreases when the noise level
increases with all the algorithms. With all methods (C4.5, RIPPER and 5-NN) the
p-values of the RLA results at the lowest noise levels (up to 20–25 %) show that
the robustness of OVO and non-OVO methods is statistically equivalent. From that
point on, the OVO versions statistically outperform the non-OVO ones. Therefore,
the usage of OVO is clearly advantageous in terms of accuracy and robustness when
noise affects the attributes in a random and uniform way. This behavior is particularly
notable with the highest noise levels, where the effects of noise are expected to be
more detrimental.

Table 5.9 Test accuracy, RLA results and p-values on data sets with attribute noise. Cases where
the baseline classifiers obtain more ranks than the OVO version in the Wilcoxon’s test are indicated
with a star (*)

                 Uniform random attribute noise                  Gaussian attribute noise
                 C4.5           Ripper         5-NN              C4.5           Ripper         5-NN
x%               Base    OVO    Base    OVO    Base    OVO       Base    OVO    Base    OVO    Base    OVO

Test accuracy
0 %              81.66   82.70  77.92   82.15  82.10   83.45     81.66   82.70  77.92   82.15  82.10   83.45
10 %             80.31   81.65  76.08   80.85  79.81   81.34     80.93   81.67  76.53   81.12  80.91   82.52
20 %             78.71   80.27  73.95   79.15  77.63   79.38     79.77   81.11  75.35   80.06  80.16   81.74
30 %             76.01   78.25  71.25   77.06  74.68   76.46     79.03   80.40  74.46   78.93  78.84   80.77
40 %             73.58   76.19  68.66   74.56  71.29   73.65     77.36   79.51  72.94   78.10  77.53   79.11
50 %             70.49   73.51  65.50   71.66  67.72   70.07     75.29   78.03  71.57   76.27  76.02   77.72

p-values (test accuracy, Base vs. OVO)
0 %              0.007          0.0002         0.093             0.007          0.0002         0.093
10 %             0.0169         0.0003         0.091             0.1262         0.0004         0.0064

RLA value
20 %             3.88    3.11   5.46    3.77   5.72    5.03      2.40    1.99   3.49    2.59   2.51    2.09
30 %             7.54    5.77   9.20    6.42   9.57    8.76      3.42    2.91   4.66    3.95   4.34    3.32
40 %             10.64   8.25   12.81   9.64   13.73   12.08     5.67    4.03   6.74    4.96   6.03    5.38
50 %             14.74   11.74  17.21   13.33  18.14   16.55     8.37    5.87   8.50    7.35   7.82    7.16

p-values (RLA, Base vs. OVO)
0 %              -              -              -                 -              -              -
10 %             0.4781         0.5755         1.0000(*)         0.0766(*)      0.8519(*)      0.4115
On the other hand, analyzing the Gaussian attribute noise results, the test accuracy
of the methods using OVO is better at all the noise levels. The low p-values show
that this advantage, also in favor of OVO, is statistically significant. With respect to
the RLA results, the p-values show a clear decreasing tendency when the noise level
increases in all the algorithms. In the case of C4.5, OVO is statistically better from a
35 % noise level onwards. RIPPER and 5-NN are statistically equivalent at all noise
levels, although 5-NN with OVO obtains higher Wilcoxon’s ranks.
Hence, the OVO approach is also suitable considering the accuracy achieved with
this type of attribute noise. The robustness results are similar between the OVO and
non-OVO versions with RIPPER and 5-NN. However, for C4.5 there are statistical
differences in favor of OVO at the highest noise levels. It is important to note that in
some cases, particularly in the comparisons involving RIPPER, some RLA results
show that OVO is better than the non-OVO version on average but the latter obtains
more ranks in the statistical test, even though these differences are not significant.
This is due to the extreme results of some individual data sets, such as led7digit
or flare, in which the RLA results of the non-OVO version are much worse than
those of the OVO version. In any case, we should notice that average results themselves
are not meaningful, and the corresponding non-parametric statistical analysis, which reflects
the real differences between the algorithms, must be carried out in order to extract
meaningful conclusions.
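As a reminder of how such a paired non-parametric comparison can be run in practice, a minimal SciPy sketch follows (the accuracy values are purely illustrative, not taken from the experiments):

```python
from scipy.stats import wilcoxon

# Paired test accuracies of the baseline and OVO versions over the same data sets
acc_base = [0.78, 0.81, 0.64, 0.90, 0.73, 0.85, 0.69, 0.77]
acc_ovo  = [0.80, 0.82, 0.70, 0.91, 0.76, 0.86, 0.72, 0.79]

# Wilcoxon signed-rank test on the paired differences
stat, p_value = wilcoxon(acc_base, acc_ovo)
print(f"W = {stat:.2f}, p-value = {p_value:.4f}")
```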
5.5.5.3 Conclusions
The results obtained have shown that the OVO decomposition improves the baseline
classifiers in terms of accuracy when data is corrupted by noise in all the noise
schemes shown in this chapter. The robustness results are particularly notable with
the more disruptive noise schemes (the uniform random class noise scheme and the
uniform random attribute noise scheme), where a larger number of noisy examples
with higher corruption levels are present, which produces greater differences (with
statistical significance).
In conclusion, we must emphasize that one usually does not know the type and
level of noise present in the data of the problem that is going to be addressed.
Decomposing a problem suffering from noise with OVO has shown better accuracy,
higher robustness and homogeneity in all the classification algorithms tested. For
this reason, the use of the OVO decomposition strategy in noisy environments can
be recommended as an easy-to-apply, yet powerful tool to overcome the negative
effects of noise in multi-class problems.
References
1. Abellán, J., Masegosa, A.R.: Bagging decision trees on data sets with classification noise. In:
Link S., Prade H. (eds.) FoIKS, Lecture Notes in Computer Science, vol. 5956, pp. 248–265.
Springer, Heidelberg (2009)
2. Abellán, J., Masegosa, A.R.: Bagging schemes on the presence of class noise in classification.
Expert Syst. Appl. 39(8), 6827–6837 (2012)
3. Aha, D.W., Kibler, D.: Noise-tolerant instance-based learning algorithms. In: Proceedings
of the 11th International Joint Conference on Artificial Intelligence, Vol. 1, IJCAI’89, pp.
794–799. Morgan Kaufmann Publishers Inc. (1989)
4. Alcalá-Fdez, J., Sánchez, L., García, S., del Jesus, M., Ventura, S., Garrell, J., Otero, J.,
Romero, C., Bacardit, J., Rivas, V., Fernández, J., Herrera, F.: KEEL: a software tool to assess
evolutionary algorithms for data mining problems. Soft Comput. Fus. Found. Methodol. Appl.
13, 307–318 (2009)
5. Allwein, E.L., Schapire, R.E., Singer, Y.: Reducing multiclass to binary: a unifying approach
for margin classifiers. J. Mach. Learn. Res. 1, 113–141 (2000)
6. Anand, R., Mehrotra, K., Mohan, C.K., Ranka, S.: Efficient classification for multiclass prob-
lems using modular neural networks. IEEE Trans. Neural Netw. 6(1), 117–124 (1995)
7. Angluin, D., Laird, P.: Learning from noisy examples. Mach. Learn. 2(4), 343–370 (1988)
8. Bonissone, P., Cadenas, J.M., Carmen Garrido, M., Díaz-Valladares, A.: A fuzzy random
forest. Int. J. Approx. Reason. 51(7), 729–747 (2010)
9. Bootkrajang, J., Kaban, A.: Multi-class classification in the presence of labelling errors. In:
ESANN 2011, 19th European Symposium on Artificial Neural Networks, Bruges, Belgium,
27–29 April 2011, Proceedings, ESANN (2011)
10. Brodley, C.E., Friedl, M.A.: Identifying and eliminating mislabeled training instances. In:
Clancey W.J., Weld D.S. (eds.) AAAI/IAAI, Vol. 1, pp. 799–805 (1996)
11. Brodley, C.E., Friedl, M.A.: Identifying mislabeled training data. J. Artif. Intell. Res. 11,
131–167 (1999)
12. Catal, C., Alan, O., Balkan, K.: Class noise detection based on software metrics and ROC
curves. Inf. Sci. 181(21), 4867–4877 (2011)
13. Chang, C.C., Lin, C.J.: A library for support vector machines. ACM Trans. Intell. Syst.
Technol. 2(3), 1–27 (2011)
14. Cortes, C., Vapnik, V.: Support vector networks. Mach. Learn. 20, 273–297 (1995)
15. Delany, S.J., Cunningham, P.: An analysis of case-base editing in a spam filtering system. In:
Funk P., González-Calero P.A. (eds.) ECCBR, pp. 128–141 (2004)
16. Dietterich, T.G.: An experimental comparison of three methods for constructing ensembles
of decision trees: bagging, boosting, and randomization. Mach. Learn. 40(2), 139–157 (2000)
17. Dietterich, T.G., Bakiri, G.: Solving multiclass learning problems via error-correcting output
codes. J. Artif. Intell. Res. 2(1), 263–286 (1995)
18. Du, W., Urahama, K.: Error-correcting semi-supervised pattern recognition with mode filter
on graphs. J. Adv. Comput. Intell. Intell. Inform. 15(9), 1262–1268 (2011)
19. Frenay, B., Verleysen, M.: Classification in the presence of label noise: a survey. Neural Netw.
Learn. Syst. IEEE Trans. 25(5), 845–869 (2014)
20. Fürnkranz, J.: Round robin classification. J. Mach. Learn. Res. 2, 721–747 (2002)
21. Galar, M., Fernández, A., Barrenechea, E., Bustince, H., Herrera, F.: An overview of ensemble
methods for binary classifiers in multi-class problems: experimental study on one-vs-one and
one-vs-all schemes. Pattern Recognit. 44(8), 1761–1776 (2011)
22. Galar, M., Fernández, A., Tartas, E.B., Sola, H.B., Herrera, F.: A review on ensembles for
the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Trans.
Syst. Man Cybern. Part C 42(4), 463–484 (2012)
23. Gamberger, D., Boskovic, R., Lavrac, N., Groselj, C.: Experiments with noise filtering in
a medical domain. In: Proceedings of the Sixteenth International conference on machine
learning, pp. 143–151. Morgan Kaufmann Publishers (1999)
24. Gamberger, D., Lavrac, N., Dzeroski, S.: Noise detection and elimination in data preprocess-
ing: experiments in medical domains. Appl. Artif. Intell. 14, 205–223 (2000)
25. García, V., Alejo, R., Sánchez, J., Sotoca, J., Mollineda, R.: Combined effects of class imbal-
ance and class overlap on instance-based classification. In: Corchado, E., Yin, H., Botti, V.,
Fyfe, C. (eds.) Intelligent Data Engineering and Automated Learning IDEAL 2006. Lecture
Notes in Computer Science, vol. 4224, pp. 371–378. Springer, Berlin (2006)
26. García, V., Mollineda, R., Sánchez, J.: On the k-NN performance in a challenging scenario
of imbalance and overlapping. Pattern Anal. Appl. 11(3–4), 269–280 (2008)
27. García, V., Sánchez, J., Mollineda, R.: An empirical study of the behavior of classifiers on
imbalanced and overlapped data sets. In: Rueda, L., Mery, D., Kittler, J. (eds.) CIARP 2007.
LNCS, vol. 4756, pp. 397–406. Springer, Heidelberg (2007)
28. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data
mining software: an update. SIGKDD Explor. Newsl. 11(1), 10–18 (2009)
29. Haralick, R.M.: The table look-up rule. Commun. Stat. Theory Methods A 5(12), 1163–1191
(1976)
30. Hart, P.E.: The condensed nearest neighbor rule. IEEE Trans. Inf. Theory 14, 515–516 (1968)
31. Hernández, M.A., Stolfo, S.J.: Real-world data is dirty: data cleansing and the merge/purge
problem. Data Min. Knowl. Discov. 2, 9–37 (1998)
32. Hernández-Lobato, D., Hernández-Lobato, J.M., Dupont, P.: Robust multi-class gaussian
process classification. In: Shawe-Taylor J., Zemel R.S., Bartlett P.L., Pereira F.C.N., Wein-
berger K.Q. (eds.) Advances in Neural Information Processing Systems 24: 25th Annual
Conference on Neural Information Processing Systems 2011. Proceedings of a meeting held
12–14 December 2011, Granada, Spain, NIPS, pp. 280–288 (2011)
33. Heskes, T.: The use of being stubborn and introspective. In: Ritter, H., Cruse, H., Dean, J.
(eds.) Prerational Intelligence: Adaptive Behavior and Intelligent Systems Without Symbols
and Logic, pp. 725–741. Kluwer, Dordrecht (2001)
34. Ho, T.K.: Multiple classifier combination: lessons and next steps. In: Bunke, H., Kandel, A. (eds.)
Hybrid Methods in Pattern Recognition, pp. 171–198. World Scientific, New York (2002)
35. Ho, T.K., Basu, M.: Complexity measures of supervised classification problems. IEEE Trans.
Pattern Anal. Mach. Intell. 24(3), 289–300 (2002)
36. Ho, T.K., Hull, J.J., Srihari, S.N.: Decision combination in multiple classifier systems. IEEE
Trans. Pattern Anal. Mach. Intell. 16(1), 66–75 (1994)
37. Hsu, C.W., Lin, C.J.: A comparison of methods for multiclass support vector machines. IEEE
Trans. Neural Netw. 13(2), 415–425 (2002)
38. Huang, Y.S., Suen, C.Y.: A method of combining multiple experts for the recognition of
unconstrained handwritten numerals. IEEE Trans. Pattern Anal. Mach. Intell. 17, 90–93 (1995)
39. Huber, P.J.: Robust Statistics. Wiley, New York (1981)
40. Hüllermeier, E., Vanderlooy, S.: Combining predictions in pairwise classification: an optimal
adaptive voting strategy and its relation to weighted voting. Pattern Recognit. 43(1), 128–142
(2010)
41. Japkowicz, N.: Class imbalance: are we focusing on the right issue? In: II Workshop on
learning from imbalanced data sets, ICML, pp. 17–23 (2003)
42. Jeatrakul, P., Wong, K., Fung, C.: Data cleaning for classification using misclassification
analysis. J. Adv. Comput. Intell. Intell. Inform. 14(3), 297–302 (2010)
43. Jo, T., Japkowicz, N.: Class Imbalances versus small disjuncts. SIGKDD Explor. 6(1), 40–49
(2004)
44. John, G.H.: Robust decision trees: removing outliers from databases. In: Fayyad, U.M., Uthu-
rusamy, R. (eds.) Proceedings of the First International Conference on Knowledge Discovery
and Data Mining (KDD-95), pp. 174–179. Montreal, Canada, August (1995)
45. Karmaker, A., Kwek, S.: A boosting approach to remove class label noise. Int. J. Hybrid Intell.
Syst. 3(3), 169–177 (2006)
46. Kermanidis, K.L.: The effect of borderline examples on language learning. J. Exp. Theor.
Artif. Intell. 21, 19–42 (2009)
47. Khoshgoftaar, T., Van Hulse, J., Napolitano, A.: Comparing boosting and bagging techniques
with noisy and imbalanced data. IEEE Trans. Syst. Man Cybern. Part A Syst. Hum. 41(3),
552–568 (2011)
48. Khoshgoftaar, T.M., Rebours, P.: Improving software quality prediction by noise filtering
techniques. J. Comput. Sci. Technol. 22, 387–396 (2007)
49. Klebanov, B.B., Beigman, E.: Some empirical evidence for annotation noise in a bench-
marked dataset. In: Human Language Technologies: The 2010 Annual Conference of the
North American Chapter of the Association for Computational Linguistics, HLT ’10, pp.
438–446. Association for Computational Linguistics (2010)
50. Knerr, S., Personnaz, L., Dreyfus, G.: Single-layer learning revisited: a stepwise procedure
for building and training a neural network. In: Fogelman Soulié F., Hérault J. (eds.) Neu-
rocomputing: Algorithms, Architectures and Applications, pp. 41–50. Springer, Heidelberg
(1990)
51. Knerr, S., Personnaz, L., Dreyfus, G., Member, S.: Handwritten digit recognition by neural
networks with single-layer training. IEEE Trans. Neural Netw. 3, 962–968 (1992)
52. Kubat, M., Matwin, S.: Addressing the curse of imbalanced training sets: one-sided selection. In:
Proceedings of the 14th International Conference on Machine Learning, pp. 179–186 (1997)
53. Kuncheva, L.: Combining Pattern Classifiers: Methods and Algorithms. Wiley, Chichester
(2004)
54. Kuncheva, L.I.: Diversity in multiple classifier systems. Inform. Fus. 6, 3–4 (2005)
55. Lorena, A., de Carvalho, A., Gama, J.: A review on the combination of binary classifiers in
multiclass problems. Artif. Intell. Rev. 30, 19–37 (2008)
56. Maclin, R., Opitz, D.: An empirical evaluation of bagging and boosting. In: Proceedings of the
fourteenth national conference on artificial intelligence and ninth conference on Innovative
applications of artificial intelligence, pp. 546–551 (1997)
57. Malossini, A., Blanzieri, E., Ng, R.T.: Detecting potential labeling errors in microarrays by
data perturbation. Bioinformatics 22(17), 2114–2121 (2006)
58. Mandler, E., Schuermann, J.: Combining the classification results of independent classifiers
based on the Dempster/Shafer theory of evidence. In: Gelsema E.S., Kanal L.N. (eds.) Pattern
Recognition and Artificial Intelligence, pp. 381–393. Amsterdam: North-Holland (1988)
59. Manwani, N., Sastry, P.S.: Noise tolerance under risk minimization. IEEE Trans. Cybern.
43(3), 1146–1151 (2013)
60. Maulik, U., Chakraborty, D.: A robust multiple classifier system for pixel classification of
remote sensing images. Fundamenta Informaticae 101(4), 286–304 (2010)
61. Mayoraz, E., Moreira, M.: On the decomposition of polychotomies into dichotomies (1996)
62. Mazurov, V.D., Krivonogov, A.I., Kazantsev, V.S.: Solving of optimization and identification
problems by the committee methods. Pattern Recognit. 20, 371–378 (1987)
63. McLachlan, G.J.: Discriminant Analysis and Statistical Pattern Recognition (Wiley Series in
Probability and Statistics). Wiley-Interscience, New York (2004)
64. Melville, P., Shah, N., Mihalkova, L., Mooney, R.J.: Experiments on ensembles with missing
and noisy data. In: Roli F., Kittler J., Windeatt T. (eds.) Multiple Classifier Systems, Lecture
Notes in Computer Science, vol. 3077, pp. 293–302. Springer, Heidelberg (2004)
65. Miranda, A.L.B., Garcia, L.P.F., Carvalho, A.C.P.L.F., Lorena, A.C.: Use of classification
algorithms in noise detection and elimination. In: Corchado E., Wu X., Oja E., Herrero
l., Baruque B. (eds.) HAIS, Lecture Notes in Computer Science, vol. 5572, pp. 417–424.
Springer, Heidelberg (2009)
66. Muhlenbach, F., Lallich, S., Zighed, D.A.: Identifying and handling mislabelled instances. J.
Intell. Inf. Syst. 22(1), 89–109 (2004)
67. Napierala, K., Stefanowski, J., Wilk, S.: Learning from imbalanced data in presence of noisy
and borderline examples. Rough Sets and Current Trends in Computing. LNCS, vol. 6086,
pp. 158–167. Springer, Berlin (2010)
68. Nath, R.K.: Fingerprint recognition using multiple classifier system. Fractals 15(3), 273–278
(2007)
69. Nettleton, D., Orriols-Puig, A., Fornells, A.: A Study of the Effect of Different Types of Noise
on the Precision of Supervised Learning Techniques. Artif. Intell. Rev. 33, 275–306 (2010)
70. Pérez, C.J., Girón, F.J., Martín, J., Ruiz, M., Rojano, C.: Misclassified multinomial data: a Bayesian
approach. RACSAM 101(1), 71–80 (2007)
71. Pimenta, E., Gama, J.: A study on error correcting output codes. In: Portuguese Conference
on Artificial Intelligence EPIA 2005, 218–223 (2005)
72. Polikar, R.: Ensemble based systems in decision making. IEEE Circ. Syst. Mag. 6(3), 21–45
(2006)
73. Qian, B., Rasheed, K.: Foreign exchange market prediction with multiple classifiers. J. Fore-
cast. 29(3), 271–284 (2010)
74. Quinlan, J.R.: Induction of decision trees. Mach. Learn. 1(1), 81–106 (1986)
75. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San
Francisco (1993)
76. Rifkin, R., Klautau, A.: In defense of one-vs-all classification. J. Mach. Learn. Res. 5, 101–141
(2004)
77. Sáez, J.A., Luengo, J., Herrera, F.: Predicting noise filtering efficacy with data complexity
measures for nearest neighbor classification. Pattern Recognit. 46(1), 355–364 (2013)
78. Sánchez, J.S., Barandela, R., Marqués, A.I., Alejo, R., Badenas, J.: Analysis of new techniques
to obtain quality training sets. Pattern Recognit. Lett. 24(7), 1015–1022 (2003)
79. Segata, N., Blanzieri, E., Delany, S.J., Cunningham, P.: Noise reduction for instance-based
learning with a local maximal margin approach. J. Intell. Inf. Syst. 35(2), 301–331 (2010)
80. Shapley, L., Grofman, B.: Optimizing group judgmental accuracy in the presence of interde-
pendencies. Pub. Choice 43, 329–343 (1984)
81. Smith, M.R., Martinez, T.R.: Improving classification accuracy by identifying and removing
instances that should be misclassified. In: IJCNN, pp. 2690–2697 (2011)
82. Sun, J., Zhao, F., Wang, C.J., Chen, S.: Identifying and correcting mislabeled training
instances. In: FGCN (1), pp. 244–250. IEEE (2007)
83. Sun, Y., Wong, A.K.C., Kamel, M.S.: Classification of Imbalanced Data: a Review. Int. J.
Pattern Recognit. Artif. Intell. 23(4), 687–719 (2009)
84. Teng, C.M.: Correcting Noisy Data. In: Proceedings of the Sixteenth International Conference
on Machine Learning, pp. 239–248. Morgan Kaufmann Publishers, San Francisco, USA
(1999)
85. Teng, C.M.: Polishing blemishes: Issues in data correction. IEEE Intell. Syst. 19(2), 34–39
(2004)
86. Thongkam, J., Xu, G., Zhang, Y., Huang, F.: Support vector machine for outlier detection in
breast cancer survivability prediction. In: Ishikawa Y., He J., Xu G., Shi Y., Huang G., Pang
C., Zhang Q., Wang G. (eds.) APWeb Workshops, Lecture Notes in Computer Science, vol.
4977, pp. 99–109. Springer (2008)
87. Titterington, D.M., Murray, G.D., Murray, L.S., Spiegelhalter, D.J., Skene, A.M., Habbema,
J.D.F., Gelpke, G.J.: Comparison of discriminant techniques applied to a complex data set of
head injured patients. J. R. Stat. Soc. Series A (General) 144, 145–175 (1981)
88. Tomek, I.: Two modifications of CNN. IEEE Trans. Syst. Man Cybern. 6(11), 769–772 (1976)
89. Verbaeten, S., Assche, A.V.: Ensemble methods for noise elimination in classification prob-
lems. In: Fourth International Workshop on Multiple Classifier Systems, pp. 317–325.
Springer, Heidelberg (2003)
90. Wang, R.Y., Storey, V.C., Firth, C.P.: A framework for analysis of data quality research. IEEE
Trans. Knowl. Data Eng. 7(4), 623–640 (1995)
91. Wemecke, K.D.: A coupling procedure for the discrimination of mixed data. Biometrics 48,
497–506 (1992)
92. Wheway, V.: Using boosting to detect noisy data. In: Revised Papers from the PRICAI 2000
Workshop Reader, Four Workshops Held at PRICAI 2000 on Advances in Artificial Intelli-
gence, pp. 123–132. Springer (2001)
93. Wilson, D.R., Martinez, T.R.: Instance pruning techniques. In: Proceedings of the Fourteenth
International Conference on Machine Learning, ICML ’97, pp. 403–411. Morgan Kaufmann
Publishers Inc. (1997)
94. Woźniak, M., Graña, M., Corchado, E.: A survey of multiple classifier systems as hybrid
systems. Inform. Fus. 16, 3–17 (2013)
95. Wu, T.F., Lin, C.J., Weng, R.C.: Probability estimates for multi-class classification by pairwise
coupling. J. Mach. Learn. Res. 5, 975–1005 (2004)
96. Wu, X.: Knowledge Acquisition From Databases. Ablex Publishing Corp, Norwood (1996)
97. Xu, L., Krzyzak, A., Suen, C.Y.: Methods of combining multiple classifiers and their appli-
cations to handwriting recognition. IEEE Trans. Syst. Man Cybern. 22(3), 418–435 (1992)
98. Zhang, C., Wu, C., Blanzieri, E., Zhou, Y., Wang, Y., Du, W., Liang, Y.: Methods for labeling
error detection in microarrays based on the effect of data perturbation on the regression model.
Bioinformatics 25(20), 2708–2714 (2009)
99. Zhong, S., Khoshgoftaar, T.M., Seliya, N.: Analyzing software measurement data with clus-
tering techniques. IEEE Intell. Syst. 19(2), 20–27 (2004)
100. Zhu, X., Wu, X.: Class noise vs. attribute noise: a quantitative study. Artif. Intell. Rev. 22,
177–210 (2004)
101. Zhu, X., Wu, X.: Class noise handling for effective cost-sensitive learning by cost-guided
iterative classification filtering. IEEE Trans. Knowl. Data Eng. 18(10), 1435–1440 (2006)
102. Zhu, X., Wu, X., Chen, Q.: Eliminating class noise in large datasets. In: Proceeding of the
Twentieth International Conference on Machine Learning, pp. 920–927 (2003)
103. Zhu, X., Wu, X., Chen, Q.: Bridging local and global data cleansing: Identifying class noise
in large, distributed data datasets. Data Min. Knowl. Discov. 12(2–3), 275–308 (2006)
104. Zhu, X., Wu, X., Yang, Y.: Error detection and impact-sensitive instance ranking in noisy
datasets. In: Proceedings of the Nineteenth National Conference on Artificial Intelligence,
pp. 378–383. AAAI Press (2004)
Chapter 6
Data Reduction
Abstract The most common tasks for data reduction carried out in Data Mining
consist of removing or grouping the data through the two main dimensions, examples
and attributes; and simplifying the domain of the data. A global overview of these
tasks is given in Sect. 6.1. One of the well-known problems in Data Mining is
the “curse of dimensionality”, related to the usually high number of attributes in the
data. Section 6.2 deals with this problem. Data sampling and data simplification are
introduced in Sects. 6.3 and 6.4, respectively, providing the basic notions on these
topics for further analysis and explanation in subsequent chapters of the book.
6.1 Overview
Currently, it is not difficult to imagine having at our disposal a data warehouse for analysis
which contains millions of samples, thousands of attributes and complex domains.
Such data sets are likely to be huge, so the data analysis and mining would take a long
time to give a response, making such an analysis infeasible or even impossible.
Data reduction techniques can be applied to obtain a reduced representation of
the data set, which is much smaller in volume but tries to keep most of the integrity of
the original data [11]. The goal is to provide the mining process with a mechanism
to produce the same (or almost the same) outcome when it is applied over the reduced
data instead of the original data, while at the same time making the mining more efficient.
In this section, we first present an overview of data reduction procedures. A closer
look at each individual technique will be provided throughout this chapter.
Basic data reduction techniques are usually categorized into three main families: DR, sample numerosity reduction and cardinality reduction.
DR reduces the number of attributes or random variables in the data set. DR methods include FS and feature extraction/construction (Sect. 6.2 and Chap. 7 of this book), in which irrelevant dimensions are detected, removed or combined. The transformation or projection of the original data onto a smaller space can be done by PCA (Sect. 6.2.1), factor analysis (Sect. 6.2.2), MDS (Sect. 6.2.3) and LLE (Sect. 6.2.4), which are the most relevant techniques proposed in this field.
6.2 The Curse of Dimensionality

A major problem in DM when dealing with large data sets with many potential predictor variables is the curse of dimensionality. Dimensionality becomes a serious obstacle for the efficiency of most DM algorithms because of their computational complexity. The term was coined by Richard Bellman [4] to describe the problem that arises as more variables are added to a model.
High dimensionality of the input increases the size of the search space exponentially and also increases the chance of obtaining invalid models. It is well known that the number of training samples required to obtain high quality models in DM grows linearly with dimensionality [8]. But when considering non-parametric learners, such as instance-based learners or decision trees, the situation is even more severe. It has been estimated that, as the number of dimensions increases, the sample size needs to increase exponentially in order to obtain an effective estimate of multivariate densities [13].
It is evident that the curse of dimensionality affects data differently depending on the subsequent DM task or algorithm. For example, techniques like decision trees could fail to provide meaningful and understandable results when the number of dimensions increases, although the speed of the learning stage is barely affected. On the contrary, instance-based learning depends heavily on dimensionality, which directly affects its efficiency.
In order to alleviate this problem, a number of dimension reducers have been
developed over the years. As linear methods, we can refer to factor analysis [18] and
PCA [7]. Nonlinear models include LLE [25], ISOMAP [26] and their derivatives. They are concerned with the transformation of the original variables into a smaller number of projections. The underlying assumptions are that the variables are numeric and that the dimensions can be expressed as combinations of the actual variables, and vice versa. Further analysis of this type of technique will be given in this chapter, especially for the two most popular techniques: PCA and LLE.
A set of methods are aimed at eliminating irrelevant and redundant features,
reducing the number of variables in the model. They belong to the FS family of
methods. They have the following immediate positive effects on the analysis and
mining:
• Speed up the processing of the DM algorithm.
• Improve data quality.
• Increase the performance of the DM algorithm.
• Make the results easier to understand.
Formally, the problem of FS can be defined as follows [14]: Let A be the original set of features, with cardinality m. Let d represent the desired number of features in the selected subset B, B ⊂ A. Let the FS criterion function for the set B be represented by J(B). Without any loss of generality, a lower value of J is considered to indicate a better feature subset; thus, J could represent the generalization error. The problem of FS is to find an optimal subset B that solves the following optimization problem:

min J(Z)   subject to   Z ⊂ A, |Z| = d

A brute force search would require examining all m! / (d! · (m − d)!) possible combinations
of the feature set A. A vast number of FS approaches, trends and applications have
been proposed over the years, and therefore FS deserves a complete chapter of this
book: Chap. 7.
Other widely used forms of DR also deserve to be described in this section. They are slightly more complicated than those previously seen, but are also widely used in conjunction with advanced DM approaches and real applications.
6.2.1 Principal Components Analysis

The basic idea is to find a set of linear transformations of the original variables which can describe most of the variance using a relatively small number of new variables. Hence, PCA searches for k n-dimensional orthogonal vectors that can best represent the data, where k ≤ n. The new attributes are derived in decreasing order of contribution, so that the first variable obtained, called the first principal component, contains the largest proportion of the variance of the original data set. Unlike FS, PCA combines the essence of the original attributes to form a new, smaller subset of attributes.
The usual procedure is to keep only the first few principal components, which may contain 95 % or more of the variance of the original data set. PCA is particularly useful when there are too many independent variables and they show high correlation.
The basic procedure is as follows:
• To normalize the input data, equalizing the ranges among attributes.
• To compute k orthonormal vectors to provide a basis for the normalized input data. These vectors point in directions perpendicular to each other and are called principal components. The original data are a linear combination of the principal components. In order to calculate them, the eigenvalues and eigenvectors of the covariance matrix of the sample data are needed.
• To sort the principal components according to their strength, given by their associated eigenvalues. The principal components serve as a new set of axes for the data, adjusted according to the variance of the original data. In Fig. 6.1, we show an illustrative example of the first two principal components for a given data set.
Fig. 6.1 PCA. X and Y are the first two principal components obtained
• To reduce the data by removing weaker components, with low variance. A reliable
reconstruction of the data could be possible by using only the strongest principal
components.
The final output of PCA is a new set of attributes representing the original data set.
The user would use only the first few of these new variables because they contain most
of the information represented in the original data. PCA can be applied to any type
of data. It is also used as a data visualization tool by reducing any multidimensional
data into two- or three-dimensional data.
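To make the procedure above concrete, the following is a minimal sketch of PCA via the eigendecomposition of the covariance matrix, written in Python with NumPy. It is an illustration under our own naming (e.g., n_components), not the implementation referenced in this book.

```python
import numpy as np

def pca(X, n_components):
    """Project X (n_samples x n_features) onto its first principal components."""
    # 1. Normalize: center each attribute.
    X_centered = X - X.mean(axis=0)
    # 2. Eigen-decompose the covariance matrix of the normalized data.
    cov = np.cov(X_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh: the covariance matrix is symmetric
    # 3. Sort components by decreasing eigenvalue (explained variance).
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # 4. Keep only the strongest components and project the data.
    W = eigvecs[:, :n_components]
    explained = eigvals[:n_components].sum() / eigvals.sum()
    return X_centered @ W, explained

# Toy usage: reduce 5-dimensional correlated data to 2 components.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=100)   # introduce correlation
Z, var_ratio = pca(X, n_components=2)
print(Z.shape, round(var_ratio, 3))
```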
6.2.2 Factor Analysis

Factor analysis is similar to PCA in the sense that it leads to the deduction of a new, smaller set of variables that practically describe the behaviour of the original data. Nevertheless, factor analysis is different because it does not seek to find transformations of the given attributes. Instead, its goal is to discover hidden factors in the current variables [17]. Although factor analysis has an important role as a data exploration process, we limit its description here to its use as a data reduction method.
In factor analysis, it is assumed that there is a set of unobservable latent factors z_j, j = 1, ..., k, which, acting together, generate the original data. The objective is to characterize the dependency among the variables by means of this smaller number of factors.
The basic idea behind factor analysis is to attempt to find a set of hidden factors so that the current attributes can be recovered by performing a set of linear transformations over these factors. Given the set of attributes a_1, a_2, ..., a_m, factor analysis attempts to find the set of factors f_1, f_2, ..., f_k so that

A − μ = LF + ε

where L holds the factor loadings and ε is the error term. Thus, the factor analysis problem can be stated as follows: given the attributes A, along with the mean μ, we endeavor to find the set of factors F and the associated loadings L such that the above equation holds.
Fig. 6.2 Factors are independent unit normals that are scaled, rotated and translated to compose
the inputs
6.2.3 Multidimensional Scaling

Let us assume N points, and that we know the distances between the pairs of points, d_ij, for all i, j = 1, ..., N. Moreover, we do not know the precise coordinates of the points, their dimensionality or the way the distances between them were computed. Multidimensional scaling (MDS) is the method for placing these points in a low-dimensional space such that a classical distance measure (like the Euclidean distance) between them is as close as possible to each d_ij. There must be a projection from some unknown dimensional space to another space whose number of dimensions is known.
One of the most typical examples of MDS is to draw an approximation of the map
that represents the travel distances between cities, knowing only the distance matrix.
Obviously, the outcome is distorted due to the differences between distances measured taking geographical obstacles into account and the straight-line distances between the cities. It is common for the map to be stretched out to accommodate longer distances, and the map is also centered on the origin. However, the solution is not unique: we can obtain any rotated view of it.
MDS belongs to the DR techniques because we can compute the distances between the actual data points in the d-dimensional space and then give this distance matrix as input to MDS, which projects the points into a lower-dimensional space so as to preserve these distances.
Formally, let us say we have a sample X = {x^t}, t = 1, ..., N, as usual, where x^t ∈ R^d. For two points r and s, the squared Euclidean distance between them is

d²_rs = ‖x^r − x^s‖² = Σ_{j=1}^{d} (x^r_j − x^s_j)² = Σ_{j=1}^{d} (x^r_j)² − 2 Σ_{j=1}^{d} x^r_j x^s_j + Σ_{j=1}^{d} (x^s_j)² = b_rr + b_ss − 2b_rs

where b_rs is defined as

b_rs = Σ_{j=1}^{d} x^r_j x^s_j

To constrain the solution, we center the data at the origin and assume

Σ_{t=1}^{N} x^t_j = 0,   for all j = 1, ..., d

Defining T = Σ_t b_tt = Σ_t Σ_j (x^t_j)², and summing the squared distances over r, over s, and over both, we get

Σ_r d²_rs = T + N·b_ss,    Σ_s d²_rs = N·b_rr + T,    Σ_r Σ_s d²_rs = 2NT

When we define

d²_·s = (1/N) Σ_r d²_rs,    d²_r· = (1/N) Σ_s d²_rs,    d²_·· = (1/N²) Σ_r Σ_s d²_rs

we obtain

b_rs = (1/2)(d²_r· + d²_·s − d²_·· − d²_rs)
Having now calculated b_rs, and knowing that B = XXᵀ, we look for an approximation. We know from the spectral decomposition that X = CD^(1/2) can be used as an approximation for X, where C is the matrix whose columns are the eigenvectors of B and D^(1/2) is a diagonal matrix with the square roots of the eigenvalues on its diagonal. Looking at the eigenvalues of B, we decide on a dimensionality k lower than d. Let us say c_j are the eigenvectors with λ_j the corresponding eigenvalues; note that each c_j is N-dimensional. Then we obtain the new dimensions as

z^t_j = √λ_j · c^t_j,   j = 1, ..., k,   t = 1, ..., N

That is, the new coordinates of instance t are given by the tth elements of the eigenvectors c_j, j = 1, ..., k, after normalization.
In [5], it has been shown that the eigenvalues of XXT (N ×N ) are the same as those
of XT X (d × d) and the eigenvectors are related by a simple linear transformation.
This shows that PCA does the same work as MDS, and does it more cheaply.
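The derivation above translates almost directly into code. Below is a small sketch of classical MDS (our own helper names, not taken from the book) that builds B from a distance matrix via the double-centering formula and embeds the points using the top eigenvectors scaled by the square roots of their eigenvalues.

```python
import numpy as np

def classical_mds(D, k=2):
    """Embed points in k dimensions given an N x N matrix of pairwise distances D."""
    D2 = D ** 2
    # b_rs = 1/2 (d2_r. + d2_.s - d2_.. - d2_rs)   (double centering)
    row_mean = D2.mean(axis=1, keepdims=True)
    col_mean = D2.mean(axis=0, keepdims=True)
    total_mean = D2.mean()
    B = 0.5 * (row_mean + col_mean - total_mean - D2)
    # Spectral decomposition of B; keep the k largest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:k]
    lam = np.maximum(eigvals[order], 0.0)        # guard against tiny negative eigenvalues
    # z^t_j = sqrt(lambda_j) * c^t_j
    return eigvecs[:, order] * np.sqrt(lam)

# Usage: recover 2-D coordinates (up to rotation) from Euclidean distances.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Z = classical_mds(D, k=2)
print(Z.shape)
```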
In the general case, we want to find a mapping z = g(x|θ ), where z ∈ Rk ,
x ∈ Rd , and g(x|θ ) is the mapping function from d to k dimensions defined up to a
set of parameters θ. The classical MDS discussed previously corresponds to a linear transformation

z = g(x|W) = Wᵀx
but in the general case a nonlinear mapping can also be used: this is called Sammon mapping, and the normalized error of the mapping is called the Sammon stress.
In the case of classification, the class information can be included in the distance as
d′_rs = (1 − α)·d_rs + α·c_rs

where c_rs is the "distance" between the classes to which x^r and x^s belong. This interclass distance should be supplied subjectively, and α can be optimized using CV.
6.2.4 Locally Linear Embedding

Locally Linear Embedding (LLE) recovers global nonlinear structure from locally linear fits [25]. Its main idea is that each local patch of the manifold can be approximated linearly and, given enough data, each point can be written as a linear, weighted sum of its neighbors.
The LLE algorithm is based on simple geometric intuitions. Suppose the data consist of N real-valued vectors X_i, each of dimensionality D, sampled from some smooth underlying manifold. Each data point and its neighbors are expected to lie on or close to a locally linear patch of the manifold. The local geometry of these patches can be characterized by linear coefficients that reconstruct each data point from its neighbors. In the simplest formulation of LLE, the K nearest neighbors of each data point are identified, as measured by Euclidean distance. Reconstruction errors are then measured by the cost function:
ε(W) = Σ_i ‖ X_i − Σ_j W_ij X_j ‖²
which adds up the squared distances between all the data points and their reconstructions. The weights W_ij summarize the contribution of the jth data point to the ith reconstruction. To compute the weights W_ij, it is necessary to minimize the cost function subject to two constraints: first, each data point X_i is reconstructed only from its neighbors, enforcing W_ij = 0 if X_j does not belong to this set; second, the rows of the weight matrix sum to one: Σ_j W_ij = 1. The optimal weights W_ij subject to these constraints are found by solving a least squares problem.
The constrained weights that minimize these reconstruction errors are invariant to rotations, scalings, and translations of that data point and its neighbors. Suppose the data lie on or near a smooth nonlinear manifold of dimensionality d ≪ D. To a good approximation, then, there exists a linear mapping that maps the high dimensional coordinates of each neighborhood to global internal coordinates on the manifold. By design, the reconstruction weights W_ij reflect intrinsic geometric properties of the data that are invariant to exactly such transformations. We therefore expect their characterization of local geometry in the original data space to be equally valid for local patches on the manifold. In particular, the same weights W_ij that reconstruct the ith data point in D dimensions should also reconstruct its embedded manifold coordinates in d dimensions.
LLE constructs a neighborhood preserving mapping based on the above idea. In
the final step of the algorithm, each high dimensional observation Xi is mapped to a
low dimensional vector Yi representing global internal coordinates on the manifold.
This is done by choosing d-dimensional coordinates Yi to minimize the embedding
cost function:
Φ(Y) = Σ_i ‖ Y_i − Σ_j W_ij Y_j ‖²
This cost function, like the previous one, is based on locally linear reconstruction
errors, but here, the weights Wi j are fixed while optimizing the coordinates Yi .
Now, the embedding cost can be minimized by solving a sparse N × N eigenvector
problem, whose bottom d non-zero eigenvectors provide an ordered set of orthogonal
coordinates centered on the origin.
It is noteworthy that while the reconstruction weights for each data point are
computed from its local neighborhood, the embedding coordinates are computed by
an N × N eigensolver, a global operation that couples all data points in connected
components of the graph defined by the weight matrix. The different dimensions in
the embedding space can be computed successively; this is done simply by computing
the bottom eigenvectors of the previous equation one at a time. But the computation is
always coupled across data points. This is how the algorithm leverages overlapping
local information to discover global structure. Implementation of the algorithm is
fairly straightforward, as the algorithm has only one free parameter: the number of
neighbors per data point, K .
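In practice LLE is rarely re-implemented from scratch. A hedged sketch of how it might be applied through scikit-learn's manifold module (assuming that library is available) follows, with K, the single free parameter mentioned above, passed as n_neighbors.

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

# Synthetic data sampled from a nonlinear manifold (an "S"-like curve in 3-D).
rng = np.random.default_rng(2)
t = 3 * np.pi * (rng.random(500) - 0.5)
height = 2 * rng.random(500)
X = np.column_stack([np.sin(t), height, np.sign(t) * (np.cos(t) - 1)])

# n_neighbors is K, the only free parameter of the basic algorithm;
# n_components is the target dimensionality d of the embedding.
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2)
Y = lle.fit_transform(X)          # rows of Y are the embedded coordinates Y_i
print(Y.shape, lle.reconstruction_error_)
```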
6.3 Data Sampling

Sampling is used to ease the analysis and modeling of large data sets. In DM, data sampling serves four main purposes:
• To reduce the number of instances submitted to the DM algorithm. In many cases, predictive learning can operate with 10–20 % of the cases without a significant deterioration of performance; beyond that point, adding more cases tends to yield little additional benefit. In descriptive analysis, however, it is better to have as many cases as possible.
• To support the selection of only those cases in which the response is relatively homogeneous. When you have data sets where different trends are clearly observable or the examples can be easily separated, you can partition the data for different types of modelling. For instance, imagine learning the approval decision for bank loans depending on some economic characteristics of a set of customers. If the data include consumer loans and mortgages, it seems logical to partition both types of loans because the parameters and quantities involved in each one are completely different. Thus, it is a good idea to build separate models on each partition.
• To assist with the balance of data and the occurrence of rare events. Predictive DM algorithms like ANNs or decision trees are very sensitive to imbalanced data sets. An imbalanced data set is one in which one category of the target variable is less represented than the others and, usually, this category is the more important one from the point of view of the learning task. Balancing the data involves sampling the rare categories more often (over-sampling) or sampling the common ones less often (under-sampling) [3].
• To divide a data set into three data sets to carry out the subsequent analysis of DM
algorithms. As we have described in Chap. 2, the original data set can be divided
into the training set and the testing set. A third kind of division can be performed within the training set, to help the DM algorithm avoid model over-fitting, which is a very common strategy for ANNs and decision trees. This partition is usually known as the validation set although, in various sources, it may be denoted interchangeably as the testing set [22]. Whatever the nomenclature used, some learners
require an internal testing process and, in order to evaluate and compare a set
of algorithms, there must be an external testing set independent of training and
containing unseen cases.
Various forms of data sampling are known in data reduction. Suppose that a large
data set, T , contains N examples. The most common ways that we could sample T
for data reduction are [11, 24]:
• Simple random sample without replacement (SRSWOR) of size s: This is
created by drawing s of the N tuples from T (s < N ), where the probability
of drawing any tuple in T is 1/N , that is, all examples have equal chance to be
sampled.
• Simple random sample with replacement (SRSWR) of size s: This is similar
to SRSWOR, except that each time a tuple is drawn from T , it is recorded and
replaced. In other words, after an example is drawn, it is placed back in T and it
may be drawn again.
• Balanced sample: The sample is designed according to a target variable and is forced to have a certain composition according to a predefined criterion. For example, 90 % of customers who are 21 years old or older, and 10 % of customers who are younger than 21 years old. One of the most successful applications of this type of sampling has been shown in imbalanced learning, as we have mentioned before.
• Cluster sample: If the tuples in T are grouped into G mutually disjoint groups or clusters, then an SRS of s clusters can be obtained, where s < G. For example, in spatial data sets, we may choose to define clusters geographically based on how closely different areas are located.
• Stratified sample: If T is divided into mutually disjoint parts called strata, a stratified sample of T is generated by obtaining an SRS at each stratum.
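The sampling schemes above are easy to reproduce with standard tooling. The following sketch (with our own column names and assuming pandas is available) illustrates SRSWOR, SRSWR and a stratified sample.

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(3)
T = pd.DataFrame({
    "age": rng.integers(18, 70, size=1000),
    "loan_type": rng.choice(["consumer", "mortgage"], size=1000, p=[0.8, 0.2]),
})

s = 100
# SRSWOR: each tuple drawn at most once, equal probability 1/N.
srswor = T.sample(n=s, replace=False, random_state=0)
# SRSWR: tuples are placed back after being drawn, so they may repeat.
srswr = T.sample(n=s, replace=True, random_state=0)
# Stratified sample: an SRS inside each stratum (here, each loan type).
stratified = (
    T.groupby("loan_type", group_keys=False)
     .apply(lambda stratum: stratum.sample(frac=0.1, random_state=0))
)
print(len(srswor), len(srswr), stratified["loan_type"].value_counts().to_dict())
```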
The selection of a small representative subset from a very large data set is known as
data condensation. In some sources of DM, such as [22], this form of data reduction
is differentiated from others. In this book, data condensation is integrated as one of
the families of IS methods (see Chap. 8).
Data condensation emerges from the fact that naive sampling methods, such as
random sampling or stratified sampling, are not suitable for real-world problems
with noisy data since the performance of the algorithms may change unpredictably
and significantly. The data sampling approach practically ignores all the information
present in the samples which are not chosen in the reduced subset.
Most of the data condensation approaches are studied on classification-based
tasks, and in particular, for the KNN algorithm. These methods attempt to obtain a
minimal consistent set, i.e., a minimal set which correctly classifies all the original
examples. The very first method of this kind was the condensed nearest neighbor
rule (CNN) [12]. For a survey on data condensation methods for classification, we
again invite the reader to consult Chap. 8 of this book.
Regarding the data condensation methods which are not tied to classification tasks, termed generic data condensation, condensation is performed by vector quantization, such as the well-known self-organizing map [19], and by different forms of data clustering. Another group of generic data condensation methods relies on density-based techniques, which consider the density function of the data for the purpose of condensation instead of minimizing the quantization error. These approaches do not involve any learning process and, hence, are deterministic (i.e., for a concrete input data set, the output condensed set is fixed). Clear examples of this kind of approach are presented in [10, 21].
A data squashing method seeks to compress, or "squash", the data in such a way that a statistical analysis carried out on the compressed data obtains the same outcome as the one obtained with the original data set; that is, the statistical information is preserved.
The first approach to data squashing was proposed in [6] and termed DS, as a solution for constructing a reduced data set. The DS approach to squashing is model-free and relies on moment matching. The squashed data set consists of a set of artificial data points chosen to replicate the moments of the original data within subsets of the actual data. DS studies various approaches to partitioning and ordering the moments and also provides a theoretical justification of the method by considering a Taylor series expansion of an arbitrary likelihood function. Since it relies upon the moments of the data, it should work well for any application in which the likelihood is well approximated by the first few terms of a Taylor series. In practice, it has only been validated with logistic regression.
In [20], the authors proposed the “likelihood-based data squashing” (LDS). LDS
is similar to DS because it first partitions the data set and then chooses artificial data
points corresponding to each subset of the partition. Nevertheless, the algorithms
differ in how they build the partition and how they build the artificial data points. The
DS algorithm partitions the data along certain marginal quartiles, and then matches
moments. The LDS algorithm partitions the data using a likelihood-based clustering
and then selects artificial data points so as to mimic the target sampling or posterior
distribution. Both algorithms yield artificial data points with associated weights. The
usage of squashed data requires algorithms that can use these weights conveniently.
LDS is slightly more general than DS because it is also prepared for ANN-based
learning.
A subsequent approach described in [23] presents a form of data squashing based
on empirical likelihood. This method re-weights a random sample of data to match
certain expected values to the population. The benefits of this method are the reduc-
tion of optimization cost in terms of computational complexity and the interest in
enhancing the performance of boosted random trees.
Clustering algorithms partition the data examples into groups, or clusters, so that data samples within a cluster are "similar" to one another and different from the data examples that belong to other clusters. The similarity is usually defined by means of how near the examples are in space, according to a distance function. The quality of a cluster could be measured as a function of its diameter, which is the maximum distance between any two samples belonging to the cluster. The average distance of each object within the cluster to the centroid is an alternative measure of cluster quality. An illustration of the derivation of three clusters from a set of 2-D data points is depicted in Fig. 6.3.
In terms of data reduction, the cluster representations of the data are used instead of the actual data. In many applications, such as those in which data can be organized into distinct groups, this technique is highly effective.
There is a vast number of clustering techniques for defining clusters and for
measuring their quality. In fact, clustering is surely the most popular and common
form of unsupervised learning in DM, as we have mentioned in Chap. 1 of this book. We have included it here due to the clear overlap that clustering has with data reduction. However, this book is not specifically devoted to learning methods and a deep study of clustering is beyond its scope; the reader may consult the following references for an in-depth study: [1, 2, 9, 11, 15, 16, 27].
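As a hedged illustration of using cluster representations instead of the actual data, the sketch below replaces a data set by the centroids of a k-means clustering (scikit-learn assumed available); each centroid can optionally be weighted by the size of its cluster.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
# Three well-separated 2-D groups, as in the illustrative figure.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(200, 2))
               for c in [(0, 0), (3, 0), (1.5, 2.5)]])

k = 3
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
centroids = km.cluster_centers_                       # reduced representation
weights = np.bincount(km.labels_, minlength=k)        # how many points each centroid stands for

# 600 original points are now summarized by 3 weighted prototypes.
print(X.shape, "->", centroids.shape, weights.tolist())
```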
6.4 Binning and Reduction of Cardinality
Binning is the process of converting a continuous variable into a set of ranges. Each range can then be treated as a category, optionally imposing an order on them; this last choice depends on the further analysis to be made on the data. For example, we can bin the variable representing the annual income of a customer into ranges of 5,000 dollars (0–5,000; 5,001–10,000; 10,001–15,000, etc.). In a business problem, such a binning could reveal that customers in the first range are less likely to obtain a loan than customers in the last range, by grouping them within intervals that bound the numerical variable. This also shows that keeping the strict order of the bins is not always necessary.
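A quick way to perform such a binning is sketched here with pandas; the bin edges and labels are our own choice for the income example above.

```python
import pandas as pd
import numpy as np

income = pd.Series([1200, 4800, 7300, 10500, 14900, 52000, 8100])

# Fixed-width bins of 5,000 dollars; everything above 15,000 is grouped together.
edges = [0, 5000, 10000, 15000, np.inf]
labels = ["0-5,000", "5,001-10,000", "10,001-15,000", ">15,000"]
income_bin = pd.cut(income, bins=edges, labels=labels)

print(income_bin.tolist())
# The result is an ordered categorical variable; the order can be ignored
# if the subsequent analysis treats the bins as plain categories.
```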
Cardinality reduction of nominal and ordinal variables is the process of combining two or more categories into one new category. It is well known that nominal variables with a high number of categories are very problematic to handle. If we transform these large-cardinality variables into indicator variables, that is, binary variables that indicate whether or not a category is set for each example, we will produce a large number of new variables, almost all equal to zero. On the other hand, if we do not perform this conversion and use them just as they are with an algorithm that can tolerate them, such as decision trees, we run into the problem of over-fitting the model. It is therefore realistic to consider reducing the number of categories in such variables.
Both processes are common transformations used to achieve two objectives:
• Reduce the complexity of the independent and possibly the dependent variables.
• Improve the predictive power of the variable, by carefully binning or grouping the categories in such a way that we model the dependencies with respect to the target variable in both estimation and classification problems.
Binning and cardinality reduction are very similar procedures, differing only in the type of variable that we want to process. In fact, both processes are commonly grouped under the term discretization, which is the most popular name in the literature. It is also very common to distinguish between binning and discretization depending on the complexity of the process performed: binning is usually associated with a quick and easy discretization of a variable. In [11], the authors distinguish among three types of discretization: binning, histogram analysis-based and advanced discretization. The first corresponds to a splitting technique based on the specification of the number of bins; the second family is related to unsupervised discretization; and finally, a brief inspection of the remaining methods is given.
Regardless of the above, and under the discretization nomenclature, we will dis-
cuss all related issues and techniques in Chap. 9 of this book.
References
1. Aggarwal, C., Reddy, C.: Data clustering: recent advances and applications. Chapman and
Hall/CRC Data Mining and Knowledge Discovery Series. Taylor & Francis Group, Boca Raton
(2013)
2. Aggarwal, C.C., Reddy, C.K. (eds.): Data Clustering: Algorithms and Applications. CRC Press,
New York (2014)
3. Batista, G.E.A.P.A., Prati, R.C., Monard, M.C.: A study of the behavior of several methods for
balancing machine learning training data. SIGKDD Explor. Newsl. 6(1), 20–29 (2004)
4. Bellman, R.E.: Adaptive control processes—a guided tour. Princeton University Press, Prince-
ton (1961)
5. Chatfield, C., Collins, A.J.: Introduction to Multivariate Analysis. Chapman and Hall, London
(1980)
6. DuMouchel, W., Volinsky, C., Johnson, T., Cortes, C., Pregibon, D.: Squashing flat files flatter.
In: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, KDD ’99, pp. 6–15 (1999)
7. Dunteman, G.: Principal Components Analysis. SAGE Publications, Newbury Park (1989)
8. Fukunaga, K.: Introduction to Statistical Pattern Recognition, 2nd edn. Academic Press Pro-
fessional, Inc., San Diego (1990)
9. Gan, G., Ma, C., Wu, J.: Data Clustering—Theory, Algorithms, and Applications. SIAM,
Philadelphia (2007)
10. Girolami, M., He, C.: Probability density estimation from optimally condensed data samples.
IEEE Trans. Pattern Anal. Mach. Intell. 25(10), 1253–1264 (2003)
11. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers
Inc., San Francisco (2011)
12. Hart, P.E.: The condensed nearest neighbor rule. IEEE Trans. Inf. Theory 14, 515–516 (1968)
13. Hwang, J., Lay, S., Lippman, A.: Nonparametric multivariate density estimation: a comparative
study. IEEE Trans. Signal Process. 42, 2795–2810 (1994)
14. Jain, A., Zongker, D.: Feature selection: evaluation, application, and small sample performance.
IEEE Trans. Pattern Anal. Mach. Intell. 19(2), 153–158 (1997)
15. Jain, A.K., Murty, M.N., Flynn, P.J.: Data clustering: A review. ACM Comput. Surv. 31(3),
264–323 (1999)
16. Jain, A.K.: Data clustering: 50 years beyond k-means. Pattern Recogn. Lett. 31(8), 651–666
(2010)
17. Johnson, R.A., Wichern, D.W.: Applied Multivariate Statistical Analysis. Prentice-Hall, Engle-
wood Cliffs (2001)
18. Kim, J.O., Mueller, C.W.: Factor Analysis: Statistical Methods and Practical Issues (Quantita-
tive Applications in the Social Sciences). Sage Publications, Inc, Beverly Hills (1978)
19. Kohonen, T.: The self organizing map. Proc. IEEE 78(9), 1464–1480 (1990)
20. Madigan, D., Raghavan, N., DuMouchel, W., Nason, M., Posse, C., Ridgeway, G.: Likelihood-
based data squashing: a modeling approach to instance construction. Data Min. Knowl. Disc.
6(2), 173–190 (2002)
21. Mitra, P., Murthy, C.A., Pal, S.K.: Density-based multiscale data condensation. IEEE Trans.
Pattern Anal. Mach. Intell. 24(6), 734–747 (2002)
22. Nisbet, R., Elder, J., Miner, G.: Handbook of Statistical Analysis and Data Mining Applications.
Academic Press, Boston (2009)
23. Owen, A.: Data squashing by empirical likelihood. Data Min. Knowl. Disc. 7, 101–113 (2003)
24. Refaat, M.: Data Preparation for Data Mining Using SAS. Morgan Kaufmann Publishers Inc.,
San Francisco (2007)
25. Roweis, S., Saul, L.: Nonlinear dimensionality reduction by locally linear embedding. Science
290(5500), 2323–2326 (2000)
26. Tenenbaum, J.B., Silva, V., Langford, J.C.: A global geometric framework for nonlinear dimen-
sionality reduction. Science 290(5500), 2319–2323 (2000)
27. Xu, R., Wunsch, D.: Survey of clustering algorithms. IEEE Trans. Neural Networks 16(3),
645–678 (2005)
Chapter 7
Feature Selection
Abstract In this chapter, one of the most commonly used techniques for dimensionality and data reduction will be described. The feature selection problem will be discussed and its main aspects and methods will be analyzed. The chapter starts with the topic's theoretical background (Sect. 7.1), divides it into the major perspectives (Sect. 7.2) and then covers the main aspects, including applications and the evaluation of feature selection methods (Sect. 7.3). From this point on, Sect. 7.4 makes a tour from the classical approaches to the most advanced proposals. Focusing on hybridizations, better optimization models and derivative methods related to feature selection, Sect. 7.5 provides a summary of related and advanced topics, such as feature construction and feature extraction. An enumeration of some comparative experimental studies conducted in the specialized literature is included in Sect. 7.6.
7.1 Overview
Generally, the objective of FS is to identify the features in the data set which
are important, and discard others as redundant or irrelevant. Since FS reduces the
dimensionality of the data, DM algorithms, especially the predictive ones, can operate
faster and obtain better outcomes by using FS. The main reason for this improvement is the easier and more compact representation of the target concept [6].
Reasons for performing FS may include [48]:
• removing irrelevant data;
• increasing predictive accuracy of learned models;
• reducing the cost of the data;
• improving learning efficiency, such as reducing storage requirements and compu-
tational cost;
• reducing the complexity of the resulting model description, improving the under-
standing of the data and the model.
7.2 Perspectives
Although FS is used for all types and paradigms of learning, the best known and most common field of application is classification, and we will focus our efforts mainly on it. The problem of FS can be explored from many perspectives. The four most important are: (1) searching for the best subset of features, (2) criteria for evaluating different subsets, (3) the principle for selecting, adding, removing or changing features during the search and (4) applications.
First of all, FS is considered as a search problem for an optimal subset of features, for general or specific purposes, depending on the learning task and kind of algorithm. Secondly, a survey of evaluation criteria is needed to determine proper applications. Third, the method used to generate and evaluate the features is crucial, and allows methods to be categorized according to the direction of the search process; the consideration of univariate or multivariate evaluation is also a key factor in FS. Lastly, we will specifically study the interaction between FS and classification.
FS can be considered as a search problem, where each state of the search space corresponds to a concrete subset of selected features. The selection can be represented as a binary array, with each element taking the value 1 if the feature is currently selected by the algorithm and 0 if it is not. Hence, there is a total of 2^M subsets, where M is the number of features of the data set. A simple case of the search space for three features is depicted in Fig. 7.1. The optimal subset would lie between the beginning and the end of this graph.
More realistic problems in DM have far more than three features, so the search cannot exhaustively explore the whole space of feature subsets. A search direction must be specified and different search strategies should be adopted to try to reach optimal subsets in real problems.
With no prior knowledge about the problem, the search for an optimal subset can equally well start from the empty set, by inserting new features, or from the full set, by removing features. These two are the basic directions of search: Sequential Forward Generation (SFG) and Sequential Backward Generation (SBG).
Algorithm 1 provides details for SFG. Here, the best feature f is chosen by FindNext(F); f is added to S and removed from F, growing S and shrinking F. There can be two stopping criteria: either the set F becomes empty or S satisfies the measure U. Adopting only the first criterion, we can obtain a ranked list of features.
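A minimal sketch of SFG under these conventions is shown below; FindNext is realized through a generic evaluation function, and both the scoring callable and the stopping criterion are placeholders for whatever measure U is chosen. Using only the first stopping criterion yields a ranked list.

```python
def sfg(features, evaluate, satisfies_u=lambda subset: False):
    """Sequential Forward Generation sketch.

    features    -- list of candidate feature indices (the set F)
    evaluate    -- callable scoring a candidate subset (higher is better)
    satisfies_u -- optional stopping criterion on the selected subset S
    """
    F = list(features)
    S = []                                   # selected features, in order of inclusion
    while F and not satisfies_u(S):
        # FindNext(F): the feature whose addition to S scores best.
        best = max(F, key=lambda f: evaluate(S + [f]))
        S.append(best)                       # S grows ...
        F.remove(best)                       # ... and F shrinks
    return S                                 # a ranked list if satisfies_u never fires

# Toy usage: a stand-in measure that simply prefers low feature indices.
ranking = sfg(range(5), evaluate=lambda subset: -sum(subset))
print(ranking)
```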
In contrast to SFG, Algorithm 2 illustrates the SBG direction, where the search starts with the full set F and finds a feature subset S by removing one feature at a time. The function GetNext(F) finds the least relevant feature f, which is then removed from F. In this case F also shrinks and S grows, but S only stores the irrelevant features.
Other search directions are motivated by the usual case in which the optimal subset lies somewhere between the beginning and the end of the search space. It is then very intuitive to start from both ends and perform a bidirectional search. The probability of finding the optimal subset increases because a search in one direction usually finishes faster than in the other. When the number of relevant features m is smaller than M/2, SFG is faster; otherwise, if m is greater than M/2, SBG is faster. As the value of m is usually unknown, it is impossible to know in advance which scheme will be faster. Thus, bidirectional generation makes sense, and it is shown in operation in Algorithm 3: SFG and SBG are run in parallel, and the search stops as soon as either finds a satisfactory subset.
Finally, there is a scheme that does not pursue any fixed direction; instead, the direction is chosen randomly.
• Random Generation (RG): It starts the search in a random direction. The choice of adding or removing a feature is a random decision. RG tries to avoid stagnation in local optima by not following a fixed path for subset generation. Unlike SFG or SBG, the size of the subset of features cannot be stipulated.
A random generation scheme produces subsets at random, relying on a good random number generator, wrapped in a function called RandGen(F), such that every combination of the features in F has a chance to occur, and to occur only once. This scheme is summarized in Algorithm 4, where S is a subset of features.
From now on, we can combine these search directions with a suitable search strategy to design the best FS algorithm for the problem at hand. Brute force search does not make sense when M is large. The more resources we spend during the search process, the better the subset we may find. However, when the resources are finite, as is usual, we have to give up some of the optimality of the selected subsets. The purpose, then, is to reach a good trade-off between optimality and the amount of resources required. Since this is not an easy task, three main categories summarize the search strategies:
• Exhaustive Search: It explores all possible subsets to find the optimal ones. As said before, the space complexity is O(2^M). If we establish a threshold m on the minimum number of features to be selected and fix the direction of search, the search space is (M choose 0) + (M choose 1) + · · · + (M choose m), independent of forward or backward generation. Only exhaustive search can guarantee optimality; nevertheless, we can still find an optimal subset without visiting all possible states by using exact algorithms such as backtracking and branch and bound [38]. Even so, these are also impractical in real data sets with a high M.
• Heuristic Search: It employs heuristics to guide the search. It thus avoids brute force search, but at the risk of finding a non-optimal subset of features. It draws a path connecting the beginning and the end of the graph in Fig. 7.1, in the manner of a depth-first search. The maximum length of this path is M and the number of subsets generated is O(M). The choice of the heuristic is crucial for finding a near-optimal subset of features quickly.
• Nondeterministic Search: This third category arises as a complementary combination of the previous two. It is also known as the random search strategy, and it keeps generating candidate subsets and improving the quality of the selected features as time goes by. In each step, the next subset is obtained at random. There are two properties of this type of search strategy: (1) it is unnecessary to wait until the search ends, and (2) we do not know when the optimal set has been obtained, although we do know which subset is better than the previous one and which one is the best so far.
After studying the essential search strategies and directions, the next issue to tackle is the measurement of the quality or goodness of a feature. Some evaluation measure is needed to distinguish the best or optimal subset of features, and such an evaluation is common to all FS techniques. Usually, the evaluation metrics work in two ways: (1) supporting the performance in terms of efficacy and (2) supporting the performance in terms of efficiency or yielding more understandable outcomes. For instance, in classification, the main problem addressed in this chapter, the primary objective is to maximize predictive accuracy.
Information measures quantify the uncertainty of the receiver when she/he receives a message. If the receiver understands the message, her/his associated uncertainty is low; but if the receiver is not able to understand the message at all, all messages have almost equal probability of being received and the uncertainty increases. In the context of predictive learning, the message is the output feature, i.e., the class in classification. An uncertainty measure U is defined so that larger values of U represent higher levels of uncertainty.
Given an uncertainty function U and the prior class probabilities P(c_i), i = 1, 2, ..., C, where C is the number of classes, the information gain from a feature A, IG(A), is defined as the difference between the prior uncertainty Σ_i U(P(c_i)) and the expected posterior uncertainty using A, i.e.,

IG(A) = Σ_i U(P(c_i)) − E[ Σ_i U(P(c_i|A)) ]

where, by Bayes' theorem,

P(c_i|x) = P(c_i)P(x|c_i) / P(x),    P(x) = Σ_i P(c_i)P(x|c_i)
A feature evaluation rule inferred from the concept of information gain states that feature A_i is chosen over feature A_j if IG(A_i) > IG(A_j); that is, if A_i reduces more uncertainty than A_j. Since Σ_i U(P(c_i)) is independent of the features, we can rewrite the rule as: A_i is preferred to A_j if U'(A_i) < U'(A_j), where U'(A) = E[ Σ_i U(P(c_i|A)) ]. This idea is used in C4.5 [43] for selecting a feature to generate new branches.
A commonly used uncertainty function is Shannon's entropy,

− Σ_i P(c_i) log₂ P(c_i).
Applied to a data set D that a discrete feature A splits into p partitions D^A_1, ..., D^A_p, the entropy of each partition is

I(D^A_j) = − Σ_{i=1}^{C} P_{D^A_j}(c_i) log₂ P_{D^A_j}(c_i),

and the information gain of A is

IG(A) = I(D) − Σ_{j=1}^{p} (|D_j| / |D|) · I(D^A_j),

where |D| is the number of instances in D and P_D(c_i) are the prior class probabilities estimated on data D.
Information gain has a tendency to favor features with more distinct values. To balance this effect, the information gain ratio was suggested in [43]. It is worth mentioning that it only applies to discrete features; for continuous ones, we have to find the split point with the highest gain or gain ratio among the sorted values in order to split the values into two segments. Then, information gain can be computed as usual.
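A small sketch of this computation for a discrete feature follows, with probabilities estimated by simple counting; the helper names are our own.

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Shannon entropy: -sum_i P(c_i) log2 P(c_i), estimated from counts."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(feature_values, labels):
    """IG(A) = I(D) - sum_j |D_j|/|D| * I(D_j), for a discrete feature A."""
    n = len(labels)
    prior = entropy(labels)
    posterior = 0.0
    for value in set(feature_values):
        subset = [c for a, c in zip(feature_values, labels) if a == value]
        posterior += len(subset) / n * entropy(subset)
    return prior - posterior

# Toy example: the feature perfectly predicts the class, so IG equals I(D) = 1 bit.
A = ["a", "a", "b", "b"]
C = ["yes", "yes", "no", "no"]
print(information_gain(A, C))
```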
A typical distance measure evaluates a feature A_j by how much the class posterior departs from the prior across the values of A_j, e.g.,

V(A_j) = Σ_i ∫ P(c_i) (P(c_i|A_j = a) − P(c_i))² P(A_j = a) da.
Moreover, some of the most common distance measures for numeric variables
used in FS are summarized in Table 7.1.
Dependence measures are also known as measures of association or correlation. Their main goal is to quantify how strongly two variables are correlated, or present some association with each other, in such a way that knowing the value of one of them allows us to predict the value of the other. In feature evaluation, the common procedure is to measure the correlation between each feature and the class. Denoting by R(A) a dependence measure between feature A and class C, we choose feature A_i over feature A_j if R(A_i) > R(A_j). In other words, the feature most correlated with the class is chosen. If A and C are statistically independent, they are not correlated and removing A should not affect the class separability with respect to the rest of the features. Otherwise, the feature should be selected, because it can explain, at least in part, the behavior of the class.
One of the most used dependence measures is the Pearson correlation coefficient, which measures the degree of linear correlation between two variables. For two variables X and Y with measurements {x_i} and {y_i} and means x̄ and ȳ, it is given by

ρ(X, Y) = Σ_i (x_i − x̄)(y_i − ȳ) / [ Σ_i (x_i − x̄)² · Σ_i (y_i − ȳ)² ]^(1/2)
If two variables are highly correlated (ρ ≈ ±1), one of them can be removed. However, linear correlations are not able to detect relationships that are not linear. Correlations with respect to the target variable can also be computed in order to estimate the relationship between any attribute and the class. This requires coding the target class as a binary vector.
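For instance, a correlation-based ranking of numeric features against a binary-coded class might be sketched as follows (NumPy's corrcoef is used for ρ; the data are synthetic and the function name is our own).

```python
import numpy as np

def rank_by_correlation(X, y):
    """Rank features of X (n_samples x n_features) by |Pearson correlation| with y."""
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    order = np.argsort(scores)[::-1]          # most correlated feature first
    return order, scores[order]

rng = np.random.default_rng(5)
y = rng.integers(0, 2, size=200)              # class coded as a binary vector
X = rng.normal(size=(200, 4))
X[:, 2] += 2.0 * y                            # make feature 2 informative
order, scores = rank_by_correlation(X, y)
print(order.tolist(), np.round(scores, 2).tolist())
```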
Again, these types of measures are closely related to information and distance
measures. Features with a strong association with the class are good features for
predictive tasks. One of the most used dependence measures is the Bhattacharyya
dependence measure B, defined as:
B(A_j) = − log Σ_i ∫ P(c_i) P(A_j = a|c_i) P(A_j = a) da
The previous measures attempt to find the features that best separate the classes, but they are not able to detect whether some of them are redundant. Consistency measures, on the other hand, attempt to find a minimum number of features that separate the classes as consistently as the full set of features does. They aim to achieve P(C|FullSet) = P(C|SubSet). Feature evaluation rules derived from consistency measures state that we should select the minimum subset of features that can maintain the consistency of the data as observed with the full set of features. An inconsistency is defined as two examples with the same inputs (same feature values) but different output feature values (classes in classification). Using such measures, both irrelevant and redundant features can be removed.
This form of evaluation relies on the classifier or learner itself. Among the various possible subsets of features, the subset which yields the best predictive accuracy is chosen. This family is distinguished from the previous four by the fact that it is directly focused on improving the accuracy of the same learner used in the DM task. However, we have to take some considerations into account. Firstly, how to truly estimate the predictive accuracy while avoiding over-fitting. Secondly, it is important to take into account the time required by the DM model to learn from the data (usually, classifiers perform more complex tasks than the computation of any of the four measures seen above). Lastly, the subset of features could be biased towards a single learning model, producing subsets of features that do not generalize.
Table 7.2 summarizes the computation of the accuracy metric and some derivatives that have been used in FS. The notation used is: tp, true positives; fp, false positives; fn, false negatives; tn, true negatives; tpr = tp/(tp + fn), sample true positive rate; fpr = fp/(fp + tn), sample false positive rate; precision = tp/(tp + fp); recall = tpr.
Table 7.2 Accuracy metric and derivatives for a two-class (positive class and negative class) problem

Measure             Mathematical form
Accuracy            (tp + tn) / (tp + tn + fp + fn)
Error rate          1 − Accuracy
Chi-squared         n(fp × fn − tp × tn)² / [(tp + fp)(tp + fn)(fp + tn)(tn + fn)]
Information gain    e(tp + fn, fp + tn) − [(tp + fp)·e(tp, fp) + (tn + fn)·e(fn, tn)] / (tp + fp + tn + fn),
                    where e(x, y) = −(x/(x + y)) log₂(x/(x + y)) − (y/(x + y)) log₂(y/(x + y))
Odds ratio          [tpr/(1 − tpr)] / [fpr/(1 − fpr)] = (tp × tn) / (fp × fn)
Probability ratio   tpr / fpr
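These quantities follow directly from the confusion counts; a small sketch computing several of them under the formulas of Table 7.2 is shown below (the function name and the example counts are our own).

```python
import math

def two_class_metrics(tp, fp, fn, tn):
    """Accuracy-based metrics from Table 7.2 for a two-class problem."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    error_rate = 1.0 - accuracy
    chi_squared = (n * (fp * fn - tp * tn) ** 2 /
                   ((tp + fp) * (tp + fn) * (fp + tn) * (tn + fn)))
    tpr = tp / (tp + fn)          # recall / sample true positive rate
    fpr = fp / (fp + tn)          # sample false positive rate
    odds_ratio = (tp * tn) / (fp * fn) if fp * fn else math.inf
    probability_ratio = tpr / fpr if fpr else math.inf
    return {"accuracy": accuracy, "error rate": error_rate,
            "chi-squared": chi_squared, "odds ratio": odds_ratio,
            "probability ratio": probability_ratio}

print(two_class_metrics(tp=40, fp=10, fn=5, tn=45))
```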
The most widely known and employed categorization of FS methods over the years [33] distinguishes three famous categories of feature selectors: filter, wrapper and embedded. In the following, we detail each of them.
7.2.3.1 Filters
In the filter model, FS is performed as a preprocessing step, independently of the learning algorithm: the features are first selected according to general measures computed from the data (such as information, distance, dependence or consistency measures), and only then does the usual learning and testing process take place, in which we obtain the predictive accuracy on test data.
The filter model has several properties:
• Measuring uncertainty, distances, dependence or consistency is usually cheaper than measuring the accuracy of a learning process. Thus, filter methods are usually faster.
• It does not rely on a particular learning bias, in such a way that the selected features can be used to learn different models with different DM techniques.
• It can handle larger sized data, due to the simplicity and low time complexity of the evaluation measures.
7.2.3.2 Wrappers
One can think of the simplest form of FS as engaging a classifier as the evaluator for deciding the insertion or deletion of a certain feature in the subset, using some metric of predictive performance. The aim is straightforward: to achieve the highest possible predictive accuracy by selecting the features that accomplish this for a fixed learning algorithm. This is the so-called wrapper model.
In other words, the wrapper approach [23] uses a learning algorithm as a black box, together with statistical validation (CV, for example) to avoid over-fitting, in order to select the best feature subset according to a predictive measure.
Wrapper models can achieve the purpose of improving the particular learner's predictive performance. A wrapper model consists of two stages (see Fig. 7.3): (1) feature subset selection, which selects the best subset using the accuracy provided by the learning algorithm (on training data) as the criterion; and (2) learning and testing, which is the same as in the filter model. Since we keep only the best subset found in stage 1, the DM model must then be learned again with that subset. Stage 1 corresponds to the data reduction task.
It is well known that the accuracy estimated on the training data may not reflect the accuracy on test data, so the key issue is how to truly estimate or generalize the accuracy over unseen data. In fact, the goal is not exactly to increase the predictive accuracy, but rather to improve the DM algorithm in question overall. Solutions come from the usage of internal statistical validation to control over-fitting, ensembles of learners [41] and hybridizations with heuristic learning like Bayesian classifiers or decision tree induction [23]. However, when the data size is huge, the wrapper approach may not be applicable because the learning method is not able to manage all the data.
Compared with wrappers, filter models cannot allow a learning algorithm to fully exploit its bias, whereas wrapper methods do.
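A hedged sketch of the wrapper idea follows, wrapping a decision tree and cross-validation from scikit-learn (assumed available) around a greedy forward search; the function name and stopping rule are our own simplifications, not a reference implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def wrapper_forward_selection(X, y, max_features=None):
    """Greedy forward wrapper: add the feature that most improves CV accuracy."""
    remaining = list(range(X.shape[1]))
    selected, best_score = [], 0.0
    while remaining and (max_features is None or len(selected) < max_features):
        # Score every candidate subset with the learner itself (internal CV).
        scores = {f: cross_val_score(DecisionTreeClassifier(random_state=0),
                                     X[:, selected + [f]], y, cv=5).mean()
                  for f in remaining}
        f_best = max(scores, key=scores.get)
        if scores[f_best] <= best_score:      # stop when accuracy no longer improves
            break
        best_score = scores[f_best]
        selected.append(f_best)
        remaining.remove(f_best)
    return selected, best_score

rng = np.random.default_rng(6)
y = rng.integers(0, 2, size=300)
X = rng.normal(size=(300, 6))
X[:, 1] += y                                   # only feature 1 is informative
print(wrapper_forward_selection(X, y, max_features=3))
```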
7.2.3.3 Embedded

The embedded approach [16] is similar to the wrapper approach in the sense that the features are specifically selected for a certain learning algorithm; but, in this approach, the features are selected during the learning process itself.
7.3 Aspects
This section is devoted to the discussion of several important aspects related to FS. Each subsection deals with one general facet of FS, but they are neither directly connected nor sorted according to any particular criterion. For more advanced and more recent developments of FS, please see Sect. 7.5 of this chapter.
From the point of view of their output, FS methods can be grouped into two categories. The first one consists of ranking the features according to some evaluation criterion; the other consists of choosing a minimum set of features that satisfies an evaluation criterion. Next, we explain both of them in more detail.
In this category of methods, the expected output is a list of features ranked according to some evaluation measure. The measure can be of any type: information, distance, dependence, consistency or accuracy. Thus, a feature selector belonging to this family does not report a minimum subset of features; instead, it returns the relevance of each feature.
The basic idea consists of evaluating each feature with the chosen measure and attaching the resulting value to each feature. Then, the features are sorted according to these values. The run time complexity of this algorithm is O(MN + M²), where M is the number of features and N the number of instances. There are many variations of this scheme that give rise to different FS methods; the common property is that the outcome is a ranked list of features. Algorithm 5 summarizes the operation of a univariate feature ranking technique.
For performing actual FS, the simplest way is to choose the first m features for the task at hand, provided we know the most appropriate value of m. However, this is rarely the case, and there is no straightforward procedure to obtain m. One solution is to build DM models repeatedly, adding features until the generalization error stops decreasing.
The number of relevant features is a parameter that is often unknown to the practitioner. Hence, there is a second category of techniques focused on obtaining the minimum possible subset without ordering the features. An algorithm belonging to this category returns a minimum feature subset, and no distinction is made among the features in the subset: what is in the subset is considered relevant, and everything outside it irrelevant.
The minimum subset algorithm is detailed in Algorithm 6. The subsetGenerate() function returns a subset following a certain search method, in which a stopping criterion determines when stop is set to true; the function legitimacy() returns true if subset S_k satisfies the measure U. Function subsetGenerate() can follow any of the generation schemes described earlier.
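A sketch of the minimum-subset scheme using the inconsistency criterion described earlier, with random subset generation in the spirit of the LVF family, is given below; the helper names and the simple majority-based inconsistency count are our own.

```python
import random
from collections import Counter, defaultdict

def inconsistency_count(rows, labels, subset):
    """Number of inconsistent examples: same projected inputs, different classes."""
    groups = defaultdict(Counter)
    for row, c in zip(rows, labels):
        groups[tuple(row[j] for j in subset)][c] += 1
    # Within each group, every example not in the majority class is inconsistent.
    return sum(sum(cnt.values()) - max(cnt.values()) for cnt in groups.values())

def min_subset_random_search(rows, labels, n_features, max_tries=2000, seed=0):
    """Keep the smallest random subset as consistent as the full feature set."""
    rng = random.Random(seed)
    allowed = inconsistency_count(rows, labels, range(n_features))
    best = list(range(n_features))
    for _ in range(max_tries):
        k = rng.randint(1, len(best))                       # never grow past the best so far
        candidate = rng.sample(range(n_features), k)
        if inconsistency_count(rows, labels, candidate) <= allowed and k < len(best):
            best = candidate
    return sorted(best)

rows = [(0, 1, 0), (0, 1, 1), (1, 0, 0), (1, 0, 1)]
labels = ["A", "A", "B", "B"]                               # feature 0 alone suffices
print(min_subset_random_search(rows, labels, n_features=3))
```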
7.3.2 Evaluation
Several aspects must be taken into account to evaluate a FS method. Among them, it is important to know how well a feature selector works and the conditions under which it works. It is common to compare one method against another, or against a previously proposed model, in order to justify its benefits and also to find out when it does not work.
To measure the concerns described above, one has to appeal to quantitative measures that, together, define the performance of a method. Performance can be seen as a list of objectives and, for FS, the list is basically composed of three main goals:
• Inferability: For predictive tasks, assumed to be the main purpose for which FS is developed, this is the improvement in the prediction of unseen examples with respect to the direct usage of the raw training data. In other words, the model or structural representation obtained by the DM algorithm from the subset of features should have better predictive capability than that built from the original data.
• Interpretability: Again considering predictive tasks, this goal relates to the model generated by the DM algorithm. Given that raw data are hard for humans to comprehend, DM is also used for generating more understandable structural representations that can explain the behavior of the data. It is natural to pursue the simplest possible structural representation, because the simpler a representation is, the easier it is to interpret. This goal is at odds with accuracy.
• Data Reduction: Closely related to the previous goal, but in this case referring to the data itself, without involving any DM algorithm. It is better and simpler, from any point of view, to handle data with lower dimensionality, both for efficiency and for interpretability. However, evidence shows that it is not true that the greater the reduction in the number of features, the better the understandability.
Ideally, we would like to improve all three goals at the same time. However, this is a multi-objective optimization problem with conflicting sub-objectives, and it is necessary to find a good trade-off depending on the practice or application in question. We can derive three assessment measures from these three goals, to be evaluated independently:
• Accuracy: The most commonly used measure to estimate the predictive power and generalizability of a DM algorithm. A high accuracy shows that a learned model works well on unseen data.
• Complexity: It indirectly measures the interpretability of a model. A model is structured as a union of simpler elements, so if the number of such elements is low, the complexity is also low. For instance, a decision tree is composed of branches, leaves and nodes as its basic elements. In a standard decision tree, the number of leaves is equal to the number of branches, although there may be branches of different lengths. The number of nodes in a branch can define the complexity of that branch. Even within each node, the mathematical expression used for splitting the data can contain one or more comparisons or operators. Taken together, the count of all of these elements may define the complexity of a representation.
• Number of features selected: A measure for assessing the size of the data. Smaller data sets mean fewer potential hypotheses to be learned, faster learning and simpler results.
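As a rough, non-authoritative sketch of how these three measures can be computed in practice (assuming scikit-learn is available and taking a decision tree's node count as the complexity proxy), consider:

# Assess a candidate feature subset on the three measures discussed above:
# cross-validated accuracy, model complexity (tree node count) and the
# number of features retained. X is a NumPy matrix, y the class labels.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def assess_subset(X, y, subset):
    tree = DecisionTreeClassifier(random_state=0)
    accuracy = cross_val_score(tree, X[:, subset], y, cv=10).mean()
    tree.fit(X[:, subset], y)
    complexity = tree.tree_.node_count   # proxy for interpretability
    return accuracy, complexity, len(subset)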
7.3.3 Drawbacks
The usage of decision trees for FS has one major advantage known as “anytime”
[47]. Decision trees can be used to implement a trade-off between the performance
of the selected features and the computation time which is required to find a subset.
Decision tree inducers can be considered as anytime algorithms for FS, due to the
fact that they gradually improve the performance and can be stopped at any time,
providing sub-optimal feature subsets. In fact, decision trees have been used as an
evaluation methodology for directing the FS search.
The rest of this section presents some representative FS algorithms that are considered classical in the specialized literature. The algorithms are described and their components are identified using the elements described above. We categorize them according to their search strategy in the next subsections. We also provide a discussion of one variant: feature weighting schemes.
Exhaustive methods cover the whole search space. We find six combinations (C1–C6) within this category, corresponding to a forward or a backward search (growing or gradually reducing the subset of selected features, respectively) combined with any kind of evaluation measure that must be satisfied at each step.
In particular, we will detail the combination C2, corresponding to the consistency measure with forward search. Here, the most famous method is Focus [2]. It considers all the combinations of features starting from an empty subset: the $\binom{M}{1}$ subsets of size one first, the $\binom{M}{2}$ subsets of size two next, etc. When Focus finds a subset that satisfies the consistency measure, it stops. The details of this method are shown in Algorithm 7. Focus needs to generate $\sum_{i=1}^{m}\binom{M}{i}$ subsets in order to find a minimum subset of m features that satisfies the consistency criterion. If m is not small, the runtime is quite prohibitive. Heuristic variations of Focus replace the pure consistency objective with a definition of good subsets.
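A minimal sketch of this exhaustive forward strategy is given below (in Python, with is_consistent() as a hypothetical stand-in for the consistency check on the data); it is not the authors' Algorithm 7, only an illustration of the search order:

# Focus-style search: enumerate subsets of size 1, 2, ... and return the
# first (hence minimum-size) subset that satisfies the consistency measure.
from itertools import combinations

def focus_search(X, y, is_consistent):
    n_features = X.shape[1]
    for size in range(1, n_features + 1):
        for subset in combinations(range(n_features), size):
            if is_consistent(X[:, list(subset)], y):
                return list(subset)
    return list(range(n_features))       # no smaller consistent subset exists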
Random search methods, also known as stochastic methods, add or remove features to and from a subset without a sequential order, allowing the search to follow feature subsets that are randomly generated.
Common techniques in this category are genetic algorithms and simulated annealing. In [18], a comparison of genetic algorithms with sequential methods is presented; it is very difficult to compare them fairly because of the parameter adjustment. The
three combinations C16, C17 and C18 constitute the stochastic methods, differing in the evaluation measure. In them, feature subsets are randomly generated and a fitness function, closely related to the evaluation measures, is defined.
As the most typical stochastic techniques, we will discuss two methods here: LVF (C17) and LVW (C18).
LVF is the acronym of Las Vegas Filter FS [31]. It consists of a random procedure that generates random subsets of features and an evaluation procedure that checks whether each subset satisfies the chosen measure. For more details about LVF, see Algorithm 9. The evaluation measure used in LVF is the inconsistency rate. It receives as a parameter the allowed inconsistency rate, which can be estimated by the inconsistency rate of the data considering all features. The other parameter is the maximum number of subsets to be generated in the process, which acts as a stopping criterion.
In Algorithm 9, maxTries is a number proportional to the number of original features (i.e., l × M, where l is a pre-defined constant). The rule of thumb is that the more features the data has, the more difficult the FS task is. Another way of setting maxTries is to relate it to the size of the search space we want to explore. If the complete search space has size 2^M and we want to cover p % of the entire space, then l = 2^M · p %.
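The following Python sketch illustrates the LVF loop under these assumptions; the inconsistency_rate() helper is a simplified stand-in for the measure described above and is not the authors' implementation:

# LVF sketch: draw random subsets and keep the smallest one whose
# inconsistency rate does not exceed the allowed threshold gamma.
import random
from collections import Counter, defaultdict

def inconsistency_rate(X, y):
    # group identical feature-value patterns and count non-majority labels
    groups = defaultdict(Counter)
    for row, label in zip(map(tuple, X), y):
        groups[row][label] += 1
    inconsistent = sum(sum(c.values()) - max(c.values()) for c in groups.values())
    return inconsistent / len(y)

def lvf(X, y, gamma, max_tries, seed=0):
    rng = random.Random(seed)
    n = X.shape[1]
    best = list(range(n))
    for _ in range(max_tries):
        subset = sorted(rng.sample(range(n), rng.randint(1, n)))
        if len(subset) <= len(best) and inconsistency_rate(X[:, subset], y) <= gamma:
            best = subset
    return best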
LVF can easily be modified due to the simplicity of the algorithm. Changing the evaluation measure is the only thing we can do to keep it within this category. If we decide to use the accuracy measure, we obtain the LVW (Las Vegas Wrapper FS) method. For estimating the classifier's accuracy, we usually draw on statistical validation, such as 10-FCV. LVF and LVW can be very different in terms of run time as well as the subsets selected. Another difference between LVW and LVF is that the learning algorithm LA requires its input parameters to be set. The function estimate() in LVW replaces the function CalU() in LVF (see Algorithm 10). Probabilistic measures (category C16) can also be used in LVF.
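For illustration, the wrapper evaluation of LVW can be sketched as follows (a hypothetical estimate() built on scikit-learn; the learner LA is instantiated here as a 1NN classifier purely as an example):

# LVW-style evaluation: estimate() replaces the inconsistency check of LVF
# with the 10-FCV accuracy of the learning algorithm LA on the subset.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def estimate(X, y, subset):
    la = KNeighborsClassifier(n_neighbors=1)      # example choice of LA
    return cross_val_score(la, X[:, subset], y, cv=10).mean()

# Within the random-subset loop of LVF, a candidate would then be kept when
# estimate(X, y, candidate) improves on the best accuracy found so far,
# breaking ties in favour of the smaller subset.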
Relief, proposed in [22], must be mentioned in any FS topic review; it selects features that are statistically relevant. Although its goal is still to select features, it does not explicitly generate feature subsets and test them like the methods reviewed above. Instead of generating feature subsets, Relief focuses on sampling instances without an explicit search for feature subsets. This follows the idea that relevant features are those whose values can distinguish among instances that are close to each other. Hence, two nearest neighbors (belonging to different classes in a two-class problem) are found for each given instance I: one is the so-called near-hit H and the other is the near-miss J. We expect a feature to be relevant if its values are the same between I and H, and different between I and J. This checking can be carried out in terms of some distance between feature values, which should be minimum for I and H and maximum for I and J. The distance of each feature for each randomly chosen instance is accumulated in a weight vector w with the same number of dimensions as the number of features. The relevant features are those whose weights exceed a relevance threshold τ, which can be statistically estimated. The parameter m is the sample size, and a larger m produces a more reliable approximation. The algorithm is presented in Algorithm 11. It does not fit into any of the categories described in the previous section, although it evaluates features using distance measures.
The main advantage of Relief is that it can handle both discrete and continuous data by using distance measures that work with categorical values. On the other hand, its main weakness is that it is limited to two-class data, although some extensions for multiple classes have been proposed, such as ReliefF [25].
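A compact sketch of the original two-class Relief weight update is shown below (Python/NumPy, assuming numeric features scaled to [0, 1] so that per-feature differences are comparable); it is only illustrative and omits the refinements of ReliefF:

# Relief sketch: sample m instances, accumulate |x - miss| - |x - hit| per
# feature, and keep the features whose averaged weight exceeds tau.
import numpy as np

def relief(X, y, m, tau, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(m):
        i = rng.integers(n)
        same = np.where(y == y[i])[0]
        same = same[same != i]
        diff = np.where(y != y[i])[0]
        hit = same[np.argmin(np.linalg.norm(X[same] - X[i], axis=1))]   # near-hit H
        miss = diff[np.argmin(np.linalg.norm(X[diff] - X[i], axis=1))]  # near-miss J
        w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return np.where(w / m > tau)[0]        # indices of the relevant features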
FS is, for most researchers, the basic data preprocessing technique, especially after the year 2000. Unfortunately, the related literature is huge and quite chaotic, and it is difficult to understand or categorize the differences among the hundreds of algorithms published, due to the different conventions and notations adopted. These are the major reasons why it is not possible to summarize all the proposed feature selectors in this book. Instead of describing individual approaches, we prefer to focus attention on the main ideas that lead to updates and improvements with respect to the classical FS methods reviewed in the previous sections. We intend to describe the most influential methods and ideas (which are usually published in highly cited papers) and the most recent and promising techniques published in high quality journals in the DM, ML and Pattern Recognition fields.
Modifications of classical feature selectors cover a vast number of proposals in the literature. Among the most representative, we can emphasize some relevant approaches. For example, in [28], the authors proposed an extension of the MIFS algorithm under uniform information distribution (MIFS-U), and the combination of the greedy search of MIFS with the Taguchi method. The same authors, in [27], presented a speed-up of MIFS based on Parzen windows, allowing the computation of MI without requiring a large amount of memory. Further advances on MI include the minimal-redundancy-maximal-relevance (mRMR) criterion for incremental FS [42]. Noting that, in traditional feature selectors, MI is estimated on the whole sampling space, the authors in [32] proposed evaluation by dynamic MI, which is estimated only on unlabeled instances. The normalized mutual information FS (NMIFS) is proposed in [14] as an enhancement over the classical MIFS, MIFS-U and mRMR methods. Here, the average normalized MI is proposed as a measure of redundancy among features. A unifying framework for information theoretic FS can be found in [8]. Another widely studied method is Relief and its derivatives. In [45], a theoretical and empirical analysis of this family of methods is conducted, concluding that they are robust and noise tolerant, and that their computational complexity can be alleviated by parallelism. Wrapper methods have been extensively studied using classifiers such as SVMs [34], or frameworks that jointly perform FS and SVM parameter learning [39].
Other criteria related to separability measures and recently developed for performing FS include the kernel class separability [57], which has been applied to a
variety of selection modes and different search strategies. In [15], the authors propose
With the proliferation of extremely high-dimensional data, two issues arise at the same time: FS becomes indispensable in any learning process, and the efficiency and stability of FS algorithms should not be neglected. One of the earlier studies regarding this issue can be found in [21]. The reduction of the FS task to a quadratic optimization problem is addressed in [46]. In that paper, the authors presented Quadratic Programming FS (QPFS), which uses the Nyström method for approximate matrix diagonalization, making it possible to deal with very large data sets. In their experiments, it outperformed mRMR and ReliefF using two evaluation criteria: Pearson's correlation coefficient and MI. In the presence of a huge number of irrelevant features and complex data distributions, a local learning based approach can be useful [53]. Using a prior stage that eliminates class-dependent density-based features before the feature ranking process can alleviate the effects of high-dimensional data sets [19]. Finally, and closely related to the emerging Big Data solutions for large-scale business data, there is a recent approach for massively parallel FS described in [63]. High-performance distributed computing architectures, such as Message Passing Interface (MPI) and MapReduce, are being applied to scale all kinds of algorithms to large data problems.
When class labels of the data are available, we can use supervised FS; otherwise, unsupervised FS is the appropriate choice. This family of methods usually involves the maximization of a clustering performance measure or the selection of features based on feature dependence, correlation and relevance. The basic principle is to remove those features carrying little or no information beyond that subsumed by the rest of the features. For instance, the proposal presented in [35] uses feature dependency/similarity for redundancy reduction, without requiring any search process. The process follows a clustering partitioning based on features and is governed by a similarity measure called the maximal information compression index. Other algorithms for unsupervised FS include the forward orthogonal search (FOS) [59], whose goal is to maximize the overall dependency on the data in order to detect significant variables. Ensemble learning has also been used in unsupervised FS [13]. In clustering, feature weighting has also been applied with promising results [36].
In feature extraction, we are interested in finding new features that are calculated as a function of the original features. In this context, DR is a mapping of a multidimensional space into a space of fewer dimensions.
The reader should be reminded that in Chap. 6 we denoted these techniques as DR techniques. The rationale behind this is that the literature has adopted this term to a greater extent than feature extraction, although both designations are correct. In fact, FS is a sub-family of the DR techniques, which seems logical. In this book, we have preferred to separate FS from the general DR task due to its influence in the research community. Furthermore, the aim of this section is to establish a link between the corresponding sections of Chap. 6 and the FS task.
As we have discussed in Chap. 6, PCA, factor analysis, MDS and LLE are the most relevant techniques proposed in this field.
Feature construction emerged from the replication problem observed in the models produced by DM algorithms; see, for example, the case of subtree replication in decision tree based learning. The main goal was to attach to the algorithms some mechanism to compound new features from the original ones, endeavouring to improve accuracy and to decrease model complexity.
The definition of feature construction as a data preprocessing task is the application of a set of constructive operators to a set of existing features, resulting in the generation of new features intended for use in the description of the target concept. Due to the fact that the new features are constructed from the existing ones, no new information is yielded. Such constructed features have been extensively applied in separate-and-conquer predictive learning approaches.
Many constructive operators have been designed and implemented. The most common operator used in decision trees is the product (see an illustration of the effect of this operator in Fig. 7.4). Other operators are equivalence (the value is true if two features satisfy x = y, and false otherwise), inequalities, maximum, minimum, average, addition, subtraction, division, count (which counts the number of features satisfying a certain condition), and many more.
Fig. 7.4 The effect of using the product of features in decision tree modeling
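As a small illustration of the product operator (assuming NumPy arrays; the indices i and j are the two original features to combine), a constructed feature x_i * x_j lets a decision tree express a concept such as x_i * x_j > c with a single axis-parallel split, which is the kind of effect Fig. 7.4 depicts:

# Append the product of two original features as a new constructed feature.
import numpy as np

def add_product_feature(X, i, j):
    product = (X[:, i] * X[:, j]).reshape(-1, 1)
    return np.hstack([X, product])         # original features plus the new one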
involved in the experimental study. According to them, ReliefF turned out to be the best option regardless of the particulars of the data, adding that it is a filter with low computational cost. As in the study mentioned before, wrapper approaches have proven to be an interesting choice in some domains, provided they can be applied with the same classifiers and taking into account that they require higher computational costs.
References
22. Kira, K., Rendell, L.A.: A practical approach to feature selection. In: Proceedings of the Ninth
International Workshop on Machine Learning, ML92, pp. 249–256 (1992)
23. Kohavi, R., John, G.: Wrappers for feature subset selection. Artif. Intell. 97(1–2), 273–324
(1997)
24. Koller, D., Sahami, M.: Toward optimal feature selection. In: Proceedings of the Thirteenth
International Conference on Machine Learning, pp. 284–292 (1996)
25. Kononenko, I.: Estimating attributes: Analysis and extensions of relief. In: Proceedings of the
European Conference on Machine Learning on Machine Learning, ECML-94, pp. 171–182
(1994)
26. Kudo, M., Sklansky, J.: Comparison of algorithms that select features for pattern classifiers.
Pattern Recognit. 33(1), 25–41 (2000)
27. Kwak, N., Choi, C.H.: Input feature selection by mutual information based on Parzen window. IEEE Trans. Pattern Anal. Mach. Intell. 24(12), 1667–1671 (2002)
28. Kwak, N., Choi, C.H.: Input feature selection for classification problems. IEEE Trans. Neural
Netw. 13(1), 143–159 (2002)
29. Liu, H., Motoda, H.: Feature Selection for Knowledge Discovery and Data Mining. Kluwer
Academic, USA (1998)
30. Liu, H., Motoda, H., Dash, M.: A monotonic measure for optimal feature selection. In: Proceedings of the European Conference on Machine Learning, Lecture Notes in Computer Science, vol. 1398, pp. 101–106 (1998)
31. Liu, H., Setiono, R.: A probabilistic approach to feature selection - a filter solution. In: Proceedings of the International Conference on Machine Learning (ICML), pp. 319–327 (1996)
32. Liu, H., Sun, J., Liu, L., Zhang, H.: Feature selection with dynamic mutual information. Pattern
Recognit. 42(7), 1330–1339 (2009)
33. Liu, H., Yu, L.: Toward integrating feature selection algorithms for classification and clustering.
IEEE Trans. Knowl. Data Eng. 17(4), 491–502 (2005)
34. Maldonado, S., Weber, R.: A wrapper method for feature selection using support vector
machines. Inf. Sci. 179(13), 2208–2217 (2009)
35. Mitra, P., Murthy, C., Pal, S.: Unsupervised feature selection using feature similarity. IEEE
Trans. Pattern Anal. Mach. Intell. 24(3), 301–312 (2002)
36. Modha, D., Spangler, W.: Feature weighting on k-means clustering. Mach. Learn. 52(3), 217–
237 (2003)
37. Mucciardi, A.N., Gose, E.E.: A Comparison of Seven Techniques for Choosing Subsets of
Pattern Recognition Properties, pp. 1023–1031. IEEE, India (1971)
38. Narendra, P.M., Fukunaga, K.: A branch and bound algorithm for feature subset selection.
IEEE Trans. Comput. 26(9), 917–922 (1977)
39. Nguyen, M., de la Torre, F.: Optimal feature selection for support vector machines. Pattern
Recognit. 43(3), 584–591 (2010)
40. Oh, I.S., Lee, J.S., Moon, B.R.: Hybrid genetic algorithms for feature selection. IEEE Trans.
Pattern Anal. Mach. Intell. 26(11), 1424–1437 (2004)
41. Opitz, D.W.: Feature selection for ensembles. In: Proceedings of the National Conference on
Artificial Intelligence, pp. 379–384 (1999)
42. Peng, H., Long, F., Ding, C.: Feature selection based on mutual information: Criteria of max-
dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell.
27(8), 1226–1238 (2005)
43. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, USA
(1993)
44. Raymer, M., Punch, W., Goodman, E., Kuhn, L., Jain, A.: Dimensionality reduction using
genetic algorithms. IEEE Trans. Evolut. Comput. 4(2), 164–171 (2000)
45. Robnik-Šikonja, M., Kononenko, I.: Theoretical and empirical analysis of ReliefF and RReliefF. Mach. Learn. 53(1–2), 23–69 (2003)
46. Rodriguez-Lujan, I., Huerta, R., Elkan, C., Cruz, C.: Quadratic programming feature selection.
J. Mach. Learn. Res. 11, 1491–1516 (2010)
47. Rokach, L.: Data Mining with Decision Trees: Theory and Applications. Series in Machine
Perception and Artificial Intelligence. World Scientific Publishing, USA (2007)
48. Saeys, Y., Inza, I., Larrañaga, P.: A review of feature selection techniques in bioinformatics.
Bioinformatics 23(19), 2507–2517 (2007)
49. Setiono, R., Liu, H.: Neural-network feature selector. IEEE Trans. Neural Netw. 8(3), 654–662
(1997)
50. Sima, C., Attoor, S., Brag-Neto, U., Lowey, J., Suh, E., Dougherty, E.: Impact of error estimation
on feature selection. Pattern Recognit. 38(12), 2472–2482 (2005)
51. Song, L., Smola, A., Gretton, A., Bedo, J., Borgwardt, K.: Feature selection via dependence
maximization. J. Mach. Learn. Res. 13, 1393–1434 (2012)
52. Sun, Y.: Iterative relief for feature weighting: Algorithms, theories, and applications. IEEE
Trans. Pattern Anal. Mach. Intell. 29(6), 1035–1051 (2007)
53. Sun, Y., Todorovic, S., Goodison, S.: Local-learning-based feature selection for high-
dimensional data analysis. IEEE Trans. Pattern Anal. Mach. Intell. 32(9), 1610–1626 (2010)
54. Swiniarski, R., Skowron, A.: Rough set methods in feature selection and recognition. Pattern
Recognit. Lett. 24(6), 833–849 (2003)
55. Tuv, E., Borisov, A., Runger, G., Torkkola, K.: Feature selection with ensembles, artificial
variables, and redundancy elimination. J. Mach. Learn. Res. 10, 1341–1366 (2009)
56. Uncu, Ö., Türkşen, I.B.: A novel feature selection approach: Combining feature wrappers and filters. Inf. Sci. 177(2), 449–466 (2007)
57. Wang, L.: Feature selection with kernel class separability. IEEE Trans. Pattern Anal. Mach.
Intell. 30(9), 1534–1546 (2008)
58. Wang, X., Yang, J., Teng, X., Xia, W., Jensen, R.: Feature selection based on rough sets and
particle swarm optimization. Pattern Recognit. Lett. 28(4), 459–471 (2007)
59. Wei, H.L., Billings, S.: Feature subset selection and ranking for data dimensionality reduction.
IEEE Trans. Pattern Anal. Mach. Intell. 29(1), 162–166 (2007)
60. Xu, L., Yan, P., Chang, T.: Best first strategy for feature selection. In: Proceedings of the Ninth
International Conference on Pattern Recognition, pp. 706–708 (1988)
61. Zhang, H., Sun, G.: Feature selection using tabu search method. Pattern Recognit. 35(3), 701–
711 (2002)
62. Zhao, Z., Wang, L., Liu, H., Ye, J.: On similarity preserving feature selection. IEEE Trans.
Knowl. Data Eng. 25(3), 619–632 (2013)
63. Zhao, Z., Zhang, R., Cox, J., Duling, D., Sarle, W.: Massively parallel feature selection: An
approach based on variance preservation. Mach. Learn. 92(1), 195–220 (2013)
64. Zhou, L., Wang, L., Shen, C.: Feature selection with redundancy-constrained class separability.
IEEE Trans. Neural Netw. 21(5), 853–858 (2010)
65. Zhu, Z., Ong, Y.S., Dash, M.: Wrapper-filter feature selection algorithm using a memetic
framework. IEEE Trans. Syst. Man Cybern. Part B: Cybern. 37(1), 70–76 (2007)
Chapter 8
Instance Selection
8.1 Introduction
Instance selection (IS) plays a pivotal role in the data reduction task due to the fact that it performs the process complementary to FS. Although it is independent of FS, in most cases both processes are applied jointly. Coping with enormous amounts of data may be achieved by scaling down the data, as an alternative to improving the scaling-up of the DM algorithms. We have previously seen that FS already accomplishes this objective through the removal of irrelevant and unnecessary features. In an orthogonal way, the removal of instances can be equally or even more interesting from the point of view of scaling down the data in certain applications [108].
The major issue in scaling down the data is the selection or identification of relevant data from an immense pool of instances, and then its preparation as input for a DM algorithm. Selection is synonymous with pressure in many scenarios, such as in
The global objective of this chapter can be summarized into three main purposes:
• To provide a unified description for a wide set of IS methods proposed in the
literature.
• To present a complete taxonomy based on the main properties observed in IS. The
taxonomy will allow us to learn the advantages and drawbacks from a theoretical
point of view.
• To report an empirical study for analyzing the methods in terms of accuracy and
reduction capabilities. Our goal is to identify the best methods in each family and
to stress the relevant properties of each one. Moreover, some graphical representations of the main IS methods are depicted.
At first, several proposals for selecting the most relevant data from the training set were made with the KNN algorithm in mind [32]. Later, when the term instance-based learning [1], also known as lazy learning [2], was coined to gather all those methods that do not perform a training phase during learning, the term prototype selection arose in the literature (together with many derivatives, such as prototype reduction, prototype abstraction, prototype generation, etc.). Nowadays, the family of IS methods also includes proposals intended to work with other learning methods, such as decision trees, ANNs or SVMs. However, there was no way of specifying the concrete case in which an IS method is valid and can be applied to any type of DM algorithm (within the same learning paradigm, of course). For this reason, we distinguish between two types of processes: Prototype Selection (PS) [68] and Training Set Selection (TSS) [21, 130]. This section is devoted to detailing and explaining both distinctions as clearly as possible.
First, we provide a more detailed and formal definition of IS. It can be defined as follows: let X_p be an instance where X_p = (X_p1, X_p2, ..., X_pm, X_pc), with X_p belonging to a class c given by X_pc, in an m-dimensional space in which X_pi is the value of the i-th feature of the p-th sample. Then, let us assume that there is a training set TR which consists of N instances X_p and a test set TS composed of t instances X_p. Let S ⊂ TR be the subset of selected samples resulting from the execution of an IS algorithm; then we classify a new pattern from TS using a DM algorithm acting over S. The whole data set is denoted by D and is composed of the union of TR and TS.
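Under this notation, the usual evaluation scheme can be sketched as follows (Python with scikit-learn assumed; a 1NN classifier plays the role of the DM algorithm acting over S):

# Evaluate a selected subset S: train 1NN on S and classify the test set TS,
# reporting accuracy and the reduction rate with respect to |TR|.
from sklearn.neighbors import KNeighborsClassifier

def evaluate_selection(S_X, S_y, TS_X, TS_y, n_tr):
    knn = KNeighborsClassifier(n_neighbors=1).fit(S_X, S_y)
    accuracy = knn.score(TS_X, TS_y)
    reduction = 1.0 - len(S_X) / n_tr      # fraction of TR that was removed
    return accuracy, reduction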
PS methods [68] are IS methods which aim to find training sets offering the best classification accuracy and reduction rates when using instance-based classifiers that consider a certain similarity or distance measure. Recently, PS methods have increased in popularity within the data reduction field. Various approaches to PS algorithms have been proposed in the literature (see [68, 88, 127, 167] for reviews). Figure 8.1 shows the basic steps of the PS process.
(Figures 8.1 and 8.2 depict these processes: the training data set (TR) is fed to the prototype selection algorithm, which outputs the instances selected (S); S is then used either by an instance-based classifier or by a DM algorithm that builds a model, both evaluated on the test data set (TS).)
TSS methods are defined in a similar way. They are known as the application of IS methods over the training set used to build any predictive model. Thus, TSS can be employed as a way to improve the behavior of predictive models, their precision and their interpretability [135]. Figure 8.2 shows the basic steps of processing a decision tree (C4.5) on the TSS. Among others, TSS has been applied with ANNs [51, 94, 160], SVMs [31] and decision trees [21, 85], and even in other learning paradigms such as regression [8, 154], time series forecasting [79, 170], subgroup discovery [22, 23], imbalanced learning [11, 65, 66, 75, 110], multiple-instance learning [30, 58], and semi-supervised learning [80].
Due to the fact that most research efforts are devoted to PS, we focus the rest of this chapter on this issue. It is very arduous to give an exact number of proposals belonging specifically to each of the two families mentioned above. When an IS method is proposed, the first step is usually to improve instance-based learners. When tackling TSS, the type of learning method is usually fixed to be combined with the instance selector. Few methods are proposed with both processes in mind, although our experience allows us to suggest that any filter should work well for any DM model, mainly due to the low reduction rates achieved and the efforts these kinds of methods make in removing noise from the data. A rough estimate of the proposals reported in the specialized literature specifically intended for TSS is around 10 % of the total number of techniques. Even though PS monopolizes almost all efforts in IS, TSS currently shows an upward trend.
This section presents the taxonomy of PS methods and the criteria used for building it. First, in Sect. 8.3.1, the main characteristics which will define the categories of the taxonomy are outlined. In Sect. 8.3.2, we briefly enumerate all the PS methods proposed in the literature; the complete and abbreviated names are given together with the references. Finally, Sect. 8.3.3 presents the taxonomy.
This section provides a framework for the discussion of the PS methods presented in the next subsection. The issues discussed include the order of the search, the type of selection and the evaluation of the search. These issues are involved in the definition of the taxonomy, since they are exclusive to the operation of the PS algorithms. Other classifier-dependent issues, such as distance functions or exemplar representation, will also be presented. Finally, some criteria will be pointed out in order to compare PS methods.
When searching for a subset S of prototypes to keep from training set T R, there are
a variety of directions in which the search can proceed:
• Incremental: An incremental search begins with an empty subset S, and adds each
instance in T R to S if it fulfills some criteria. In this case, the algorithm depends
on the order of presentation and this factor could be very important. Under such a
scheme, the order of presentation of instances in T R should be random because by
This factor is mainly conditioned by the type of search carried out by the PS algo-
rithms, whether they seek to retain border points, central points or some other set of
points.
• Condensation: This set includes the techniques which aim to retain the points
which are closer to the decision boundaries, also called border points. The intuition
behind retaining border points is that internal points do not affect the decision
boundaries as much as border points, and thus can be removed with relatively
little effect on classification. The idea is to preserve the accuracy over the training
set, but the generalization accuracy over the test set can be negatively affected.
Nevertheless, the reduction capability of condensation methods is normally high
due to the fact that there are fewer border points than internal points in most of the
data.
• Edition: These kinds of algorithms instead seek to remove border points. They
remove points that are noisy or do not agree with their neighbors. This removes
boundary points, leaving smoother decision boundaries behind. However, such
algorithms do not remove internal points that do not necessarily contribute to
the decision boundaries. The effect obtained is related to the improvement of
generalization accuracy in test data, although the reduction rate obtained is low.
• Hybrid: Hybrid methods try to find the smallest subset S which maintains or even increases the generalization accuracy on test data. To achieve this, they allow the removal of internal and border points based on the criteria followed by the two previous strategies. The KNN classifier is highly adaptable to these methods, obtaining great improvements even with a very small subset of selected instances.
KNN is a simple technique and it can be used to direct the search of a PS algorithm. The objective pursued is to make a prediction on a non-definitive selection and to compare selections. This characteristic influences the quality criterion and it can be divided into:
• Filter: When the kNN rule is used on partial data to determine the criteria for adding or removing instances, and no leave-one-out validation scheme is used to obtain a good estimation of the generalization accuracy. Using subsets of the training data in each decision increases the efficiency of these methods, but the accuracy may not be enhanced.
• Wrapper: When the kNN rule is used on the complete training set with a leave-one-out validation scheme. The combination of these two factors yields a good estimation of the generalization accuracy, which helps to obtain better accuracy over test data. However, each decision involves a complete computation of the kNN rule over the training set, and the learning phase can be computationally expensive.
When comparing PS methods, there are a number of criteria that can be used to
evaluate the relative strengths and weaknesses of each algorithm. These include
storage reduction, noise tolerance, generalization accuracy and time requirements.
• Storage reduction: One of the main goals of the PS methods is to reduce storage
requirements. Furthermore, another goal closely related to this is to speed up
classification. A reduction in the number of stored instances will typically yield a
corresponding reduction in the time it takes to search through these examples and
classify a new input vector.
• Noise tolerance: Two main problems may occur in the presence of noise. The first
is that very few instances will be removed because many instances are needed to
maintain the noisy decision boundaries. Secondly, the generalization accuracy can
suffer, especially if noisy instances are retained instead of good instances.
• Generalization accuracy: A successful algorithm will often be able to significantly
reduce the size of the training set without significantly reducing generalization
accuracy.
• Time requirements: Usually, the learning process is done just once on a training set, so time does not seem to be a very important evaluation criterion. However, if the learning phase takes too long, it can become impractical for real applications.
Almost 100 PS methods have been proposed in the literature. This section is devoted to enumerating and designating them according to the standard followed in this chapter. For more details on their descriptions and implementations, the reader can refer to the next section of this chapter. Implementations of some of the algorithms in Java can be found in the KEEL software [3, 4], described in Chap. 10 of this book.
Table 8.1 presents an enumeration of the PS methods reviewed in this chapter. The complete name, abbreviation and reference are provided for each one. When there is more than one method in a row, they were proposed together and the best performing method (as indicated by the respective authors) is depicted in bold.
The properties studied above can be used to categorize the PS methods proposed in the literature. The direction of the search, the type of selection and the evaluation of the search may differ among PS methods and constitute a set of properties which are exclusive to the way PS methods operate. This section presents the taxonomy of PS methods based on these properties.
Figure 8.3 illustrates the categorization following a hierarchy based on this order: type of selection, direction of search and evaluation of the search. It allows us to distinguish among families of methods and to estimate the size of each one.
One of the objectives of this chapter is to highlight the best methods depending on their properties, being conscious that these properties could determine the suitability of a specific scheme. To do this, in Sect. 8.6, we will conclude which methods perform best for each family, considering several performance metrics.
This set includes the techniques which aim to retain the points which are closer to
the decision boundaries, also called border points.
Considering their search direction they can be classified as:
8.4.1.1 Incremental
8.4.1.2 Decremental
selected consistent set to determine the NUN distance thresholds for each sample in the given set.
10. Repeat Steps 3 through 8 using all the samples in the given set to identify a new consistent set. This process of recursive application of Steps 2 through 8 is continued until the selected set no longer gets smaller. It is easy to see that under this procedure the final subset remains consistent, i.e., it is able to classify all samples in the original set correctly.
• Modified Selective Subset (MSS) [9]—Let R_i be the set of all X_j in TR such that X_j is of the same class as X_i and is closer to X_i than the nearest neighbor of X_i in TR of a different class. Then, MSS is defined as the subset of TR containing, for every X_i in TR, the element of its R_i that is nearest to a sample of a different class than that of X_i.
8.4.1.3 Batch
These algorithms edit out noisy instances as well as instances close to the class borders, leaving smoother decision boundaries. They also retain all internal points, because an internal instance is likely to be labeled with the same class as its neighbors.
Considering their search direction, they can be classified as:
8.4.2.1 Decremental
not agree with all of its k + l nearest neighbors, where l counts all the instances in S which are at the same distance as the last neighbor of X_i.
In addition, MENN works with a prefixed set of pairs (k, k′): k is employed as the number of neighbors involved in the editing process, and k′ is employed to validate the edited set S obtained. The best pair found is employed for the final reference set (if two or more sets are found to be optimal, then all of them are employed in the classification of the test instances, and a majority rule is used to decide the output of the classifier in this case).
• Nearest Centroid Neighbor Edition (NCNEdit) [140]—The NCN Editing algorithm applies the NCN classification rule to perform an edition process over the training set TR.
Having defined the NCN scheme, the editing process consists of setting S = TR and discarding from S every prototype misclassified by the NCN rule.
8.4.2.2 Batch
Hybrid methods try to find the smallest subset S which maintains or even increases the generalization accuracy on test data. To do so, they allow the removal of both internal and border points.
8.4.3.1 Incremental
IB2
–It starts with S initially empty, and each instance in TR is added to S if it is not classified correctly by the instances already in S (with the first instance always added).
IB2 is similar to CNN, except that IB2 does not seed S with one instance of each class and does not repeat the process after the first pass through the training set.
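A minimal sketch of this incremental pass is given below (Python/NumPy, Euclidean distance and numeric features assumed); it is only an illustration of the rule described above:

# IB2 sketch: an instance joins S only when the instances already in S
# misclassify it with the 1NN rule (the first instance is always added).
import numpy as np

def ib2(X, y):
    S_idx = [0]
    for i in range(1, len(X)):
        dists = np.linalg.norm(X[S_idx] - X[i], axis=1)
        nearest = S_idx[int(np.argmin(dists))]
        if y[nearest] != y[i]:             # misclassified by the current S
            S_idx.append(i)
    return X[S_idx], y[S_idx]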
IB3
8.4.3.2 Decremental
8.4.3.3 Filter
• Variable Similarity Metric (VSM) [111]—In order to reduce storage and remove noisy instances, an instance X_i is removed if all k of its neighbors are of the same class, even if they are of a different class than X_i. The instance is only removed if its neighbors are at least 60 % sure of their classification. The VSM method typically uses a fairly large k (e.g., k = 10).
• Decremental Reduction Optimization Procedure Family (DROP) [166]—In order to present these reduction techniques, we need to define some concepts. Each instance X_i has k nearest neighbors, where k is typically a small odd integer. X_i also has a nearest enemy, which is the nearest instance with a different output class. Those instances that have X_i as one of their k nearest neighbors are called associates of X_i.
DROP1
–Uses the following basic rule to decide if it is safe to remove an instance from the instance set S (where S = TR originally):
Remove X_i if at least as many of its associates in S would be classified correctly without X_i.
The algorithm DROP1 proceeds as follows.
DROP1 (Training set TR): Selection set S.
  Let S = TR.
  For each instance X_i in S:
    Find the k + 1 nearest neighbors of X_i in S.
    Add X_i to each of its neighbors' lists of associates.
  For each instance X_i in S:
    Let with = # of associates of X_i classified correctly with X_i as a neighbor.
    Let without = # of associates of X_i classified correctly without X_i.
    If without ≥ with
      Remove X_i from S.
      For each associate a of X_i
        Remove X_i from a's list of neighbors.
        Find a new nearest neighbor for a.
        Add a to its new neighbor's list of associates.
      For each neighbor b of X_i
        Remove X_i from b's list of associates.
    Endif
  Return S.
DROP2
Using this modification, each instance X_i in the original training set TR continues to maintain a list of its k + 1 nearest neighbors in S, even after X_i is removed from S. This means that instances in S have associates that are both in and out of S, while instances that have been removed from S have no associates.
DROP2 also changes the order of removal of instances. It initially sorts the instances in S by the distance to their nearest enemy. Instances are then checked for removal beginning at the instance furthest from its nearest enemy.
DROP3
Temp = S \ {X_i}
If |S| ≤ A then
  go_on = 0
If |Temp| < |S| and go_on then
  S = Temp
  i = i + 1
else
  go_on = 0
– Thin-out selection phase (THIN). It thins out points that are not important to
the decision boundary of the resulting 1-NN rule. This phase aims at selecting
a small number of instances without negatively affecting the 1NN empirical
error.
S_f = { x ∈ S with in-degree in G_bc^S > 0 }
S_prev = S
S_1 = S \ S_f
go_on = 1
While go_on do
  S_t = { a ∈ S_1 with in-degree in G_bc^{S_prev} > θ
          and with in-degree in G_bc^{S_prev} or in G_wc^{S_prev} > 0 }
  go_on = |S_f ∪ S_t| < |S_f|
  If go_on then
    S_f = S_f ∪ S_t
    S_prev = S_1
    S_1 = S \ S_f
8.4.3.4 Wrapper
8.4.3.5 Batch
• Iterative Case Filtering (ICF) [15]—ICF defines the local set L(x_i), which contains all cases inside the largest hypersphere centered on x_i such that the hypersphere contains only cases of the same class as x_i. The authors define two properties, reachability and coverage.
In the first phase, ICF uses the ENN algorithm to remove noise from the training set. In the second phase, the ICF algorithm removes each instance X_i for which Reachability(X_i) is bigger than Coverage(X_i). This procedure is repeated for each instance in TR. After that, ICF recalculates the reachability and coverage properties and restarts the second phase (as long as any progress is observed). A sketch of this removal rule is given after this list.
• Hit-Miss Network Algorithms (HMN) [116]—Hit-Miss Networks are directed graphs where the points are the instances in TR and the edges are connections between an instance and its nearest neighbor from each class. The edge connecting an instance with a neighbor of its same class is called a "Hit", and the rest of its edges are called "Miss".
The HMNC method builds the network and removes all nodes that are not connected. The rest of the nodes are employed to build the final S set.
Hit Miss Network Edition (HMNE): Starting from the output of HMNC, HMNE applies these four rules to prune the network:
1. Every point with more "Miss" edges than "Hit" edges is flagged for removal.
2. If the number of non-flagged points of a class is too low, points of those classes with at least one "Hit" are unflagged.
3. If there are more than three classes, some points of each class with a low number of "Miss" edges are unflagged.
4. Points which are the "Hit" of 25 % or more of the instances of a class are unflagged.
Finally, the instances which remain flagged for removal are deleted from the network in order to build the final S set.
Hit Miss Network Edition Iterative (HMNEI): The HMNE method can be employed iteratively until the generalization accuracy of 1NN on the original training set with the reduced set decreases.
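As announced in the ICF description above, a rough sketch of its removal rule is given here (Python/NumPy assumed); local sets are computed as the same-class cases closer than the nearest enemy, reachability is taken as the local set of a case and coverage as the set of cases whose local sets contain it:

# One ICF filtering pass: remove every instance whose reachability (local
# set size) exceeds its coverage (how many local sets it belongs to).
import numpy as np

def icf_pass(X, y):
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    local = []
    for i in range(n):
        enemy_dist = dists[i][y != y[i]].min()             # nearest enemy distance
        local.append(set(np.where((dists[i] < enemy_dist) & (y == y[i]))[0]))
    coverage = [sum(i in local[j] for j in range(n)) for i in range(n)]
    keep = [i for i in range(n) if len(local[i]) <= coverage[i]]
    return X[keep], y[keep]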
8.4.3.6 Mixed+Wrapper
weaving of the global and local search phases allows the two to influence each other; i.e., the SSGA chooses good starting points, and the local search provides an accurate representation of that region of the domain. A brief pseudocode of SSMA is shown as follows:
Initialize population
While (not termination-condition) do
  Use binary tournament to select two parents
  Apply crossover operator to create offspring (Off_1, Off_2)
  Apply mutation to Off_1 and Off_2
  Evaluate Off_1 and Off_2
  For each Off_i
    Invoke the adaptive P_LS mechanism to obtain P_LS_i for Off_i
    If u(0, 1) < P_LS_i then
      Perform meme optimization for Off_i
    End if
  End for
  Employ standard replacement for Off_1 and Off_2
End while
Return the best chromosome
8.4.3.7 Fixed+Wrapper
Research on enhancing instance selection and PS through other data reduction and learning methods is common and in high demand nowadays. PS can represent a feasible and promising technique for obtaining the expected results, which justifies its relationship to other methods and problems. This section provides a wide review of other topics closely related to PS and describes other works and future trends which have been studied in the last few years. In each subsection, we provide a table that enumerates, in a non-exhaustive way, the most relevant methods and papers on each topic. Although we do not describe them in detail, they are included for informative purposes for the interested reader.
Prototype generation methods are not limited to selecting examples from the training set. They can also modify the values of the samples, changing their position in the d-dimensional space considered. Most of them use merging or divide-and-conquer strategies to set new artificial samples [27], or are based on clustering approaches [12], LVQ hybrids [98], advanced proposals [102, 113] and schemes based on evolutionary algorithms [25, 151, 152]. A complete survey on this topic can be found in [153].
Table 8.2 itemizes the main prototype generation methods proposed in the literature.
This area refers to the combination of IS and PS methods with other well-known schemes used for improving accuracy in classification problems. For example, the weighting scheme combines PS with FS [40, 147] or with Feature Weighting [55, 129, 163], where a vector of weights associated with each attribute determines and influences the distance computations.
Several distance metrics have been used with kNN and PS, especially when working with categorical attributes [166]. There are some PS approaches which learn not only the subset of selected prototypes, but also the distance metric employed [59, 128]. PS is also suitable for use with other types of dissimilarity-based classifiers [95, 131].
Table 8.3 enumerates the main advances in these topics proposed in the literature.
On the one hand, this family includes all the methods which simultaneously use instances and rules in order to compute the classification of a new object. If the values of the object are within the range of a rule, its consequent predicts the class; otherwise, if no rule matches the object, the most similar rule or instance stored in the data base is used to estimate the class. Similarity is viewed as the closest rule or instance based on a distance measure. In short, these methods can generalize an instance into a hyperrectangle or rule [50, 67, 114].
On the other hand, this area also refers to ensemble learning, where an IS method is run several times and a classification decision is made according to the majority class obtained over several subsets and a performance measure given by a learner [5, 71].
Table 8.4 specifies the proposals on these issues found in the literature.
One of the disadvantages of IS methods is that most of them exhibit a prohibitive run time or cannot even be applied to large data sets. Recent improvements in this field cover the stratification of data [20, 70, 81] and the development of distributed approaches for PS [6].
Table 8.5 summarizes the research work done on scaling up IS.
This area studies the effect on the complexity of the data when PS methods are applied prior to classification [96], and how to make a useful diagnosis of the benefits of applying PS methods taking into account the complexity of the data [64, 119].
Table 8.6 collects the developments in data complexity related to IS found in the specialized literature.
The aim of this section is to show all the factors and issues related to the experimental
study. We specify the data sets, validation procedure, parameters of the algorithms,
performance metrics and PS methods involved in the analysis. The experimental
conditions were discussed in Chap. 2 of this book. The data sets used are summarized
in Table 8.7.
The data sets considered are partitioned using the 10-FCV procedure. The parameters of the PS algorithms are those recommended by their respective authors; we assume that these parameter values were optimally chosen by them. Nevertheless, in the PS methods that require the number of neighbors as a parameter, its value coincides with the k value of the KNN rule used afterwards. All edition methods, however, consider a minimum of 3 nearest neighbors to operate (as recommended in [165]), although they were applied to a 1NN classifier. The Euclidean distance is chosen as the distance metric because it is well known and the most widely used for KNN. All probabilistic methods (including incremental methods, which depend on the order of instance presentation) are run three times, and the final results correspond to the average performance values of these runs.
Thus, the empirical study involves 42 PS methods from those listed in Table 8.1. We want to point out that the implementations are based only on the descriptions and specifications given by the respective authors in their papers. No advanced data structures or enhancements for improving the efficiency of PS methods have been used. All methods (including the slowest ones) are available in the KEEL software [3].
Table 8.8 presents the average results obtained by the PS methods over the 39 small size data sets. Red. denotes the reduction rate achieved; tst Acc. and tst Kap. denote the accuracy and kappa obtained on test data, respectively; Acc. ∗ Red. and Kap. ∗ Red. correspond to the product of accuracy/kappa and reduction rate, which is an estimator of how good a PS method is considering the tradeoff between reduction and classification success rate. Finally, Time denotes the average time elapsed in seconds to complete a run of a PS method.1 In the case of 1NN, the time required is not displayed due to the fact that no PS stage is run beforehand. For each type of result, the algorithms are ordered from best to worst. Algorithms highlighted in bold are those which obtain
1 The machine used was an Intel Core i7 CPU 920 at 2.67GHz with 4GB of RAM.
the best result within their corresponding family, according to the taxonomy illustrated in Fig. 8.3. They will make up the experimental study on medium size data sets, shown in the next subsection.
The Wilcoxon test [39, 63, 164] is adopted considering a level of significance of α = 0.1. Table 8.9 shows a summary of all the possible comparisons employing the Wilcoxon test among all PS methods over the small data sets. This table collects the statistical comparisons for the four main performance measures used in this chapter: tst Acc., tst Kap., Acc. ∗ Red. and Kap. ∗ Red. Table 8.9 shows, for each method in a row, the number of PS methods outperformed using the Wilcoxon test, under the column represented by the '+' symbol. The column with the '±' symbol indicates
Table 8.8 Average results obtained by the PS methods over small data sets
Red.    tst Acc.    tst Kap.    Acc. ∗ Red.    Kap. ∗ Red.    Time
Explore 0.9789 CHC 0.7609 SSMA 0.5420 CHC 0.7399 CHC 0.5255 1NN –
CHC 0.9725 SSMA 0.7605 CHC 0.5404 SSMA 0.7283 SSMA 0.5190 CNN 0.0027
NRMCS 0.9683 GGA 0.7566 GGA 0.5328 Explore 0.7267 GGA 0.5014 POP 0.0091
SSMA 0.9576 RNG 0.7552 RMHC 0.5293 GGA 0.7120 RMHC 0.4772 PSC 0.0232
GGA 0.9411 RMHC 0.7519 HMNEI 0.5277 RMHC 0.6779 Explore 0.4707 ENN 0.0344
RNN 0.9187 MoCS 0.7489 RNG 0.5268 RNN 0.6684 RNN 0.4309 IB3 0.0365
CCIS 0.9169 ENN 0.7488 MoCS 0.5204 NRMCS 0.6639 CoCoIS 0.4294 MSS 0.0449
IGA 0.9160 NCNEdit 0.7482 NCNEdit 0.5122 IGA 0.6434 IGA 0.4080 Multiedit 0.0469
CPruner 0.9129 AllKNN 0.7472 ENN 0.5121 CoCoIS 0.6281 MCNN 0.4051 ENNTh 0.0481
MCNN 0.9118 HMNEI 0.7436 AllKNN 0.5094 MCNN 0.6224 NRMCS 0.3836 FCNN 0.0497
RMHC 0.9015 ENNTh 0.7428 CoCoIS 0.4997 CCIS 0.6115 DROP3 0.3681 MoCS 0.0500
CoCoIS 0.8594 Explore 0.7424 ENNTh 0.4955 CPruner 0.6084 CCIS 0.3371 MCNN 0.0684
DROP3 0.8235 MENN 0.7364 1NN 0.4918 DROP3 0.5761 IB3 0.3248 MENN 0.0685
SNN 0.7519 1NN 0.7326 POP 0.4886 IB3 0.4997 CPruner 0.3008 AllKNN 0.0905
ICF 0.7160 CoCoIS 0.7309 MENN 0.4886 ICF 0.4848 ICF 0.2936 TRKNN 0.1040
IB3 0.7114 POP 0.7300 Explore 0.4809 PSC 0.4569 HMNEI 0.2929 CCIS 0.1090
PSC 0.7035 RNN 0.7276 Multiedit 0.4758 TCNN 0.4521 TCNN 0.2920 HMNEI 0.1234
SVBPS 0.6749 Multiedit 0.7270 MSS 0.4708 FCNN 0.4477 FCNN 0.2917 ENRBF 0.1438
Shrink 0.6675 MSS 0.7194 RNN 0.4691 SVBPS 0.4448 MNV 0.2746 PSRCG 0.1466
TCNN 0.6411 FCNN 0.7069 FCNN 0.4605 SNN 0.4324 CNN 0.2631 CPruner 0.1639
FCNN 0.6333 MCS 0.7060 IB3 0.4566 MNV 0.4266 SVBPS 0.2615 ICF 0.1708
(continued)
the number of wins and ties obtained by the method in the row. The maximum value
for each column is highlighted in bold.
Observing Tables 8.8 and 8.9, we can point out the best performing PS methods:
• In condensation incremental approaches, all methods are very similar in behavior, except PSC, which obtains the worst results. FCNN can be highlighted for its accuracy/kappa performance and MCNN for its reduction rate, with a low decrease in efficacy.
• Two methods can be emphasized from the condensation decremental family: RNN and MSS. RNN obtains good reduction rates and accuracy/kappa performance, whereas MSS also offers good performance. RNN has the drawback of being quite slow.
• In general, the best condensation methods in terms of efficacy are the decremental ones, but their main drawback is that they require more computation time. The POP and MSS methods are the best performing in terms of accuracy/kappa, although the reduction rates are low, especially those achieved by POP. However, no condensation method is more accurate than 1NN.
• With respect to edition decremental approaches, few differences can be observed. ENN, RNGE and NCNEdit obtain the best results in accuracy/kappa, and MENN and ENNTh offer a good tradeoff considering the reduction rate. Multiedit and ENRBF are not on par with their competitors and are below 1NN in terms of accuracy.
• AllKNN and MoCS, in edition batch approaches, achieve similar results to the
methods belonging to the decremental family. AllKNN achieves better reduction
rates.
• Within the hybrid decremental family, three methods deserve mention: DROP3,
CPruner and NRMCS. The latter is the best, but curiously, its time complexity
rapidly increases in the presence of larger data sets and it cannot tackle medium size
data sets. DROP3 is more accurate than CPruner, which achieves higher reduction
rates.
• Considering the hybrid mixed+wrapper methods, SSMA and CHC techniques
achieve the best results.
• Remarkable methods belonging to the hybrid family are DROP3, CPruner, HMNEI,
CCIS, SSMA, CHC and RMHC. Wrapper based approaches are slower.
• The best global methods in terms of accuracy or kappa are MoCS, RNGE and
HMNEI.
• The best global methods considering the tradeoff reduction-accuracy/kappa are
RMHC, RNN, CHC, Explore and SSMA.
This section presents the study and analysis of medium size data sets and the best
PS methods per family, which are those highlighted in bold in Table 8.8. The goal
pursued is to study the effect of scaling up the data in PS methods. Table 8.10 shows
the average results obtained in the distinct performance measures considered (it
follows the same format as Table 8.8) and Table 8.11 summarizes the Wilcoxon test
results over medium data sets.
We can analyze several details from the results collected in Tables 8.10 and 8.11:
• Five techniques outperform 1NN in terms of accuracy/kappa over medium data
sets: RMHC, SSMA, HMNEI, MoCS and RNGE. Two of them are edition schemes
(MoCS and RNGE) and the rest are hybrid schemes. Again, no condensation
method is more accurate than 1NN.
• Some methods present clear differences when dealing with larger data sets. This is the case for AllKNN, MENN and CHC. The first two tend to try new reduction passes in the edition process, which goes against the interests of accuracy and kappa, and in medium size problems this fact is more noticeable. Furthermore, CHC loses the balance between reduction and accuracy when the data size increases, due to the fact that the reduction objective becomes easier.
• There are some techniques whose run time could be prohibitive when the data scales up. This is the case for RNN, RMHC, CHC and SSMA.
• The best methods in terms of accuracy or kappa are RNGE and HMNEI.
• The best methods considering the tradeoff reduction-accuracy/kappa are RMHC,
RNN and SSMA.
Table 8.10 Average results obtained by the best PS methods per family over medium data sets
Red.    tst Acc.    tst Kap.    Acc. ∗ Red.    Kap. ∗ Red.    Time
MCNN 0.9914 RMHC 0.8306 RMHC 0.6493 SSMA 0.8141 SSMA 0.6328 1NN –
CHC 0.9914 SSMA 0.8292 SSMA 0.6446 CHC 0.8018 CHC 0.6006 POP 0.1706
SSMA 0.9817 RNG 0.8227 HMNEI 0.6397 RNN 0.7580 RMHC 0.5844 CNN 1.1014
CCIS 0.9501 HMNEI 0.8176 ModelCS 0.6336 RMHC 0.7476 RNN 0.5617 FCNN 3.2733
RNN 0.9454 ModelCS 0.8163 RNG 0.6283 GGA 0.7331 GGA 0.5513 MCNN 4.4177
GGA 0.9076 CHC 0.8088 1NN 0.6181 CCIS 0.6774 IB3 0.4615 IB3 6.6172
RMHC 0.9001 GGA 0.8078 POP 0.6143 CPruner 0.6756 FCNN 0.4588 MSS 7.9165
DROP3 0.8926 1NN 0.8060 MSS 0.6126 MCNN 0.6748 DROP3 0.4578 CCIS 12.4040
CPruner 0.8889 AllKNN 0.8052 GGA 0.6074 DROP3 0.6635 CPruner 0.4555 ModelCS 15.4658
ICF 0.8037 POP 0.8037 CHC 0.6058 IB3 0.6144 CCIS 0.5579 AllKNN 24.6167
IB3 0.7670 RNN 0.8017 FCNN 0.6034 FCNN 0.6052 CNN 0.4410 HMNEI 28.9782
FCNN 0.7604 IB3 0.8010 IB3 0.6018 CNN 0.5830 MCNN 0.4295 CPruner 35.3761
CNN 0.7372 MSS 0.8008 CNN 0.5982 ICF 0.5446 Reconsistent 0.3654 MENN 37.1231
Reconsistent 0.6800 FCNN 0.7960 AllKNN 0.5951 Reconsistent 0.5101 MSS 0.3513 ICF 93.0212
MSS 0.5735 CNN 0.7909 RNN 0.5941 MSS 0.4592 HMNEI 0.3422 DROP3 160.0486
HMNEI 0.5350 MENN 0.7840 MENN 0.5768 HMNEI 0.4374 ICF 0.3337 Reconsistent 1,621.7693
MENN 0.3144 CPruner 0.7600 Reconsistent 0.5373 MENN 0.2465 MENN 0.1814 RNG 1,866.7751
AllKNN 0.2098 Reconsistent 0.7501 DROP3 0.5129 AllKNN 0.1689 AllKNN 0.1248 SSMA 6,306.6313
RNG 0.1161 DROP3 0.7433 CPruner 0.5124 RNG 0.0955 RNG 0.0729 CHC 6,803.7974
POP 0.0820 CCIS 0.7130 CCIS 0.4714 POP 0.0659 POP 0.0504 RMHC 12,028.3811
ModelCS 0.0646 MCNN 0.6806 MCNN 0.4332 ModelCS 0.0527 ModelCS 0.0409 GGA 21,262.6911
1NN – ICF 0.6776 ICF 0.4152 1NN – 1NN – RNN 24,480.0439
• For the tradeoff reduction-accuracy rate: the algorithms with the best behavior
are RMHC and SSMA. However, these methods achieve their improvement in the
accuracy rate at the expense of a high computational cost. DROP3 and CCIS, in
contrast, sacrifice some accuracy in exchange for a much lower time complexity.
• If the interest is the accuracy rate: in this case, the best results are achieved
with RNGE as an editor and HMNEI as a hybrid method.
• When the key factor is condensation: FCNN stands out, being one of the fastest.
This section is devoted to illustrating the subsets selected by some of the PS
algorithms considered in our study. To do this, we focus on the banana data set, which
contains 5,300 examples in the complete set (Figs. 8.4 and 8.5). It is an artificial data
set of 2 classes composed of three well-defined clusters of instances of class −1
and two clusters of class 1. Although the borders among the clusters are clear,
there is a high overlap between the two classes. The complete data set is illustrated in
Fig. 8.4a.
Fig. 8.4 Data subsets in banana data set (1). a Banana original (0.8751, 0.7476). b CNN
(0.7729, 0.8664, 0.7304). c FCNN (0.8010, 0.8655, 0.7284). d IB3 (0.8711, 0.8442, 0.6854).
e DROP3 (0.9151, 0.8696, 0.7356). f ICF (0.8635, 0.8081, 0.6088)
Fig. 8.5 Data subsets in banana data set (2). a RNGE (0.1170, 0.8930, 0.7822). b AllKNN
(0.1758, 0.8934, 0.7831). c CPruner (0.8636, 0.8972, 0.7909). d HMNEI (0.3617, 0.8906,
0.7787). e RMHC (0.9000, 0.8972, 0.7915). f SSMA (0.9879, 0.8964, 0.7900)
The pictures of the subsets selected by some PS methods help to visualize and
understand their way of working and the results obtained in the experimental
study. For each method, the reduction rate, the test accuracy and the test kappa
registered in the experimental study are given in this order. For the original data set,
the two values indicated correspond to the accuracy and kappa obtained with 1NN (Fig. 8.4a).
Chapter 9
Discretization
9.1 Introduction
As mentioned in the introduction of this book, data usually comes in different
formats, such as discrete, numerical, continuous, categorical, etc. Numerical data,
given by discrete or continuous values, is assumed to be ordinal: there is
an order among the values. In categorical data, however, no order can be assumed
among the values. The domain and type of data are crucial to the learning task to be
performed next. For example, in a decision tree induction process a feature must
be chosen from a subset based on some gain metric associated with its values. This
process usually requires inherently finite values and also prefers to branch on
values that are not ordered. Obviously, the tree structure is finite and the feature
must be split to produce the associated nodes in further divisions. If
the data is continuous, the features need to be discretized either before the decision
tree induction or throughout the process of tree modelling.
Discretization, as one of the basic data reduction techniques, has received increas-
ing research attention in recent years [75] and has become one of the preprocessing
techniques most broadly used in DM. The discretization process transforms quanti-
tative data into qualitative data, that is, numerical attributes into discrete or nominal
attributes with a finite number of intervals, obtaining a non-overlapping partition of a
continuous domain. An association between each interval and a discrete numerical
value is then established. In practice, discretization can be viewed as a data
reduction method since it maps data from a huge spectrum of numeric values to a
greatly reduced subset of discrete values. Once the discretization is performed, the data can
be treated as nominal data during any induction or deduction DM process. Many
existing DM algorithms are designed only to learn in categorical data, using nom-
inal attributes, while real-world applications usually involve continuous features.
Numerical features have to be discretized before using such algorithms.
In supervised learning, and specifically classification, the topic of this survey,
we can define the discretization as follows. Assuming a data set consisting of N
examples and C target classes, a discretization algorithm would discretize the contin-
uous attribute A in this data set into m discrete intervals D = {[d0 , d1 ], (d1 , d2 ], . . . ,
(dm−1, dm]}, where d0 is the minimal value, dm is the maximal value and di < di+1,
for i = 0, 1, . . . , m − 1. Such a discrete result D is called a discretization scheme
on attribute A and P = {d1 , d2 , . . . , dm−1 } is the set of cut points of attribute A.
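To make the notation above concrete, the following minimal Python sketch (illustrative code only; the function and variable names are our own, not part of the software associated with this book) maps the continuous values of an attribute to the indices of the intervals induced by a discretization scheme with cut points P.

```python
from bisect import bisect_left

def apply_scheme(values, cut_points):
    """Map each continuous value to the index of its interval.

    With cut points P = {d1, ..., d(m-1)}, a value v is assigned to
    interval i when d(i) < v <= d(i+1); interval 0 also contains the
    minimum value d0 itself.
    """
    # bisect_left returns the number of cut points strictly smaller
    # than v, which is exactly the interval index under the
    # right-closed convention described above.
    return [bisect_left(sorted(cut_points), v) for v in values]

# Attribute A ranging over [0, 10] with P = {3.0, 7.0}, i.e. m = 3
# intervals [0, 3], (3, 7], (7, 10].
print(apply_scheme([0.5, 3.0, 4.2, 7.0, 9.9], [3.0, 7.0]))  # [0, 0, 1, 1, 2]
```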
The need for discretization can arise from several factors. Many DM algorithms
are primarily oriented to handle nominal attributes [36, 75, 123], or may even only
deal with discrete attributes. For instance, three of the methods considered as the
top ten in DM [120] require an embedded or an external discretization of data:
C4.5 [92], Apriori [1] and Naïve Bayes [44, 122]. Even with algorithms that are
able to deal with continuous data, learning is less efficient and effective [29, 94].
Other advantages derived from discretization are the reduction and simplification
of the data, making learning faster and yielding more accurate, compact and shorter
results; in addition, any noise possibly present in the data is reduced.
For both researchers and practitioners, discrete attributes are easier to understand,
use, and explain [75]. Nevertheless, any discretization process generally leads to a
loss of information, making the minimization of such information loss the main goal
of a discretizer.
Obtaining the optimal discretization is NP-complete [25]. A vast number of dis-
cretization techniques can be found in the literature. It is obvious that when dealing
with a concrete problem or data set, the choice of a discretizer will condition the
success of the subsequent learning task in accuracy, simplicity of the model, etc. Different
heuristic approaches have been proposed for discretization, for example, approaches
based on information entropy [36, 41], the statistical χ2 test [68, 76], likelihood [16,
119], rough sets [86, 124], etc. Other criteria have been used in order to provide a
classification of discretizers, such as univariate/multivariate, supervised/unsupervised,
top-down/bottom-up, global/local, static/dynamic and more. All these criteria are
the basis of the taxonomies already proposed and they will be deeply elaborated
upon in this chapter. The identification of the best discretizer for each situation is a
very difficult task to carry out, but performing exhaustive experiments considering a
representative set of learners and discretizers could help to make the best choice.
Some reviews of discretization techniques can be found in the literature [9, 36, 75,
123]. However, the characteristics of the methods are not studied completely, many
discretizers, even classic ones, are not mentioned, and the notation used for catego-
rization is not unified. In spite of the wealth of literature, and apart from the absence of
a complete categorization of discretizers using a unified notation, it can be observed
that there are few attempts to empirically compare them. In this way, the algorithms
proposed are usually compared with a subset of the complete family of discretizers
and, in most of the studies, no rigorous empirical analysis has been carried out. In
[51], it was noticed that the most compared techniques are EqualWidth, EqualFre-
quency, MDLP [41], ID3 [92], ChiMerge [68], 1R [59], D2 [19] and Chi2 [76].
These reasons motivate the global purpose of this chapter. We can summarize it
into three main objectives:
• To provide an updated and complete taxonomy based on the main properties
observed in the discretization methods. The taxonomy will allow us to charac-
terize their advantages and drawbacks in order to choose a discretizer from a
theoretical point of view.
• To make an empirical study analyzing the most representative and newest dis-
cretizers in terms of the number of intervals obtained and inconsistency level of
the data.
• Finally, to relate the best discretizers to a set of representative DM models using
two metrics to measure the predictive classification success.
Discretization is a wide field and there have been many advances and ideas over the
years. This section is devoted to providing a proper background on the topic, together
with a set of related areas and future perspectives on discretization.
Before starting, we must first introduce some terms used by different sources for the
sake of unification.
9.2.1.1 Feature
Also called attribute or variable, a feature refers to an aspect of the data and is
usually associated with a column in a data table. M stands for the number of features
in the data.
9.2.1.2 Instance
Also called tuple, example, record or data point, an instance refers to a collection
of values for all features. A set of instances constitutes a data set, and each instance
is usually associated with a row in a data table. As in the introduction, we will use
N as the number of instances in the data.
9.2.1.3 Cut Point
A cut point refers to a real value within the range of continuous values that divides
the range into two adjacent intervals, the first interval being less than or equal to the
cut point and the second interval greater than the cut point.
9.2.1.4 Arity
In the discretization context, arity refers to the number of intervals or partitions,
denoted m in this chapter.

A typical discretization process broadly consists of four steps: (1) sorting the
continuous values of the feature to be discretized, (2) evaluating a cut
point for splitting or adjacent intervals for merging, (3) splitting or merging intervals
of continuous values according to some defined criterion, and (4) stopping at some
point. Next, we explain these four steps in detail.
9.2.1.5 Sorting
The continuous values for a feature are sorted in either descending or ascending
order. It is crucial to use an efficient sorting algorithm with a time complexity of
O(N log N), for instance the well-known Quicksort algorithm. Sorting must be
done only once, at the start of discretization. It is a mandatory treatment, and it is a
global treatment when the complete instance space is used for discretization. However,
if the discretization takes place within another algorithm (such as decision tree
induction), it is a local treatment and only a region of the whole instance space is
considered for discretization.
After sorting, the best cut point or the best pair of adjacent intervals should be found
in the attribute range in order to split or merge in a following required step. An evalu-
ation measure or function is used to determine the correlation, gain, improvement in
performance and any other benefit according to the class label. There are numerous
evaluation measures and they will be discussed in Sect. 9.3.1, the entropy and the
statistical dependency being the most well known.
9.2.1.7 Splitting/Merging
or consistency. For example, a threshold for m can be an upper bound for the arity
of the resulting discretization. A stopping criterion can be very simple such as fixing
the number of final intervals at the beginning of the process or more complex like
estimating a function.
This section presents a taxonomy of discretization methods and the criteria used
for building it. First, in Sect. 9.3.1, the main characteristics which will define the
categories of the taxonomy will be outlined. Then, in Sect. 9.3.2, we enumerate the
discretization methods proposed in the literature, referring to each by both
its complete name and its abbreviated name together with the associated reference.
Finally, we present the taxonomy.
This section provides a framework for the discussion of the discretizers presented in
the next subsection. The issues discussed include several properties involved in the
structure of the taxonomy, since they are exclusive to the operation of the discretizer.
Other, less critical issues such as parametric properties or stopping conditions will
be presented although they are not involved in the taxonomy. Finally, some criteria
will also be pointed out in order to compare discretization methods.
In [36, 51, 75, 123], various axes have been described in order to make a cate-
gorization of discretization methods. We review and explain them in this section,
emphasizing the main aspects and relations found among them and unifying the
notation. The taxonomy presented will be based on these characteristics:
• Static versus Dynamic: This characteristic refers to the moment and independence
which the discretizer operates in relation to the learner. A dynamic discretizer
acts when the learner is building the model, thus they can only access partial
information (local property, see later) embedded in the learner itself, yielding
compact and accurate results in conjunction with the associated learner. Otherwise,
a static discretizer proceeds prior to the learning task and it is independent from
the learning algorithm [75]. Almost all known discretizers are static, due to the
fact that most of the dynamic discretizers are really subparts or stages of DM
algorithms when dealing with numerical data [13]. Some examples of well-known
dynamic techniques are ID3 discretizer [92] and ITFP [6].
• Univariate versus Multivariate: Multivariate techniques, also known as 2D dis-
cretization [81], simultaneously consider all attributes to define the initial set of
cut points or to decide the best cut point altogether. They can also discretize one
attribute at a time when studying the interactions with other attributes, exploiting
high order relationships. By contrast, univariate discretizers only work with a sin-
gle attribute at a time, once an order among attributes has been established, and
the resulting discretization scheme in each attribute remains unchanged in later
stages. Interest has recently arisen in developing multivariate discretizers since
they are very influential in deductive learning [10, 49] and in complex classifi-
cation problems where high interactions among multiple attributes exist, which
univariate discretizers might obviate [42, 121].
• Supervised versus Unsupervised: Unsupervised discretizers do not consider the
class label whereas supervised ones do. The manner in which the latter consider
the class attribute depends on the interaction between input attributes and class
labels, and the heuristic measures used to determine the best cut points (entropy,
interdependence, etc.). Most discretizers proposed in the literature are supervised
and, theoretically, by using class information they should automatically determine the
best number of intervals for each attribute. If a discretizer is unsupervised, it does
not mean that it cannot be applied over supervised tasks. However, a supervised
discretizer can only be applied over supervised DM problems. Representative
unsupervised discretizers are EqualWidth and EqualFrequency [73], PKID and
FFD [122] and MVD [10].
• Splitting versus Merging: This refers to the procedure used to create or define new
intervals. Splitting methods establish a cut point among all the possible boundary
points and divide the domain into two intervals. By contrast, merging methods start
with a pre-defined partition and remove a candidate cut point to mix both adjacent
intervals. These properties are highly related to Top-Down and Bottom-up respec-
tively (explained in the next section). The idea behind them is very similar, except
that top-down or bottom-up discretizers assume that the process is incremental
(described later), according to a hierarchical discretization construction. In fact,
there can be discretizers whose operation is based on splitting or merging more
than one interval at a time [72, 96]. Also, some discretizers can be considered
hybrid due to the fact that they can alternate splits with merges in running time
[24, 43].
• Global versus Local: To make a decision, a discretizer can either require all
available data in the attribute or use only partial information. A discretizer is said
to be local when it only makes the partition decision based on local information.
Examples of widely used local techniques are MDLP [41] and ID3 [92]. Few
discretizers are local, except some based on top-down partition and all the dynamic
techniques. In a top-down process, some algorithms follow the divide-and-conquer
scheme and when a split is found, the data is recursively divided, restricting access
to partial data. Regarding dynamic discretizers, they find the cut points in internal
operations of a DM algorithm, so they never gain access to the full data set.
• Direct versus Incremental: Direct discretizers divide the range into k intervals
simultaneously, requiring an additional criterion to determine the value of k. They
do not only include one-step discretization methods, but also discretizers which
perform several stages in their operation, selecting more than a single cut point at
every step. By contrast, incremental methods begin with a simple discretization
and pass through an improvement process, requiring an additional criterion to
know when to stop it. At each step, they find the best candidate boundary to be
used as a cut point and afterwards the rest of the decisions are made accordingly.
Incremental discretizers are also known as hierarchical discretizers [9]. Both types
of discretizers are widespread in the literature, although there is usually a more
defined relationship between incremental and supervised ones.
• Evaluation Measure: This is the metric used by the discretizer to compare two
candidate schemes and decide which is more suitable to be used. We consider five
main families of evaluation measures:
– Information: This family includes entropy as the most used evaluation measure
in discretization (MDLP [41], ID3 [92], FUSINTER [126]) and other derived
information theory measures such as the Gini index [66].
– Statistical: Statistical evaluation involves the measurement of dependency/
correlation among attributes (Zeta [58], ChiMerge [68], Chi2 [76]), probability
and bayesian properties [119] (MODL [16]), interdependency [70], contingency
coefficient [106], etc.
– Rough Sets: This group is composed of methods that evaluate the discretization
schemes by using rough set measures and properties [86], such as lower and
upper approximations, class separability, etc.
– Wrapper: This collection comprises methods that rely on the error provided by
a classifier that is run for each evaluation. The classifier can be a very simple
one, such as a majority class voting classifier (Valley [108]) or general classifiers
such as Naïve Bayes (NBIterative [87]).
– Binning: This category refers to the absence of an evaluation measure. It is the
simplest method to discretize an attribute by creating a specified number of bins.
Each bin is defined a priori and allocates a specified number of values per
attribute. Widely used binning methods are EqualWidth and EqualFrequency.
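As an illustration of the statistical family above, the sketch below computes the χ2 statistic between two adjacent intervals from their per-class example counts, the quantity that merging methods such as ChiMerge and Chi2 rely on. It is illustrative code written for this text, not the original implementation of either method.

```python
def chi2_adjacent(counts_i, counts_j):
    """Chi-square statistic between two adjacent intervals.

    counts_i and counts_j give the per-class example counts of the two
    intervals (same class order).  A low value means the class
    distributions are similar, so merging the intervals loses little.
    """
    k = len(counts_i)
    rows = [counts_i, counts_j]
    row_tot = [sum(r) for r in rows]
    col_tot = [counts_i[j] + counts_j[j] for j in range(k)]
    n = float(sum(row_tot))
    chi2 = 0.0
    for i in range(2):
        for j in range(k):
            expected = row_tot[i] * col_tot[j] / n
            expected = max(expected, 0.1)  # avoid zero expected counts (ChiMerge uses a small value)
            chi2 += (rows[i][j] - expected) ** 2 / expected
    return chi2

# Very different class distributions give a high statistic (about 11.8 here),
# so a merging method would keep these two intervals separate.
print(round(chi2_adjacent([8, 1], [1, 9]), 2))
```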
We can discuss other properties related to discretization which also influence the
operation and results obtained by a discretizer, but to a lower degree than the
characteristics explained above. Furthermore, some of them present a large variety
of categorizations and may harm the interpretability of the taxonomy.
• Parametric versus Non-Parametric: This property refers to the automatic determi-
nation of the number of intervals for each attribute by the discretizer. A nonpara-
metric discretizer computes the appropriate number of intervals for each attribute
considering a trade-off between the loss of information or consistency and obtain-
ing the lowest number of them. A parametric discretizer requires a maximum
number of intervals desired to be fixed by the user. Examples of non-parametric
discretizers are MDLP [41] and CAIM [70]. Examples of parametric ones are
ChiMerge [68] and CADD [24].
• Top-Down versus Bottom Up: This property is only observed in incremental dis-
cretizers. Top-Down methods begin with an empty discretization. Its improve-
ment process is simply to add a new cutpoint to the discretization. On the other
hand, Bottom-Up methods begin with a discretization that contains all the possible
cutpoints. Its improvement process consists of iteratively merging two intervals,
removing a cut point. A classic Top-Down method is MDLP [41] and a well-known
Bottom-Up method is ChiMerge [68].
• Stopping Condition: This is related to the mechanism used to stop the discretization
process and must be specified in nonparametric approaches. Well-known stopping
criteria are the Minimum Description Length measure [41], confidence thresholds
[68], or inconsistency ratios [26].
• Disjoint versus Non-Disjoint: Disjoint methods discretize the value range of the
attribute into disassociated intervals, without overlapping, whereas non-disjoint
methods discretize the value range into intervals that can overlap. The methods
reviewed in this chapter are disjoint, while fuzzy discretization is usually non-
disjoint [62].
• Ordinal versus Nominal: Ordinal discretization transforms quantitative data into
ordinal qualitative data whereas nominal discretization transforms it into nominal
qualitative data, discarding the information about order. Ordinal discretizers are
less common, and not usually considered classic discretizers [80].
When comparing discretization methods, there are a number of criteria that can be
used to evaluate the relative strengths and weaknesses of each algorithm. These
include the number of intervals, inconsistency, predictive classification rate and time
requirements.
• Number of Intervals: A desirable feature for practical discretization is that dis-
cretized attributes have as few values as possible, since a large number of intervals
may make the learning slow and ineffective [19].
• Inconsistency: A supervision-based measure used to compute the number of
unavoidable errors produced in the data set. An unavoidable error is associated
with two examples that have the same values for the input attributes but different class
labels. In general, data sets with continuous attributes are consistent, but when
a discretization scheme is applied over the data, an inconsistent data set may be
obtained. The desired inconsistency level that a discretizer should obtain is 0.0.
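Following the description above, one straightforward way to compute the inconsistency level is to group the examples that share identical input attribute values and count, within each group, the examples outside the majority class. The sketch below is an illustrative implementation under that reading, not a routine from any particular library.

```python
from collections import Counter, defaultdict

def inconsistency_rate(X, y):
    """Inconsistency level of a (discretized) data set.

    Examples sharing exactly the same input attribute values are
    grouped; within each group, every example outside the majority
    class is an unavoidable error.  The rate is the total number of
    errors divided by the number of examples N.
    """
    groups = defaultdict(list)
    for pattern, label in zip(X, y):
        groups[tuple(pattern)].append(label)
    errors = sum(len(labels) - Counter(labels).most_common(1)[0][1]
                 for labels in groups.values())
    return errors / len(y)

# Two examples share the pattern (1, 0) but disagree on the class,
# giving one unavoidable error among four examples: rate 0.25.
X = [(1, 0), (1, 0), (0, 2), (2, 1)]
y = ["a", "b", "a", "b"]
print(inconsistency_rate(X, y))   # 0.25
```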
At the time of writing, more than 80 discretization methods have been proposed in the
literature. This section is devoted to enumerating and designating them according to
a standard followed in this chapter. We have used 30 discretizers in the experimental
study, those that we have identified as the most relevant ones. For more details on
their descriptions, the reader can visit the URL associated to the KEEL project.1
Additionally, implementations of these algorithms in Java can be found in KEEL
software [3, 4].
Table 9.1 presents an enumeration of discretizers reviewed in this chapter. The
complete name, abbreviation and reference are provided for each one. This chapter
does not collect the descriptions of the discretizers. Instead, we recommend that
readers consult the original references to understand the complete operation of the
discretizers of interest. Discretizers used in the experimental study are depicted in
bold. The ID3 discretizer used in the study is a static version of the well-known
discretizer embedded in C4.5.
The properties studied above can be used to categorize the discretizers proposed in
the literature. The seven characteristics studied allow us to present the taxonomy of
discretization methods in an established order. All techniques enumerated in Table 9.1
are collected in the taxonomy drawn in Fig. 9.2. It illustrates the categorization
following a hierarchy based on this order: static/dynamic, univariate/multivariate,
supervised/unsupervised, splitting/merging/hybrid, global/local, direct/incremental
and evaluation measure. The rationale behind the choice of this order is to achieve a
clear representation of the taxonomy.
The proposed taxonomy assists us in the organization of many discretization
methods so that we can classify them into categories and analyze their behavior. Also,
we can highlight other aspects in which the taxonomy can be useful. For example, it
provides a snapshot of existing methods and relations or similarities among them. It
also depicts the size of the families, the work done in each one and what is currently
missing. Finally, it provides a general overview of the state-of-the-art methods in
discretization for researchers/practitioners who are beginning in this field or need to
discretize data in real applications.
1 http://www.keel.es.
Table 9.1 Discretizers
Complete name Abbr. name Reference Complete name Abbr. name Reference
Equal Width Discretizer EqualWidth [115] No name specified Butterworth04 [18]
Equal Frequency Discretizer EqualFrequency [115] No name specified Zhang04 [124]
No name specified Chou91 [27] Khiops Khiops [15]
Adaptive Quantizer AQ [21] Class-Attribute Interdependence Maximization CAIM [70]
Discretizer 2 D2 [19] Extended Chi2 Extended Chi2 [101]
ChiMerge ChiMerge [68] Heterogeneity Discretizer Heter-Disc [78]
One-Rule Discretizer 1R [59] Unsupervised Correlation Preserving Discretizer UCPD [81]
Iterative Dichotomizer 3 Discretizer ID3 [92] No name specified Multi-MDL [42]
Minimum Description Length Principle MDLP [41] Difference Similitude Set Theory Discretizer DSST [116]
Valley Valley [108, 109] Multivariate Interdependent Discretizer MIDCA [22]
Class-Attribute Dependent Discretizer CADD [24] MODL MODL [16]
ReliefF Discretizer ReliefF [69] Information Theoretic Fuzzy Partitioning ITFP [6]
Class-driven Statistical Discretizer StatDisc [94] No name specified Wu06 [118]
No name specified NBIterative [87] Fast Independent Component Analysis FastICA [67]
Boolean Reasoning Discretizer BRDisc [86] Linear Program Relaxation LP-Relaxation [37]
Minimum Description Length Discretizer MDL-Disc [89] Hellinger-Based Discretizer HellingerBD [72]
Bayesian Discretizer Bayesian [119] Distribution Index-Based Discretizer DIBD [117]
No name specified Friedman96 [45] Wrapper Estimation of Distribution Algorithm WEDA [43]
Cluster Analysis Discretizer ClusterAnalysis [26] Clustering + Rought Sets Discretizer Cluster-RS-Disc [100]
Zeta Zeta [58] Interval Distance Discretizer IDD [96]
Distance-based Discretizer Distance [20] Class-Attribute Contingency Coefficient CACC [106]
Finite Mixture Model Discretizer FMM [102] Rectified Chi2 Rectified Chi2 [91]
9 Discretization
(continued)
Table 9.1 (continued)
Chi2 Chi2 [76] Ameva Ameva [53]
No name specified FischerExt [127] Unification Unification [66]
Contextual Merit Numerical Feature Discretizer CM-NFD [60] Multiple Scanning Discretizer MultipleScan [54]
Concurrent Merger ConMerge [110] Optimal Flexible Frequency Discretizer OFFD [111]
Knowledge EXplorer Discretizer KEX-Disc [11] Proportional Discretizer PKID [122]
LVQ-based Discretization LVQ-Disc [88] Fixed Frequency Discretizer FFD [122]
No name specified Multi-Bayesian [83] Discretization Class intervals Reduce DCR [90]
No name specified A* [47] MVD-CG MVD-CG [112]
9.3 Properties and Taxonomy
A typical splitting algorithm consists of four steps in the discretization scheme: (1)
sort the feature values, (2) search for an appropriate cut point, (3) split the range of
continuous values according to the cut point, and (4) stop when a stopping criterion
is satisfied, otherwise go to (2). In the following, we include descriptions of the most
representative discretizers based on a splitting criterion.
Equal Width or Frequency [75]
They belong to the simplest family of methods that discretize an attribute by creating
a specified number of bins, which are created by equal width or equal frequency. This
family is known as binning methods. The arity m must be specified at the beginning
of both discretizers and determines the number of bins. Each bin is associated with a
different discrete value. In equal width, the continuous range of a feature is divided
into m intervals of equal width, each representing a bin; the width of each interval is
therefore obtained by dividing the length of the attribute range by the desired arity.
In equal frequency, an (approximately) equal number of the sorted continuous values
is placed in each bin, so each interval contains roughly N/m of the training values.
There is no need for a stopping criterion, as the number of bins is fixed at the beginning
of the process. In practice, both methods are equally simple, since they only depend
on the number of bins desired, although they generally produce different cut points.
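To make the two binning strategies concrete, the following minimal Java sketch (illustrative code, not the KEEL implementation; all names are ours) computes the cut points produced by equal width and equal frequency discretization given the values of an attribute and the desired arity m:

import java.util.Arrays;

public class BinningSketch {

    // Equal width: m - 1 cut points evenly spaced over [min, max].
    static double[] equalWidthCutPoints(double[] values, int m) {
        double min = Arrays.stream(values).min().getAsDouble();
        double max = Arrays.stream(values).max().getAsDouble();
        double width = (max - min) / m;
        double[] cuts = new double[m - 1];
        for (int i = 1; i < m; i++) {
            cuts[i - 1] = min + i * width;
        }
        return cuts;
    }

    // Equal frequency: m - 1 cut points so that each bin holds roughly N / m sorted values.
    static double[] equalFrequencyCutPoints(double[] values, int m) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        double[] cuts = new double[m - 1];
        for (int i = 1; i < m; i++) {
            int idx = (i * n) / m;                               // index of the first value of bin i
            cuts[i - 1] = (sorted[idx - 1] + sorted[idx]) / 2.0; // midpoint between neighbouring values
        }
        return cuts;
    }
}

Ties among repeated values can make the equal frequency bins slightly unbalanced; production implementations usually handle such details explicitly.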
MDLP [41]
This discretizer uses the entropy measure to evaluate candidate cut points. Entropy is
one of the most commonly used discretization measures in the literature. The entropy
of a sample variable X is

H(X) = − Σ_x p_x log p_x

where x ranges over the values of X and p_x is the probability of value x, and the
information of a single value is

I(x) = − log p_x

Information is high for less probable events and low otherwise. This discretizer
uses the Information Gain of a cut point, which is defined as

G(A, T; S) = H(S) − H(A, T; S) = H(S) − (|S1|/N) H(S1) − (|S2|/N) H(S2)

where A is the attribute in question, T is a candidate cut point, S is the set of N
examples, and S1 and S2 are the two subsets of examples produced by splitting S at T.
The MDLP discretizer applies the Minimum Description Length Principle to
decide the acceptance or rejection of each cut point and to govern the stopping
criterion. It is defined in information theory to be the minimum number of bits
required to uniquely specify an object out of the universe of all objects. It computes
a final cost of coding and takes part in the decision making of the discretizer.
In summary, the MDLP criterion is that the partition induced by a cut point T for
a set S of N examples is accepted iff
G(A, T; S) > log2(N − 1)/N + δ(A, T; S)/N
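The acceptance rule can be turned into code almost literally. The following Java sketch (not the KEEL implementation; names are illustrative) computes the entropies, the information gain and the MDLP test, writing the delta term with the usual Fayyad and Irani formulation δ(A, T; S) = log2(3^c − 2) − [c·H(S) − c1·H(S1) − c2·H(S2)], where c, c1 and c2 are the numbers of classes present in S, S1 and S2:

public class MdlpSketch {

    static final double LOG2 = Math.log(2);

    // Entropy of a label array: H(S) = -sum_y p_y log2 p_y
    static double entropy(int[] labels, int numClasses) {
        int[] count = new int[numClasses];
        for (int y : labels) count[y]++;
        double h = 0.0;
        for (int c : count) {
            if (c > 0) {
                double p = (double) c / labels.length;
                h -= p * Math.log(p) / LOG2;
            }
        }
        return h;
    }

    // Number of distinct class labels present in the subset.
    static int distinctClasses(int[] labels, int numClasses) {
        boolean[] seen = new boolean[numClasses];
        int k = 0;
        for (int y : labels) if (!seen[y]) { seen[y] = true; k++; }
        return k;
    }

    // MDLP acceptance test for splitting the labels of S into S1 and S2 at a cut point T.
    static boolean acceptCut(int[] s, int[] s1, int[] s2, int numClasses) {
        int n = s.length;
        double hS = entropy(s, numClasses);
        double hS1 = entropy(s1, numClasses);
        double hS2 = entropy(s2, numClasses);
        // G(A,T;S) = H(S) - |S1|/N * H(S1) - |S2|/N * H(S2)
        double gain = hS - ((double) s1.length / n) * hS1 - ((double) s2.length / n) * hS2;
        int k = distinctClasses(s, numClasses);
        int k1 = distinctClasses(s1, numClasses);
        int k2 = distinctClasses(s2, numClasses);
        // delta(A,T;S) = log2(3^k - 2) - [k*H(S) - k1*H(S1) - k2*H(S2)]
        double delta = Math.log(Math.pow(3, k) - 2) / LOG2 - (k * hS - k1 * hS1 - k2 * hS2);
        return gain > (Math.log(n - 1) / LOG2) / n + delta / n;
    }
}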
Distance [20]
This discretizer evaluates candidate cut points by means of the Mantaras distance
between the two partitions S_a and S_b induced by a single cut point, each containing
c_a and c_b classes, respectively. The distance is expressed in terms of the following
quantities:

I(S_a) = − Σ_{i=1}^{c_a} S_i log2 S_i

I(S_b) = − Σ_{j=1}^{c_b} S_j log2 S_j

I(S_a ∩ S_b) = − Σ_{i=1}^{c_a} Σ_{j=1}^{c_b} S_ij log2 S_ij

where
S_i = |C_i| / N
|C_i| = total count of class i
N = total number of instances
S_ij = S_i × S_j
It chooses the cut point that minimizes the distance. As a stopping criterion, it uses
the minimum description length discussed previously to determine whether more cut
points should be added.
PKID [122]
In order to maintain both a low bias and a low variance in a learning scheme, it is
advisable to increase both the interval frequency and the number of intervals as the
amount of training data increases. A good way to achieve this is to set the interval
frequency and the interval number equally proportional to the amount of training data.
This is the main purpose of proportional discretization (PKID).
When discretizing a continuous attribute for which there are N instances, sup-
posing that the desired interval frequency is s and the desired interval number is t,
PKID calculates s and t by the following expressions:
s × t = N
s = t

Thus, this discretizer is essentially an equal frequency discretizer in which the interval
frequency and the number of intervals take the same value (both approximately √N)
and depend only on the number of instances in the training data.
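A sketch of the PKID rule in Java follows (illustrative code, not the KEEL implementation): both quantities are set to roughly √N and the sorted values are then split by equal frequency.

import java.util.Arrays;

public class PkidSketch {

    // PKID: s = t ≈ sqrt(N); then apply equal frequency binning with t intervals.
    static double[] pkidCutPoints(double[] values) {
        int n = values.length;
        int t = Math.max(1, (int) Math.floor(Math.sqrt(n)));  // interval number = interval frequency
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        double[] cuts = new double[t - 1];
        for (int i = 1; i < t; i++) {
            int idx = (i * n) / t;
            cuts[i - 1] = (sorted[idx - 1] + sorted[idx]) / 2.0;
        }
        return cuts;
    }
}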
FFD [122]
FFD stands for fixed frequency discretization and was proposed for managing bias
and variance especially in naive-bayes based classifiers. To discretize a continuous
attribute, FFD sets a sufficient interval frequency, f . Then it discretizes the ascend-
ingly sorted values into intervals of frequency f . Thus each interval has approxi-
mately the same number f of training instances with adjacent values. Incorporating
f , FFD aims to ensure that the interval frequency is sufficient so that there are enough
training instances in each interval to reliably estimate the Naïve Bayes probabilities.
There may be confusion when distinguishing equal frequency discretization from
FFD. The former one fixes the interval number, thus it arbitrarily chooses the interval
number and then discretizes a continuous attribute into intervals such that each inter-
val has the same number of training instances. On the other hand, the latter method,
FFD, fixes the interval frequency by the value f . It then sets cut points so that each
interval contains f training instances, controlling the discretization variance.
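The corresponding sketch for FFD simply places a cut after every f sorted values (the helper name is ours and f is supplied by the user), so the number of intervals grows with the amount of training data instead of being fixed in advance:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FfdSketch {

    // FFD: cut after every f sorted values, so each interval holds about f training instances.
    static double[] ffdCutPoints(double[] values, int f) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        List<Double> cuts = new ArrayList<>();
        for (int idx = f; idx < sorted.length; idx += f) {
            cuts.add((sorted[idx - 1] + sorted[idx]) / 2.0);
        }
        return cuts.stream().mapToDouble(Double::doubleValue).toArray();
    }
}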
CAIM [70]
CAIM stands for Class-Attribute Interdependency Maximization criterion, which
measures the dependency between the class variable C and the discretized variable
D for attribute A. The method requires the computation of the quanta matrix [24],
which, in summary, collects a snapshot of the number of real values of A within
each interval and for each class of the corresponding example. The criterion is cal-
culated as:
CAIM(C, D, A) = ( Σ_{r=1}^{m} max_r² / M_{+r} ) / m

where m is the number of intervals, r iterates through all intervals, i.e. r =
1, 2, . . . , m, max_r is the maximum value among all q_ir values (the maximum value
within the r-th column of the quanta matrix), and M_{+r} is the total number of continuous
values of attribute A that are within the interval (d_{r−1}, d_r].
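As an illustration, the CAIM value can be computed from a quanta matrix as in the following Java sketch, where we assume the matrix stores classes in rows and intervals in columns (this layout, like the names used, is only an assumption of the example):

public class CaimSketch {

    // quanta[i][r] = number of values of class i falling in interval r.
    // CAIM = (1/m) * sum_r ( max_r^2 / M_+r ), where max_r is the largest entry of column r
    // and M_+r is the column total.
    static double caim(int[][] quanta) {
        int numClasses = quanta.length;
        int m = quanta[0].length;          // number of intervals
        double sum = 0.0;
        for (int r = 0; r < m; r++) {
            int max = 0, total = 0;
            for (int i = 0; i < numClasses; i++) {
                max = Math.max(max, quanta[i][r]);
                total += quanta[i][r];
            }
            if (total > 0) {
                sum += (double) (max * max) / total;
            }
        }
        return sum / m;
    }
}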
According to the authors, CAIM has the following properties:
• The larger the value of CAIM, the higher the interdependence between the class
labels and the discrete intervals.
• It generates discretization schemes where each interval has all of its values grouped
within a single class label.
• It has taken into account the negative impact that values belonging to classes, other
than the class with the maximum number of values within an interval, have on the
discretization scheme.
A typical merging algorithm consists of four steps in the discretization process:
(1) sort the feature values, (2) search for the best two neighboring intervals, (3) merge
the pair into one interval, and (4) stop when a stopping criterion is satisfied, otherwise
go to (2). In the following, we include descriptions of the most representative
discretizers based on merging criteria.
ChiMerge [68]
First of all, we introduce the χ² measure as another important evaluation metric in
discretization. χ 2 is a statistical measure that conducts a significance test on the
relationship between the values of an attribute and the class. The rationale behind
this is that in accurate discretization, the relative class frequencies should be fairly
consistent within an interval but two adjacent intervals should not have similar relative
class frequency. The χ 2 statistic determines the similarity of adjacent intervals based
on some significance level. Actually, it tests the hypothesis that two adjacent intervals
of an attribute are independent of the class. If they are independent, they should be
merged; otherwise they should remain separate. χ 2 is computed as:
χ² = Σ_{i=1}^{2} Σ_{j=1}^{c} (N_ij − E_ij)² / E_ij
where:
c = number of classes
N_ij = number of distinct values in the ith interval, jth class
R_i = number of examples in the ith interval = Σ_{j=1}^{c} N_ij
C_j = number of examples in the jth class = Σ_{i=1}^{m} N_ij
N = total number of examples = Σ_{j=1}^{c} C_j
E_ij = expected frequency of N_ij = (R_i × C_j) / N
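For two adjacent intervals, the χ² statistic can be computed as in the following sketch (illustrative Java, not the KEEL code), where counts[i][j] is assumed to hold N_ij for interval i (0 or 1) and class j:

public class ChiMergeSketch {

    // chi2 for a pair of adjacent intervals: counts[i][j] = N_ij, i in {0, 1}, j = class index.
    static double chiSquare(int[][] counts) {
        int c = counts[0].length;
        double[] rowTotal = new double[2];
        double[] colTotal = new double[c];
        double n = 0;
        for (int i = 0; i < 2; i++) {
            for (int j = 0; j < c; j++) {
                rowTotal[i] += counts[i][j];
                colTotal[j] += counts[i][j];
                n += counts[i][j];
            }
        }
        double chi2 = 0.0;
        for (int i = 0; i < 2; i++) {
            for (int j = 0; j < c; j++) {
                double expected = rowTotal[i] * colTotal[j] / n;   // E_ij = (R_i * C_j) / N
                if (expected > 0) {
                    double d = counts[i][j] - expected;
                    chi2 += d * d / expected;
                }
            }
        }
        return chi2;
    }
}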
Another problem that was fixed is related to the merging criterion used in Chi2, which
was not very accurate and led to overmerging. The original Chi2 algorithm computes the
χ² value using an initially predefined degree of freedom (the number of classes minus
one). From a statistical point of view, this is inaccurate because the degree of freedom
may change according to the two adjacent intervals to be merged. This fact may
change the order of merging and increase the inconsistency after the discretization.
FUSINTER [126]
This method uses the same strategy as the ChiMerge method, but rather than trying
to merge adjacent intervals locally, FUSINTER tries to find the partition which
optimizes the measure. Next, we provide a short description of the FUSINTER
algorithm.
• Obtain the boundary cut points after sorting the values in increasing order and
forming intervals with runs of examples of the same class. The superposition of
several classes at a single cut point is also allowed.
• Construct a matrix with a structure similar to a quanta matrix.
• Find two adjacent intervals whose merging would improve the value of the criterion
and check whether they can be merged using a differential criterion.
• Repeat until no improvement is possible or only one interval remains.
Two criteria can be used for deciding the merging of intervals. The first is
Shannon's entropy and the second is the quadratic entropy.
9.4 Experimental Comparative Analysis

This section presents the experimental framework, the results collected and the
discussion of them. Section 9.4.1 describes the complete experimental set up. Then,
in Sect. 9.4.2, we offer the study and analysis of the results obtained over the data
sets used.
The goal of this section is to show all the properties and issues related to the exper-
imental study. We specify the data sets, validation procedure, classifiers used,
parameters of the classifiers and discretizers, and performance metrics. The data sets and
statistical tests used for comparison were described in Chap. 2 of this book. Here,
we will only specify the names of the data sets used. The performance of discretiza-
tion algorithms is analyzed by using 40 data sets taken from the UCI ML Database
Repository [8] and KEEL data set repository [3]. They are enumerated in Table 9.2.
In this study, six classifiers have been used in order to find differences in perfor-
mance among the discretizers. The classifiers are:
• C4.5 [92]: A well-known decision tree, considered one of the top 10 DM algorithms
[120].
• DataSqueezer [28]: This learner belongs to the family of inductive rule extraction.
In spite of its relative simplicity, DataSqueezer is a very effective learner. The rules
generated by the algorithm are compact and comprehensible, but accuracy is to
some extent degraded in order to achieve this goal.
• KNN: One of the simplest and most effective methods based on similarities among
a set of objects. It is also considered one of the top 10 DM algorithms [120] and
it can handle nominal attributes using proper distance functions such as HVDM
[114]. It belongs to the lazy learning family [2, 48].
• Naïve Bayes: This is another of the top 10 DM algorithms [120]. Its aim is to
construct a rule which will allow us to assign future objects to a class, assuming
independence of attributes when probabilities are established.
• PUBLIC [93]: It is an advanced decision tree that integrates the pruning phase
with the building stage of the tree in order to avoid the expansion of branches that
would be pruned afterwards.
• Ripper [32]: This is a widely used rule induction method based on a separate-and-
conquer strategy. It incorporates diverse mechanisms to avoid overfitting and to
handle numeric and nominal attributes simultaneously. The models obtained are
in the form of decision lists.
The data sets considered are partitioned using the 10-FCV procedure. The
parameters of the discretizers and classifiers are those recommended by their respective
authors. They are specified in Table 9.3 for those methods which require them. We
assume that the values of the parameters were optimally chosen by their own
authors. Nevertheless, for discretizers that require the number of intervals as an input
parameter, we use a rule of thumb which depends on the number of instances
in the data set: the number of instances is divided by 100 and the maximum between
this result and the number of classes is taken. All discretizers and
classifiers are run once on each partition because they are non-stochastic.
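For reference, the rule of thumb can be expressed as a one-line method (a sketch with illustrative names):

// Rule of thumb: number of intervals = max(numInstances / 100, numClasses).
static int intervalsRuleOfThumb(int numInstances, int numClasses) {
    return Math.max(numInstances / 100, numClasses);
}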
Two performance measures are widely used because of their simplicity and suc-
cessful application when multi-class classification problems are dealt with. We refer
to accuracy and Cohen's kappa [31] measures, which will be adopted to measure the
efficacy of the discretizers in terms of the generalization classification rate. The explanation
of Cohen's kappa was given in Chap. 2.
The empirical study involves 30 discretization methods from those listed in
Table 9.1. We want to outline that the implementations are only based on the descrip-
tions and specifications given by the respective authors in their papers.
Table 9.4 Average results collected from intrinsic properties of the discretizers: number of intervals
obtained and inconsistency rates in training and test data
Number of intervals              Inconsistency rate (train)       Inconsistency rate (test)
Heter-Disc 8.3125 ID3 0.0504 ID3 0.0349
MVD 18.4575 PKID 0.0581 PKID 0.0358
Distance 23.2125 Modified Chi2 0.0693 FFD 0.0377
UCPD 35.0225 FFD 0.0693 HDD 0.0405
MDLP 36.6600 HDD 0.0755 Modified Chi2 0.0409
Chi2 46.6350 USD 0.0874 USD 0.0512
FUSINTER 59.9850 ClusterAnalysis 0.0958 Khiops 0.0599
DIBD 64.4025 Khiops 0.1157 ClusterAnalysis 0.0623
CADD 67.7100 EqualWidth 0.1222 EqualWidth 0.0627
ChiMerge 69.5625 EqualFrequency 0.1355 EqualFrequency 0.0652
CAIM 72.5125 Chi2 0.1360 Chi2 0.0653
Zeta 75.9325 Bayesian 0.1642 FUSINTER 0.0854
Ameva 78.8425 MODL 0.1716 MODL 0.0970
Khiops 130.3000 FUSINTER 0.1735 HellingerBD 0.1054
1R 162.1925 HellingerBD 0.1975 Bayesian 0.1139
EqualWidth 171.7200 IDD 0.2061 UCPD 0.1383
Extended Chi2 205.2650 ChiMerge 0.2504 ChiMerge 0.1432
HellingerBD 244.6925 UCPD 0.2605 IDD 0.1570
EqualFrequency 267.7250 CAIM 0.2810 CAIM 0.1589
PKID 295.9550 Extended Chi2 0.3048 Extended Chi2 0.1762
MODL 335.8700 Ameva 0.3050 Ameva 0.1932
FFD 342.6050 1R 0.3112 CACC 0.2047
IDD 349.1250 CACC 0.3118 1R 0.2441
Modified Chi2 353.6000 MDLP 0.3783 Zeta 0.2454
CACC 505.5775 Zeta 0.3913 MDLP 0.2501
ClusterAnalysis 1116.1800 MVD 0.4237 DIBD 0.2757
USD 1276.1775 Distance 0.4274 Distance 0.2987
Bayesian 1336.0175 DIBD 0.4367 MVD 0.3171
ID3 1858.3000 CADD 0.6532 CADD 0.5688
HDD 2202.5275 Heter-Disc 0.6749 Heter-Disc 0.5708
Table 9.4 presents the average results corresponding to the number of intervals and
inconsistency rate in training and test data for all the discretizers over the 40 data
sets. Similarly, Tables 9.5 and 9.6 collect the average results associated with accuracy
and kappa measures for each classifier considered. For each metric, the discretizers
are ordered from the best to the worst. In Tables 9.5 and 9.6, we highlight those
Table 9.5 Average results of accuracy considering the six classifiers
C4.5 DataSqueezer KNN Naïve Bayes PUBLIC Ripper
FUSINTER 0.7588 Distance 0.5666 PKID 0.7699 PKID 0.7587 FUSINTER 0.7448 Modified Chi2 0.7241
ChiMerge 0.7494 CAIM 0.5547 FFD 0.7594 Modified Chi2 0.7578 CAIM 0.7420 Chi2 0.7196
Zeta 0.7488 Ameva 0.5518 Modified Chi2 0.7573 FUSINTER 0.7576 ChiMerge 0.7390 PKID 0.7097
CAIM 0.7484 MDLP 0.5475 EqualFrequency 0.7557 ChiMerge 0.7543 MDLP 0.7334 MODL 0.7089
UCPD 0.7447 Zeta 0.5475 Khiops 0.7512 FFD 0.7535 Distance 0.7305 FUSINTER 0.7078
Distance 0.7446 ChiMerge 0.5472 EqualWidth 0.7472 CAIM 0.7535 Zeta 0.7301 Khiops 0.6999
MDLP 0.7444 CACC 0.5430 FUSINTER 0.7440 EqualWidth 0.7517 Chi2 0.7278 FFD 0.6970
Chi2 0.7442 Heter-Disc 0.5374 ChiMerge 0.7389 Zeta 0.7507 UCPD 0.7254 EqualWidth 0.6899
Modified Chi2 0.7396 DIBD 0.5322 CAIM 0.7381 EqualFrequency 0.7491 Modified Chi2 0.7250 EqualFrequency 0.6890
Ameva 0.7351 UCPD 0.5172 MODL 0.7372 MODL 0.7479 Khiops 0.7200 CAIM 0.6870
Khiops 0.7312 MVD 0.5147 HellingerBD 0.7327 Chi2 0.7476 Ameva 0.7168 HellingerBD 0.6816
MODL 0.7310 FUSINTER 0.5126 Chi2 0.7267 Khiops 0.7455 HellingerBD 0.7119 USD 0.6807
EqualFrequency 0.7304 Bayesian 0.4915 USD 0.7228 USD 0.7428 EqualFrequency 0.7110 ChiMerge 0.6804
EqualWidth 0.7252 Extended Chi2 0.4913 Ameva 0.7220 ID3 0.7381 MODL 0.7103 ID3 0.6787
(continued)
Table 9.6 (continued) Average results of kappa considering the six classifiers
UCPD 0.5299 UCPD 0.2324 EqualWidth 0.5338 Chi2 0.5620 Zeta 0.5010 PKID 0.4915
Ameva 0.5297 Zeta 0.2189 CAIM 0.5260 CAIM 0.5616 Ameva 0.4986 EqualFrequency 0.4892
Chi2 0.5290 USD 0.2174 FUSINTER 0.5242 EqualWidth 0.5593 Chi2 0.4899 ChiMerge 0.4878
Distance 0.5288 Distance 0.2099 ChiMerge 0.5232 Khiops 0.5570 UCPD 0.4888 EqualWidth 0.4875
Modified Chi2 0.5163 Khiops 0.2038 MODL 0.5205 EqualFrequency 0.5564 Khiops 0.4846 CAIM 0.4870
MODL 0.5131 HDD 0.2030 HellingerBD 0.5111 MODL 0.5564 CACC 0.4746 Ameva 0.4810
EqualFrequency 0.5108 EqualFrequency 0.2016 Chi2 0.5100 USD 0.5458 HellingerBD 0.4736 FFD 0.4809
Khiops 0.5078 HellingerBD 0.1965 Ameva 0.5041 Zeta 0.5457 Modified Chi2 0.4697 Zeta 0.4769
HellingerBD 0.4984 Bayesian 0.1941 USD 0.4943 Ameva 0.5456 MODL 0.4620 HellingerBD 0.4729
CACC 0.4961 MODL 0.1918 HDD 0.4878 ID3 0.5403 EqualFrequency 0.4535 USD 0.4560
EqualWidth 0.4909 MDLP 0.1875 ClusterAnalysis 0.4863 HDD 0.5394 DIBD 0.4431 UCPD 0.4552
(continued)
discretizers whose performance is within 5 % of the range between the best and the
worst method in each measure, that is, whose value is not below value_best − 0.05 ·
(value_best − value_worst). They should be considered outstanding methods in each
category, regardless of their specific position in the table.
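For measures where higher values are better, such as accuracy and kappa, the highlighting rule can be written as a trivial sketch (names are illustrative):

// Outstanding if value >= best - 0.05 * (best - worst), for higher-is-better measures.
static boolean isOutstanding(double value, double best, double worst) {
    return value >= best - 0.05 * (best - worst);
}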
The Wilcoxon test [35, 50, 113] is adopted in this study considering a level of
significance equal to α = 0.05. Tables 9.7, 9.8 and 9.9 show a summary of all possible
comparisons involved in the Wilcoxon test among all discretizers and measures, for
number of intervals and inconsistency rate, accuracy and kappa respectively. Again,
the individual comparisons between all possible discretizers are exhibited in the
aforementioned URL, where a detailed report of statistical results can be found for
each measure and classifier. Tables 9.7, 9.8 and 9.9 summarize, for each method in the
rows, the number of discretizers outperformed by using the Wilcoxon test under the
column represented by the ‘+’ symbol. The column with the ‘±’ symbol indicates
the number of wins and ties obtained by the method in the row. The maximum value
for each column is highlighted by a shaded cell.
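Such pairwise comparisons can be reproduced with any statistical package. For instance, assuming the Apache Commons Math 3 library is available (this choice is ours, not necessarily the one used by the authors), a Wilcoxon signed-rank test over the per-data-set results of two discretizers could be run as follows:

import org.apache.commons.math3.stat.inference.WilcoxonSignedRankTest;

public class WilcoxonSketch {

    // resultsA and resultsB hold one value (e.g. accuracy) per data set for two discretizers.
    static boolean significantlyDifferent(double[] resultsA, double[] resultsB, double alpha) {
        WilcoxonSignedRankTest test = new WilcoxonSignedRankTest();
        // false -> use the normal approximation of the p-value (suitable for 40 data sets)
        double pValue = test.wilcoxonSignedRankTest(resultsA, resultsB, false);
        return pValue < alpha;  // e.g. alpha = 0.05
    }
}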
Once the results are presented in the mentioned tables and graphics, we can
stress some interesting properties observed from them, and we can point out the best
performing discretizers:
• Regarding the number of intervals, the discretizers which divide the numerical
attributes in fewer intervals are Heter-Disc, MVD and Distance, whereas dis-
cretizers which require a large number of cut points are HDD, ID3 and Bayesian.
The Wilcoxon test confirms that Heter-Disc is the discretizer that obtains the least
intervals outperforming the rest.
• The inconsistency rate both in training data and test data follows a similar trend for
all discretizers, considering that the inconsistency obtained in test data is always
lower than in training data. ID3 is the discretizer that obtains the lowest average
inconsistency rate in training and test data, albeit the Wilcoxon test cannot find
significant differences between it and the other two discretizers: FFD and PKID.
We can observe a close relationship between the number of intervals produced and
the inconsistency rate, where discretizers that compute fewer cut points are usually
those which have a high inconsistency rate. They risk the consistency of the data
in order to simplify the result, although the consistency is not usually correlated
with the accuracy, as we will see below.
• In decision trees (C4.5 and PUBLIC), a subset of discretizers can be stressed as
the best performing ones. Considering average accuracy, FUSINTER, ChiMerge
and CAIM stand out from the rest. Considering average kappa, Zeta and MDLP
are also added to this subset. The Wilcoxon test confirms this result and adds
another discretizer, Distance, which outperforms 16 of the 29 methods. All meth-
ods emphasized are supervised, incremental (except Zeta) and use statistical and
information measures as evaluators. Splitting/Merging and Local/Global proper-
ties have no effect on decision trees.
• Considering rule induction (DataSqueezer and Ripper), the best performing dis-
cretizers are Distance, Modified Chi2, Chi2, PKID and MODL in average accuracy
and CACC, Ameva, CAIM and FUSINTER in average kappa. In this case, the results
are very irregular due to the fact that the Wilcoxon test emphasizes the ChiMerge
as the best performing discretizer for DataSqueezer instead of Distance and incor-
porates Zeta in the subset. With Ripper, the Wilcoxon test confirms the results
obtained by averaging accuracy and kappa. It is difficult to discern a common
set of properties that define the best performing discretizers due to the fact that
rule induction methods differ in their operation to a greater extent than decision
trees. However, we can say that, in the subset of best methods, incremental and
supervised discretizers predominate in the statistical evaluation.
• Lazy and bayesian learning can be analyzed together, due to the fact that the
HVDM distance used in KNN is highly related to the computation of bayesian
probabilities considering attribute independence [114]. With respect to lazy and
bayesian learning, KNN and Naïve Bayes, the subset of remarkable discretizers is
formed by PKID, FFD, Modified Chi2, FUSINTER, ChiMerge, CAIM, EqualWidth
and Zeta, when average accuracy is used; and Chi2, Khiops, EqualFrequency and
MODL must be added when average kappa is considered. The statistical report by
Wilcoxon informs us of the existence of two outstanding methods: PKID for KNN,
which outperforms 27 of the 29 remaining methods, and FUSINTER for Naïve Bayes. Here, supervised and
unsupervised, direct and incremental, binning and statistical/information evaluation
are characteristics present in the best performing methods. However, we can
see that all of them are global, thus identifying a trend towards binning methods.
• In general, accuracy and kappa performance registered by discretizers do not differ
too much. The behavior in both evaluation metrics are quite similar, taking into
account that the differences in kappa are usually lower due to the compensation of
random success offered by it. Surprisingly, in DataSqueezer, accuracy and kappa
offer the greatest differences in behavior, but they are motivated by the fact that this
method focuses on obtaining simple rule sets, leaving precision in the background.
• It is obvious that there is a direct dependence between discretization and the
classifier used. We have pointed out that a similar behavior in decision trees and
lazy/bayesian learning can be detected, whereas in rule induction learning, the
operation of the algorithm conditions the effectiveness of the discretizer. Knowing
a subset of suitable discretizers for each type of classifier is a good starting point
to understand and propose improvements in the area.
• Another interesting observation can be made about the relationship between accu-
racy and the number of intervals yielded by a discretizer. A discretizer that com-
putes few cut points does not necessarily obtain poor results in accuracy, and vice
versa.
• Finally, we can stress a subset of global best discretizers considering a trade-off
between the number of intervals and accuracy obtained. In this subset, we can
include FUSINTER, Distance, Chi2, MDLP and UCPD.
On the other hand, an analysis centered on the 30 discretizers studied is given as
follows:
• Many classic discretizers are usually the best performing ones. This is the case of
ChiMerge, MDLP, Zeta, Distance and Chi2.
• Other classic discretizers are not as good as they should be, considering that they
have been improved over the years: EqualWidth, EqualFrequency, 1R, ID3 (the
static version is much worse than the dynamic one embedded in the C4.5 operation), CADD,
Bayesian and ClusterAnalysis.
• Slight modifications of classic methods have greatly enhanced their results, such
as, for example, FUSINTER, Modified Chi2, PKID and FFD; but in other cases,
the extensions have diminished their performance: USD, Extended Chi2.
• Promising techniques that have been evaluated under unfavorable circumstances
are MVD and UCPD, which are unsupervised methods useful for application to other
DM problems apart from classification.
• Recently proposed methods that have been shown to be competitive with classic
methods, and even to outperform them in some scenarios, are Khiops,
CAIM, MODL, Ameva and CACC. However, recent proposals that have reported
bad results in general are Heter-Disc, HellingerBD, DIBD, IDD and HDD.
• Finally, this study involves a higher number of data sets than the quantity con-
sidered in previous works, and the conclusions achieved are impartial towards
a specific discretizer. However, we have to stress some coincidences with the
conclusions of these previous works. For example, in [105], the authors propose
References
1. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules. In: Proceedings of the
20th Very Large Data Bases conference (VLDB), pp. 487–499 (1994)
2. Aha, D.W. (ed.): Lazy Learning. Springer, New York (2010)
3. Alcalá-Fdez, J., Fernández, A., Luengo, J., Derrac, J., García, S., Sánchez, L., Herrera, F.:
KEEL data-mining software tool: data set repository, integration of algorithms and experimental
analysis framework. J. Multiple-Valued Logic Soft Comput. 17(2–3), 255–287 (2011)
4. Alcalá-Fdez, J., Sánchez, L., García, S., del Jesus, M.J., Ventura, S., Garrell, J.M., Otero, J.,
Romero, C., Bacardit, J., Rivas, V.M., Fernández, J.C., Herrera, F.: KEEL: a software tool to
assess evolutionary algorithms for data mining problems. Soft Comput. 13(3), 307–318 (2009)
5. An, A., Cercone, N.: Discretization of Continuous Attributes for Learning Classification Rules.
In: Proceedings of the Third Conference on Methodologies for Knowledge Discovery and Data
Mining, pp. 509–514 (1999)
6. Au, W.H., Chan, K.C.C., Wong, A.K.C.: A fuzzy approach to partitioning continuous attributes
for classification. IEEE Trans. Knowl. Data Eng. 18(5), 715–719 (2006)
7. Augasta, M.G., Kathirvalavakumar, T.: A new discretization algorithm based on range coef-
ficient of dispersion and skewness for neural networks classifier. Appl. Soft Comput. 12(2),
619–625 (2012)
8. Bache, K., Lichman, M.: UCI machine learning repository (2013). http://archive.ics.uci.edu/
ml
9. Bakar, A.A., Othman, Z.A., Shuib, N.L.M.: Building a new taxonomy for data discretization
techniques. In: Proceedings on Conference on Data Mining and Optimization (DMO), pp.
132–140 (2009)
10. Bay, S.D.: Multivariate discretization for set mining. Knowl. Inf. Syst. 3, 491–512 (2001)
11. Berka, P., Bruha, I.: Empirical comparison of various discretization procedures. Int. J. Pattern
Recognit. Artif. Intell. 12(7), 1017–1032 (1998)
12. Berrado, A., Runger, G.C.: Supervised multivariate discretization in mixed data with ran-
dom forests. In: ACS/IEEE International Conference on Computer Systems and Applications
(ICCSA), pp. 211–217 (2009)
13. Berzal, F., Cubero, J.C., Marín, N., Sánchez, D.: Building multi-way decision trees with numer-
ical attributes. Inform. Sci. 165, 73–90 (2004)
14. Bondu, A., Boulle, M., Lemaire, V.: A non-parametric semi-supervised discretization method.
Knowl. Inf. Syst. 24, 35–57 (2010)
15. Boulle, M.: Khiops: a statistical discretization method of continuous attributes. Mach. Learn.
55, 53–69 (2004)
16. Boullé, M.: MODL: a bayes optimal discretization method for continuous attributes. Mach.
Learn. 65(1), 131–165 (2006)
17. Breiman, L., Friedman, J., Stone, C.J., Olshen, R.A.: Classification and Regression Trees.
Chapman and Hall/CRC, New York (1984)
18. Butterworth, R., Simovici, D.A., Santos, G.S., Ohno-Machado, L.: A greedy algorithm for
supervised discretization. J. Biomed. Inform. 37, 285–292 (2004)
19. Catlett, J.: On changing continuous attributes into ordered discrete attributes. In European
Working Session on Learning (EWSL), Lecture Notes in Computer Science, vol. 482, pp.
164–178. Springer (1991)
20. Cerquides, J., Mantaras, R.L.D.: Proposal and empirical comparison of a parallelizable
distance-based discretization method. In: Proceedings of the Third International Conference
on Knowledge Discovery and Data Mining (KDD), pp. 139–142 (1997)
21. Chan, C., Batur, C., Srinivasan, A.: Determination of quantization intervals in rule based model
for dynamic systems. In: Proceedings of the Conference on Systems, Man and Cybernetics,
pp. 1719–1723 (1991)
22. Chao, S., Li, Y.: Multivariate interdependent discretization for continuous attribute. Proc. Third
Int. Conf. Inf. Technol. Appl. (ICITA) 2, 167–172 (2005)
23. Chen, C.W., Li, Z.G., Qiao, S.Y., Wen, S.P.: Study on discretization in rough set based on genetic
algorithm. In: Proceedings of the Second International Conference on Machine Learning and
Cybernetics (ICMLC), pp. 1430–1434 (2003)
24. Ching, J.Y., Wong, A.K.C., Chan, K.C.C.: Class-dependent discretization for inductive learning
from continuous and mixed-mode data. IEEE Trans. Pattern Anal. Mach. Intell. 17, 641–651
(1995)
25. Chlebus, B., Nguyen, S.H.: On finding optimal discretizations for two attributes. Lect. Notes
Artif. Intell. 1424, 537–544 (1998)
26. Chmielewski, M.R., Grzymala-Busse, J.W.: Global discretization of continuous attributes as
preprocessing for machine learning. Int. J. Approximate Reasoning 15(4), 319–331 (1996)
27. Chou, P.A.: Optimal partitioning for classification and regression trees. IEEE Trans. Pattern
Anal. Mach. Intell. 13, 340–354 (1991)
28. Cios, K.J., Kurgan, L.A., Dick, S.: Highly scalable and robust rule learner: performance eval-
uation and comparison. IEEE Trans. Syst. Man Cybern. Part B 36, 32–53 (2006)
29. Cios, K.J., Pedrycz, W., Swiniarski, R.W., Kurgan, L.A.: Data Mining: A Knowledge Discovery
Approach. Springer, New York (2007)
30. Clarke, E.J., Barton, B.A.: Entropy and MDL discretization of continuous variables for bayesian
belief networks. Int. J. Intell. Syst. 15, 61–92 (2000)
31. Cohen, J.: A coefficient of agreement for nominal scales. Educ. Psychol. Measur. 20, 37–46
(1960)
32. Cohen, W.W.: Fast Effective Rule Induction. In: Proceedings of the Twelfth International
Conference on Machine Learning (ICML), pp. 115–123 (1995)
33. Dai, J.H.: A genetic algorithm for discretization of decision systems. In: Proceedings of the
Third International Conference on Machine Learning and Cybernetics (ICMLC), pp. 1319–
1323 (2004)
34. Dai, J.H., Li, Y.X.: Study on discretization based on rough set theory. In: Proceedings of the
First International Conference on Machine Learning and Cybernetics (ICMLC), pp. 1371–1373
(2002)
35. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res.
7, 1–30 (2006)
36. Dougherty, J., Kohavi, R., Sahami, M.: Supervised and unsupervised discretization of contin-
uous features. In: Proceedings of the Twelfth International Conference on Machine Learning
(ICML), pp. 194–202 (1995)
37. Elomaa, T., Kujala, J., Rousu, J.: Practical approximation of optimal multivariate discretization.
In: Proceedings of the 16th International Symposium on Methodologies for Intelligent Systems
(ISMIS), pp. 612–621 (2006)
38. Elomaa, T., Rousu, J.: General and efficient multisplitting of numerical attributes. Mach. Learn.
36, 201–244 (1999)
39. Elomaa, T., Rousu, J.: Necessary and sufficient pre-processing in numerical range discretiza-
tion. Knowl. Inf. Syst. 5, 162–182 (2003)
40. Elomaa, T., Rousu, J.: Efficient multisplitting revisited: Optima-preserving elimination of par-
tition candidates. Data Min. Knowl. Disc. 8, 97–126 (2004)
41. Fayyad, U.M., Irani, K.B.: Multi-interval discretization of continuous-valued attributes for
classification learning. In: Proceedings of the 13th International Joint Conference on Artificial
Intelligence (IJCAI), pp. 1022–1029 (1993)
42. Ferrandiz, S., Boullé, M.: Multivariate discretization by recursive supervised bipartition of
graph. In: Proceedings of the 4th Conference on Machine Learning and Data Mining (MLDM),
pp. 253–264 (2005)
43. Flores, J.L., Inza, I., Larrañaga, P.: Wrapper discretization by means of estimation of
distribution algorithms. Intell. Data Anal. 11(5), 525–545 (2007)
44. Flores, M.J., Gámez, J.A., Martínez, A.M., Puerta, J.M.: Handling numeric attributes when
comparing bayesian network classifiers: does the discretization method matter? Appl. Intell.
34, 372–385 (2011)
45. Friedman, N., Goldszmidt, M.: Discretizing continuous attributes while learning bayesian net-
works. In: Proceedings of the 13th International Conference on Machine Learning (ICML),
pp. 157–165 (1996)
46. Gaddam, S.R., Phoha, V.V., Balagani, K.S.: K-Means+ID3: a novel method for supervised
anomaly detection by cascading k-means clustering and ID3 decision tree learning methods.
IEEE Trans. Knowl. Data Eng. 19, 345–354 (2007)
47. Gama, J., Torgo, L., Soares, C.: Dynamic discretization of continuous attributes. In: Proceedings
of the 6th Ibero-American Conference on AI: Progress in Artificial Intelligence, IBERAMIA,
pp. 160–169 (1998)
48. Garcia, E.K., Feldman, S., Gupta, M.R., Srivastava, S.: Completely lazy learning. IEEE Trans.
Knowl. Data Eng. 22, 1274–1285 (2010)
49. García, M.N.M., Lucas, J.P., Batista, V.F.L., Martín, M.J.P.: Multivariate discretization for
associative classification in a sparse data application domain. In: Proceedings of the 5th Inter-
national Conference on Hybrid Artificial Intelligent Systems (HAIS), pp. 104–111 (2010)
50. García, S., Herrera, F.: An extension on “statistical comparisons of classifiers over multiple
data sets” for all pairwise comparisons. J. Mach. Learn. Res. 9, 2677–2694 (2008)
51. García, S., Luengo, J., Sáez, J.A., López, V., Herrera, F.: A survey of discretization techniques:
taxonomy and empirical analysis in supervised learning. IEEE Trans. Knowl. Data Eng. 25(4),
734–750 (2013)
52. Giráldez, R., Aguilar-Ruiz, J., Riquelme, J., Ferrer-Troyano, F., Rodríguez-Baena, D.: Dis-
cretization oriented to decision rules generation. Frontiers Artif. Intell. Appl. 82, 275–279
(2002)
53. González-Abril, L., Cuberos, F.J., Velasco, F., Ortega, J.A.: AMEVA: an autonomous dis-
cretization algorithm. Expert Syst. Appl. 36, 5327–5332 (2009)
54. Grzymala-Busse, J.W.: A multiple scanning strategy for entropy based discretization. In: Pro-
ceedings of the 18th International Symposium on Foundations of Intelligent Systems, ISMIS,
pp. 25–34 (2009)
55. Grzymala-Busse, J.W., Stefanowski, J.: Three discretization methods for rule induction. Int. J.
Intell. Syst. 16(1), 29–38 (2001)
56. Gupta, A., Mehrotra, K.G., Mohan, C.: A clustering-based discretization for supervised learn-
ing. Stat. Probab. Lett. 80(9–10), 816–824 (2010)
57. He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21(9),
1263–1284 (2009)
58. Ho, K.M., Scott, P.D.: Zeta: A global method for discretization of continuous variables. In:
Proceedings of the Third International Conference on Knowledge Discovery and Data Mining
(KDD), pp. 191–194 (1997)
59. Holte, R.C.: Very simple classification rules perform well on most commonly used datasets.
Mach. Learn. 11, 63–90 (1993)
60. Hong, S.J.: Use of contextual information for feature ranking and discretization. IEEE Trans.
Knowl. Data Eng. 9, 718–730 (1997)
61. Hu, H.W., Chen, Y.L., Tang, K.: A dynamic discretization approach for constructing decision
trees with a continuous label. IEEE Trans. Knowl. Data Eng. 21(11), 1505–1514 (2009)
62. Ishibuchi, H., Yamamoto, T., Nakashima, T.: Fuzzy data mining: Effect of fuzzy discretization.
In: IEEE International Conference on Data Mining (ICDM), pp. 241–248 (2001)
63. Janssens, D., Brijs, T., Vanhoof, K., Wets, G.: Evaluating the performance of cost-based dis-
cretization versus entropy- and error-based discretization. Comput. Oper. Res. 33(11), 3107–
3123 (2006)
64. Jiang, F., Zhao, Z., Ge, Y.: A supervised and multivariate discretization algorithm for rough sets.
In: Proceedings of the 5th international conference on Rough set and knowledge technology,
RSKT, pp. 596–603 (2010)
65. Jiang, S., Yu, W.: A local density approach for unsupervised feature discretization. In: Proceed-
ings of the 5th International Conference on Advanced Data Mining and Applications, ADMA,
pp. 512–519 (2009)
66. Jin, R., Breitbart, Y., Muoh, C.: Data discretization unification. Knowl. Inf. Syst. 19, 1–29
(2009)
67. Kang, Y., Wang, S., Liu, X., Lai, H., Wang, H., Miao, B.: An ICA-based multivariate discretiza-
tion algorithm. In: Proceedings of the First International Conference on Knowledge Science,
Engineering and Management (KSEM), pp. 556–562 (2006)
68. Kerber, R.: Chimerge: Discretization of numeric attributes. In: National Conference on Artifical
Intelligence American Association for Artificial Intelligence (AAAI), pp. 123–128 (1992)
69. Kononenko, I., Sikonja, M.R.: Discretization of continuous attributes using relieff. In: Pro-
ceedings of Elektrotehnika in Racunalnika Konferenca (ERK) (1995)
70. Kurgan, L.A., Cios, K.J.: CAIM discretization algorithm. IEEE Trans. Knowl. Data Eng. 16(2),
145–153 (2004)
71. Kurtcephe, M., Güvenir, H.A.: A discretization method based on maximizing the area under
receiver operating characteristic curve. Int. J. Pattern Recognit. Artif. Intell. 27(1), 8 (2013)
72. Lee, C.H.: A hellinger-based discretization method for numeric attributes in classification
learning. Knowl. Based Syst. 20, 419–425 (2007)
73. Li, R.P., Wang, Z.O.: An entropy-based discretization method for classification rules with
inconsistency checking. In: Proceedings of the First International Conference on Machine
Learning and Cybernetics (ICMLC), pp. 243–246 (2002)
74. Li, W.L., Yu, R.H., Wang, X.Z.: Discretization of continuous-valued attributes in decision tree
generation. In: Proocedings of the Second International Conference on Machine Learning and
Cybernetics (ICMLC), pp. 194–198 (2010)
75. Liu, H., Hussain, F., Tan, C.L., Dash, M.: Discretization: an enabling technique. Data Min.
Knowl. Disc. 6(4), 393–423 (2002)
76. Liu, H., Setiono, R.: Feature selection via discretization. IEEE Trans. Knowl. Data Eng. 9,
642–645 (1997)
77. Liu, L., Wong, A.K.C., Wang, Y.: A global optimal algorithm for class-dependent discretization
of continuous data. Intell. Data Anal. 8, 151–170 (2004)
78. Liu, X., Wang, H.: A discretization algorithm based on a heterogeneity criterion. IEEE Trans.
Knowl. Data Eng. 17, 1166–1173 (2005)
79. Ludl, M.C., Widmer, G.: Relative unsupervised discretization for association rule mining, In:
Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Dis-
covery, The Fourth European Conference on Principles and Practice of Knowledge Discovery
in Databases (PKDD), pp. 148–158 (2000)
80. Macskassy, S.A., Hirsh, H., Banerjee, A., Dayanik, A.A.: Using text classifiers for numerical
classification. In: Proceedings of the 17th International Joint Conference on Artificial Intelli-
gence, Vol. 2 (IJCAI), pp. 885–890 (2001)
81. Mehta, S., Parthasarathy, S., Yang, H.: Toward unsupervised correlation preserving discretiza-
tion. IEEE Trans. Knowl. Data Eng. 17, 1174–1185 (2005)
82. Monti, S., Cooper, G.: A latent variable model for multivariate discretization. In: Proceedings
of the Seventh International Workshop on AI & Statistics (Uncertainty) (1999)
83. Monti, S., Cooper, G.F.: A multivariate discretization method for learning bayesian networks
from mixed data. In: Proceedings on Uncertainty in Artificial Intelligence (UAI), pp. 404–413
(1998)
105. Tay, F.E.H., Shen, L.: A modified chi2 algorithm for discretization. IEEE Trans. Knowl. Data
Eng. 14, 666–670 (2002)
106. Tsai, C.J., Lee, C.I., Yang, W.P.: A discretization algorithm based on class-attribute contin-
gency coefficient. Inf. Sci. 178, 714–731 (2008)
107. Vannucci, M., Colla, V.: Meaningful discretization of continuous features for association rules
mining by means of a SOM. In: Proocedings of the 12th European Symposium on Artificial
Neural Networks (ESANN), pp. 489–494 (2004)
108. Ventura, D., Martinez, T.R.: BRACE: A paradigm for the discretization of continuously valued
data, In: Proceedings of the Seventh Annual Florida AI Research Symposium (FLAIRS), pp.
117–121 (1994)
109. Ventura, D., Martinez, T.R.: An empirical comparison of discretization methods. In: Proceed-
ings of the 10th International Symposium on Computer and Information Sciences (ISCIS), pp.
443–450 (1995)
110. Wang, K., Liu, B.: Concurrent discretization of multiple attributes. In: Proceedings of the
Pacific Rim International Conference on Artificial Intelligence (PRICAI), pp. 250–259 (1998)
111. Wang, S., Min, F., Wang, Z., Cao, T.: OFFD: Optimal flexible frequency discretization for
naive bayes classification. In: Proceedings of the 5th International Conference on Advanced
Data Mining and Applications, ADMA, pp. 704–712 (2009)
112. Wei, H.: A novel multivariate discretization method for mining association rules. In: 2009
Asia-Pacific Conference on Information Processing (APCIP), pp. 378–381 (2009)
113. Wilcoxon, F.: Individual comparisons by ranking methods. Biometrics 1, 80–83 (1945)
114. Wilson, D.R., Martinez, T.R.: Reduction techniques for instance-based learning algorithms.
Mach. Learn. 38(3), 257–286 (2000)
115. Wong, A.K.C., Chiu, D.K.Y.: Synthesizing statistical knowledge from incomplete mixed-
mode data. IEEE Trans. Pattern Anal. Mach. Intell. 9, 796–805 (1987)
116. Wu, M., Huang, X.C., Luo, X., Yan, P.L.: Discretization algorithm based on difference-
similitude set theory. In: Proceedings of the Fourth International Conference on Machine
Learning and Cybernetics (ICMLC), pp. 1752–1755 (2005)
117. Wu, Q., Bell, D.A., Prasad, G., McGinnity, T.M.: A distribution-index-based discretizer for
decision-making with symbolic ai approaches. IEEE Trans. Knowl. Data Eng. 19, 17–28 (2007)
118. Wu, Q., Cai, J., Prasad, G., McGinnity, T.M., Bell, D.A., Guan, J.: A novel discretizer for
knowledge discovery approaches based on rough sets. In: Proceedings of the First International
Conference on Rough Sets and Knowledge Technology (RSKT), pp. 241–246 (2006)
119. Wu, X.: A bayesian discretizer for real-valued attributes. Comput. J. 39, 688–691 (1996)
120. Wu, X., Kumar, V. (eds.): The Top Ten Algorithms in Data Mining. Data Mining and Knowl-
edge Discovery. Chapman and Hall/CRC, Taylor and Francis, Boca Raton (2009)
121. Yang, P., Li, J.S., Huang, Y.X.: HDD: a hypercube division-based algorithm for discretisation.
Int. J. Syst. Sci. 42(4), 557–566 (2011)
122. Yang, Y., Webb, G.I.: Discretization for naive-bayes learning: managing discretization bias
and variance. Mach. Learn. 74(1), 39–74 (2009)
123. Yang, Y., Webb, G.I., Wu, X.: Discretization methods. In: Data Mining and Knowledge Dis-
covery Handbook, pp. 101–116 (2010)
124. Zhang, G., Hu, L., Jin, W.: Discretization of continuous attributes in rough set theory and
its application. In: Proceedings of the 2004 IEEE Conference on Cybernetics and Intelligent
Systems (CIS), pp. 1020–1026 (2004)
125. Zhu, W., Wang, J., Zhang, Y., Jia, L.: A discretization algorithm based on information distance
criterion and ant colony optimization algorithm for knowledge extracting on industrial database.
In: Proceedings of the 2010 IEEE International Conference on Mechatronics and Automation
(ICMA), pp. 1477–1482 (2010)
126. Zighed, D.A., Rabaséda, S., Rakotomalala, R.: FUSINTER: a method for discretization of
continuous attributes. Int. J. Uncertainty, Fuzziness Knowl. Based Syst. 6, 307–326 (1998)
127. Zighed, D.A., Rakotomalala, R., Feschet, F.: Optimal multiple intervals discretization of con-
tinuous attributes for supervised learning. In: Proceedings of the First International Conference
on Knowledge Discovery and Data Mining (KDD), pp. 295–298 (1997)
Chapter 10
A Data Mining Software Package Including
Data Preparation and Reduction: KEEL
Abstract KEEL software is an open source Data Mining tool widely used in research
and real life applications. Most of the algorithms described, if not all of them, through-
out the book are actually implemented and publicly available in this Data Mining
platform. Since KEEL enables the user to create and run single or concatenated
preprocessing techniques in the data, such software is carefully introduced in this
section, intuitively guiding the reader across the steps needed to set up all the data
preparations that might be needed. It is also interesting to note that the experimen-
tal analyses carried out in this book have been created using KEEL, allowing the
consultant to quickly compare and adapt the results presented here. An extensive
revision of Data Mining software tools is presented in Sect. 10.1. Among them,
we will focus on the open source KEEL platform in Sect. 10.2 providing details of
its main features and usage. For the practitioner's interest, the most commonly used
data sources are introduced in Sect. 10.3, and the steps needed to integrate any new
algorithm in it in Sect. 10.4. Once the results have been obtained, the appropriate
comparison guidelines are provided in Sect. 10.5. The most important aspects of the
tool are summarized in Sect. 10.6.
As we have indicated in Chap. 1, Data Mining (DM) is the process for automatic
discovery of high level knowledge by obtaining information from real world, large
and complex data sets [1], and is the core step of a broader process, called KDD. In
addition to the DM step, the KDD process includes the application of several preprocess-
ing methods aimed at facilitating the application of DM algorithms and postprocessing
methods for refining and improving the discovered knowledge. The evolution of the
available techniques and their wide adoption demands that all the steps involved
in the KDD process be gathered in as few pieces of software as possible, for the sake
of easier application and comparison of the results obtained, while still allowing non-
expert practitioners to have access to KDD techniques.
Many DM software tools have been developed in the last few years due to the
popularization of DM. Although a lot of them are commercially distributed (some
of the leading commercial software are mining suites such as SPSS Clementine,1
Oracle Data Mining2 and KnowledgeSTUDIO3 ), only a few were available as open
source. Fortunately this tendency has changed and free and open source DM tools
have appeared to cover many specialized tasks in the process as well as general tools
that include most of the steps of KDD. Among the latter we can highlight Weka
[2], Orange [3] or Java-ML [4] as the most well-known of a growing family of open
source toolboxes for DM.
Most programming languages have a DM software so any user has the possibility
of performing experiments. While Weka, RapidMiner4 [5], Java-ML and αMiner
are written in Java, ADaM5 and Orange are written in Python. Statistical languages
also have their own software tools, such as Rattle [6] for R.
It is also common to find libraries for some popular programming languages that
can be added to a particular project. Their aim is not the novice user but the experienced
practitioner who wants to add functionality to real-world cases without dealing with a
multi-purpose GUI or having to extract the methods they want. A well-known library
written in C++ for fast programs is MLC++,6 and R has its own statistical analysis
packages.7 In Java, the MLJ library8 is available to be integrated in any project with
ease.
Apart from the aforementioned toolboxes, the reader can find more alternatives
to suit their needs. Many specialized webpages are devoted to the presentation,
promotion and publishing of DM news and software. We recommend visiting the
KDnuggets software directory9 and the-Data-Mine site.10 In the research field open
source tools are playing an increasingly important role as is pointed out in [7]. To this
regard the link page of the Knowledge Extraction based on Evolutionary Learning
(KEEL) webpage11 contains an extensive list of open source DM tools and related
fields such as metaheuristic optimization.
KEEL [8] is an open source Java software tool which empowers the user to assess
the behavior of ML, evolutionary learning and soft computing based techniques for
different kinds of DM problems: regression, classification, clustering, pattern mining
and so on. This tool can offer several advantages:
1 http://www.spss.com/clementine.
2 http://www.oracle.com/technology/products/bi/odm.
3 http://www.angoss.com/products/studio/index.php.
4 http://sourceforge.net/projects/rapidminer/.
5 http://projects.itsc.uah.edu/datamining/adam/.
6 http://www.sgi.com/tech/mlc/.
7 http://www.r-project.org/.
8 http://www.kddresearch.org/Groups/Machine-Learning/MLJ/.
9 http://www.kdnuggets.com/software.
10 http://the-data-mine.com/bin/view/Software.
11 http://sci2s.ugr.es/keel/links.php.
KEEL12 is a software tool that facilitates the analysis of the behaviour of ML methods
in different areas of learning and pre-processing tasks, making the management of these
techniques easy for the user.
employed models in each methodology, such as feature and instance selection [9, 10],
decision trees [11], SVMs [12], noise filters [13], lazy learning [14], evolutionary
fuzzy rule learning [15], genetic ANNs [16], Learning Classifier Systems [17], and
many more.
The currently available version of KEEL consists of the following function blocks13:
• Data Management: This part is made up of a set of tools that can be used to build
new data, to export and import data in other formats to or from the KEEL format, to
edit and visualize data, to apply transformations and partitioning to the data, etc…
• Design of Experiments (off-line module): The aim of this part is the design of the
desired experimentation over the selected data sets and providing for many options
in different areas: type of validation, type of learning (classification, regression,
unsupervised learning), etc…
12 http://keel.es.
13 http://www.keel.es/software/prototypes/version1.0/ManualKeel.pdf.
14 http://jclec.sourceforge.net/.
15 http://www.keel.es/software/prototypes/version1.0/AlgorithmsList.pdf.
Figure 10.1 shows an example window of the Data Management module in the
section of Data Visualization. The module has seven sections, each of which is
accessible through the buttons on the left side of the window. In the following, we
will briefly describe them:
• Creation of a new data set: This option allows us to generate a new data set
compatible with the other KEEL modules.
• Import data to KEEL format: Since KEEL works with a specific data format
(similar to the ARFF format) in all its modules, this section allows us to convert
various data formats to the KEEL format, such as CSV, XML or ARFF, or by extracting
data from databases, etc.
• Export data from KEEL format: This option is the reverse of the previous one.
It converts the data handled by KEEL procedures into other external formats to
establish compatibility with other software tools.
• Visualization of data: This option is used to represent and visualize the data. With
it, we can see a graphical distribution of each attribute and comparisons between
two attributes.
• Edition of data: This area is dedicated to managing the data manually. The data
set, once loaded, can be edited by modifying values, adding or removing rows and
columns, etc.
• Data Partition: This zone allows us to make the partitions of data needed by the
experiment modules to validate results. It supports k-FCV, 5 × 2-CV and hold-out
validation with stratified partition.
In the last few years, a large number of DM software tools have been developed for
research purposes. Some of them are libraries that allow reductions in programming
work when developing new algorithms: ECJ [21], JCLEC [18], learning classifier
systems [22], etc. Others are DM suites that incorporate learning algorithms (some
of them may use EAs for this task) and provide a mechanism to establish comparisons
among them. Some examples are Weka [2], D2K [23], etc.
This module is a Graphical User Interface (GUI) that allows the design of exper-
iments for solving various problems of regression, classification and unsupervised
learning. Having designed the experiments, it generates the directory structure and
files required for running them in any local machine with Java (see Fig. 10.2).
The experiments are graphically modeled, based on data flow and represented
by graphs with node-edge connections. To design an experiment, we first have to
indicate the type of validation (k-FCV [24] or 5 × 2-CV [25]) and the type of learning
(regression, classification or unsupervised) to be used. Then, we have to select the data
sources, drag the selected methods into the workspace and connect methods and data
sets, combining the evolutionary learning algorithms with different pre-processing
and post-processing techniques, if needed. Finally, we can add statistical tests to
achieve a complete analysis of the methods being studied, and a report box to obtain a
summary of the results. Notice that each component of the experiment is configured
in separate dialogues that can be opened by double-clicking the respective node.
Fig. 10.2 Experiment design flow: obtain the directory structure with the required files (exe, scripts, dataset, results)
10.3 KEEL-Dataset
The categories of the data sets have been derived from the topics addressed in the
experimental studies. Some of them are usually found in the literature, like supervised
(classification) data sets, unsupervised and regression problems. On the other hand,
new categories which have not been tackled or separated yet are also present. The
categories in which the data sets are divided are the following:
• Classification problems. This category includes all the supervised data sets. All
these data sets contain one or more attributes which label the instances, mapping
them into different classes. We distinguish three subcategories of classification
data sets:
– Standard data sets.
– Imbalanced data sets [29–31]. Imbalanced data sets are standard classification
data sets where the class distribution is highly skewed among the classes.
16 http://keel.es/datasets.php.
– Multi instance data sets [32]. Multi-Instance data sets represent problems where
there is a many-to-one relationship between feature vectors and their output
attribute.
• Regression problems. These are data sets with a real valued output attribute, and
the objective is to better approximate this output value using the input attributes.
• Unsupervised (Clustering and Associations) problems. Unsupervised data sets
represent a set of data whose examples have not been labeled.
• Low quality data [33]. In this category the data sets which contain imprecise
values in their input attributes are included, caused by noise or restrictions in the
measurements. Therefore these low quality data sets can contain a mixture of crisp
and fuzzy values. This is a unique category.
In Fig. 10.6 the webpage for the classification standard data sets is shown as an
illustrative example of a particular category webpage. These webpages are structured
in two main sections:
• First, the structure of the header of this type of KEEL data set file is pointed out. This description contains the tags used to identify the different attributes, the name of the data set, and the starting point of the data.
• The second part is an enumeration of the different data sets contained in the webpage, presented in a table. The table shows the characteristics of all the data sets: the name of the data set, the number of attributes, the number of examples and the number of classes (if applicable). Moreover, the entire data set or different kinds of partitions can be downloaded in KEEL format as a ZIP file. A header file with specific information about the data set is also available.
The table's columns can also be sorted according to the different data set characteristics, such as the number of attributes or examples.
Clicking on the name of the data set in the table will open the specific webpage
for this data set. This webpage is composed of tables which gather all information
available on the data set.
• The first table will always contain the general information of the data set: name,
number of attributes, number of instances, number of classes, presence of MVs,
etc.
• The second table lists the attributes of the data set. For each attribute, the domain of its values is given. If it is a numerical attribute, the minimum and maximum values of the domain are presented. In the case of nominal attributes, the complete set of values is shown. The class attribute (if applicable) is highlighted with a different color.
Additional information about the data set is also included, indicating its origin, applications and nature. In a second part of the webpage, the complete data set and a number of partitions can be downloaded in KEEL format.
This section contains the links to the different experimental studies for the respective data set categories. For each category, a new webpage has been built. See Fig. 10.7 for the webpage devoted to the experimental studies with standard classification data sets. These webpages contain the journal publications that use the corresponding kind of data sets in the repository. The papers are grouped by publication year. Each paper can contain up to four links:
Fig. 10.7 Keel-dataset experimental studies with standard classification data sets webpage
In this section we describe the main features that any researcher must take into account to integrate a new algorithm into the KEEL software tool. A simple coding example is then provided in order to clarify the integration process.
Next we show a valid example of a Method Configuration file (the data file lists are not fully shown):
algorithm = Genetic Algorithm
inputData = "../datasets/iris/iris.dat" ...
outputData = "../results/iris/result0.tra" ...
Seed = 12345678
Number of Generations = 1000
Crossover Probability = 0.9
Mutation Probability = 0.1
...
A complete description of the parameters file can be found in Sect. 3 of the KEEL
Manual.
• The input data sets follow a specific format that extends the "arff" files by completing the header with additional metadata about the attributes of the problem. Then the list of examples is included, given in rows with the attribute values separated by commas (a sketch of both file formats is shown after this list).
For more information about the input data set files please refer to Sect. 4 of the KEEL Manual. Furthermore, in order to ease the data management, we have
developed an API data set, the main features of which are described in Sect. 7 of
the Manual.
• The output format consists of a header, which follows the same scheme as the input data, and two columns with the output values for each example, separated by whitespace. The first value corresponds to the expected output, and the second one to the predicted value. All methods must generate two output files: one for training and another one for testing.
For more information about the obligatory output files please refer to Sect. 5 of
the KEEL Manual.
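To make these two constraints more concrete, the following hand-written sketch illustrates the shape of both files; the header tags and the iris attribute shown here are an approximation written for this illustration, not an excerpt from the KEEL Manual. An input data file looks roughly like:

@relation iris
@attribute petalLength real [1.0, 6.9]
@attribute class {Iris-setosa, Iris-versicolor, Iris-virginica}
@inputs petalLength
@outputs class
@data
1.4, Iris-setosa
4.7, Iris-versicolor
...

and a corresponding training or test output file repeats the header scheme and then lists, for each example, the expected value followed by the predicted one:

@relation iris
...
@data
Iris-setosa Iris-setosa
Iris-versicolor Iris-virginica
...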
Although the list of constraints is short, the KEEL development team has created a simple template that manages all these features. Our KEEL template includes four classes (a sketch of how they cooperate is shown after the list):
1. Main: This class contains the main instructions for launching the algorithm. It
reads the parameters from the file and builds the “algorithm object”.
public class Main {
    ...
}
2. ParseParameters: This class manages all the parameters, from the input and
output files, to every single parameter stored in the parameters file.
public class parseParameters {

    public parseParameters() {
        inputFiles = new ArrayList<String>();
        outputFiles = new ArrayList<String>();
        parameters = new ArrayList<String>();
    }
    ...
}
3. myDataset: This class is an interface between the classes of the API data set
and the algorithm. It contains the basic options related to data access.
public class myDataset {

    private double[][] X;
    private double[] outputReal;
    private String[] output;

    public myDataset() {
        IS = new InstanceSet();
    }
    ...
}
4. Algorithm: This class is devoted to storing the main variables of the algorithm
and naming the different procedures for the learning stage. It also contains the
functions for writing the obligatory output files.
public class Algorithm {
    ...
        outputTr = parameters.getTrainingOutputFile();
        ...
    }
}
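As a rough sketch of how these four classes cooperate, a Main class could launch a method as follows; the calls parseConfigurationFile and execute are placeholder names chosen for this illustration, not necessarily those of the actual KEEL template:

public class Main {
    public static void main(String[] args) {
        parseParameters parameters = new parseParameters();
        parameters.parseConfigurationFile(args[0]); // placeholder: reads the Method Configuration file
        Algorithm method = new Algorithm(parameters); // builds the "algorithm object", which accesses the data through myDataset
        method.execute(); // placeholder: runs the learning stage and writes the training and test output files
    }
}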
In the next section we will explain in detail how to encode a simple algorithm within the KEEL software tool.
Nowadays, the use of statistical tests to improve the evaluation of the performance of a new method has become a widespread technique in the field of DM [34–36]. Usually, they are employed inside the framework of any experimental analysis to decide when an algorithm is better than another one. This task, which may not be trivial, has become necessary to confirm whether a newly proposed method offers a significant improvement over the existing methods for a given problem.
Two kinds of tests exist: parametric and non-parametric, depending on the concrete type of data employed. As a general rule, a non-parametric test is less restrictive than a parametric one, although it is less robust than a parametric test when the data are well conditioned.
Parametric tests have been commonly used in the analysis of experiments in DM.
For example, a common way to test whether the difference between the results of
two algorithms is non-random is to compute a paired t-test, which checks whether
the average difference in their performance over the data sets is significantly differ-
ent from zero. When comparing a set of multiple algorithms, the common statistical
method for testing the differences between more than two related sample means is the
repeated-measures ANOVA (or within-subjects ANOVA) [37]. Unfortunately, parametric tests are based on assumptions which are most probably violated when analyzing the performance of computational intelligence and DM algorithms [38–40]. These assumptions are known as independence, normality and homoscedasticity.
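As a reminder of what the paired t-test computes in this setting, the following minimal sketch (our own illustration, not part of KEEL) derives the statistic from the per-data-set accuracy differences of two algorithms; it is then compared against a t distribution with n − 1 degrees of freedom:

// Illustrative sketch: paired t statistic over the accuracy differences of two
// algorithms measured on the same n data sets.
public class PairedTTest {

    public static double tStatistic(double[] accA, double[] accB) {
        int n = accA.length;
        double mean = 0.0;
        for (int i = 0; i < n; i++) mean += (accA[i] - accB[i]) / n;
        double var = 0.0; // sample variance of the differences
        for (int i = 0; i < n; i++) {
            double d = (accA[i] - accB[i]) - mean;
            var += d * d / (n - 1);
        }
        return mean / Math.sqrt(var / n); // t with n - 1 degrees of freedom
    }
}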
Nonparametric tests can be employed in the analysis of experiments, providing the researcher with a practical tool to use when the previous assumptions cannot be satisfied. Although they were originally designed for dealing with nominal or ordinal data, it is possible to conduct ranking-based transformations to adjust the input data to the test requirements. Several nonparametric methods for pairwise and multiple comparisons are available to adequately contrast the results obtained in any Computational Intelligence experiment. A wide description of the topic, with examples, case studies and bibliographic recommendations, can be found in the SCI2S thematic public website on Statistical Inference in Computational Intelligence and Data Mining.17
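The ranking-based transformation mentioned above is easy to sketch (again our own illustration, not KEEL code): for every data set the algorithms are ranked by accuracy, 1 being the best and ties sharing the average rank, and each algorithm is then summarized by its average rank over all data sets.

import java.util.*;

// Illustrative sketch: average Friedman-style ranks from acc[dataset][algorithm].
public class FriedmanRanks {

    public static double[] averageRanks(double[][] acc) {
        int n = acc.length, k = acc[0].length;
        double[] avg = new double[k];
        for (double[] row : acc) {
            Integer[] order = new Integer[k];
            for (int j = 0; j < k; j++) order[j] = j;
            Arrays.sort(order, (a, b) -> Double.compare(row[b], row[a])); // higher accuracy first
            double[] rank = new double[k];
            for (int pos = 0; pos < k; pos++) {
                int start = pos;
                while (pos + 1 < k && row[order[pos + 1]] == row[order[start]]) pos++;
                double shared = (start + pos) / 2.0 + 1; // tied methods share the average rank
                for (int t = start; t <= pos; t++) rank[order[t]] = shared;
            }
            for (int j = 0; j < k; j++) avg[j] += rank[j] / n;
        }
        return avg;
    }
}

These average ranks are the quantities on which Friedman-type tests operate.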
KEEL is one of the few DM software tools that provide the researcher with a complete set of statistical procedures for pairwise and multiple comparisons. Inside the KEEL environment, several parametric and non-parametric procedures have been coded, which should help to contrast the results obtained in any experiment performed with the software tool. These tests follow the same methodology as the rest of the elements of KEEL, facilitating both their employment and their integration inside a complete experimental study.
17 http://sci2s.ugr.es/sicidm/.
Table 10.1 shows the procedures included in the KEEL statistical package. For each test, a reference and a brief description are given (an extended description can be found in the Statistical Inference in Computational Intelligence and Data Mining website and in the KEEL website18).
In this section, we present a case study as an example of the functionality and process
of creating an experiment with the KEEL software tool. This experimental study
is focused on the comparison between the new algorithm imported (SGERD) and
several evolutionary rule-based algorithms, and employs a set of supervised classi-
fication domains available in KEEL-dataset. Several statistical procedures available
in the KEEL software tool will be employed to contrast the results obtained.
Five representative evolutionary rule learning methods have been selected to carry out
the experimental study: Ant-Miner, CO-Evolutionary Rule Extractor (CORE), HIer-
archical DEcision Rules (HIDER), Steady-State Genetic Algorithm for Extracting
Fuzzy Classification Rules From Data (SGERD) and Tree Analysis with Randomly
Generated and Evolved Trees (TARGET) methodology. Table 10.2 shows their ref-
erences and gives a brief description of each one.
18 http://www.keel.es.
On the other hand, we have used 24 well-known classification data sets (they are publicly available on the KEEL-dataset repository web page,19 including general information about them, partitions and so on) in order to check the performance of these methods. Table 10.3 shows their main characteristics, where #Ats is the number of attributes, #Ins is the number of instances and #Cla is the number of classes. We have employed a 10-FCV procedure as a validation scheme to perform the experiments.
19 http://www.keel.es/datasets.php.
This subsection describes and discusses the results obtained from the previous exper-
iment configuration. Tables 10.5 and 10.6 show the results obtained in training and
test stages, respectively. For each data set, the average and standard deviations in
accuracy obtained by the module Vis-Clas-Tabular are shown, with the best results
stressed in boldface.
Focusing on the test results, the average accuracy obtained by Hider is the highest
one. However, this estimator does not reflect whether or not the differences among
the methods are significant. For this reason, we have carried out a statistical analysis based on multiple comparison procedures (see http://sci2s.ugr.es/sicidm/ for a full description).
Table 10.5 Average results and standard deviations of training accuracy obtained
Data set    Ant Miner (Mean, SD)    CORE (Mean, SD)    HIDER (Mean, SD)    SGERD (Mean, SD)    TARGET (Mean, SD)
HAB 79.55 1.80 76.32 1.01 76.58 1.21 74.29 0.81 74.57 1.01
IRI 97.26 0.74 95.48 1.42 97.48 0.36 97.33 0.36 93.50 2.42
BAL 73.65 3.38 68.64 2.57 75.86 0.40 76.96 2.27 77.29 1.57
NTH 99.17 0.58 92.66 1.19 95.97 0.83 90.23 0.87 88.05 2.19
MAM 81.03 1.13 79.04 0.65 83.60 0.75 74.40 1.43 79.91 0.65
BUP 80.38 3.25 61.93 0.89 73.37 2.70 59.13 0.68 68.86 0.89
MON 97.22 0.30 87.72 7.90 97.22 0.30 80.56 0.45 97.98 7.90
CAR 77.95 1.82 79.22 1.29 70.02 0.02 67.19 0.08 77.82 0.29
ECO 87.90 1.27 67.03 3.69 88.59 1.77 73.02 0.86 66.22 4.69
LED 59.42 1.37 28.76 2.55 77.64 0.42 40.22 5.88 34.24 3.55
PIM 71.86 2.84 72.66 2.62 77.82 1.16 73.71 0.40 73.42 2.62
GLA 81.48 6.59 54.26 1.90 90.09 1.64 53.84 2.96 45.07 0.90
WIS 92.58 1.65 94.71 0.64 97.30 0.31 93.00 0.85 96.13 0.64
TAE 69.62 2.21 69.46 1.20 69.94 0.53 69.94 0.53 69.96 2.20
WIN 99.69 0.58 99.06 0.42 97.19 0.98 91.76 1.31 85.19 1.58
CLE 60.25 1.35 56.30 1.97 82.04 1.75 46.62 2.23 55.79 2.97
HOU 94.28 1.84 96.98 0.43 96.98 0.43 96.98 0.43 96.98 0.43
LYM 77.11 5.07 65.99 5.43 83.70 2.52 77.48 3.55 75.84 4.43
VEH 59.52 3.37 36.49 3.52 84.21 1.71 51.47 1.19 51.64 2.52
BAN 67.61 3.21 66.71 2.01 87.13 2.15 63.84 0.74 71.14 2.01
GER 71.14 1.19 70.60 0.63 73.54 0.58 67.07 0.81 70.00 1.37
AUT 69.03 8.21 31.42 7.12 96.58 0.64 52.56 1.67 45.66 6.12
DER 86.18 5.69 31.01 0.19 94.91 1.40 72.69 1.04 66.24 1.81
SON 74.68 0.79 53.37 0.18 98.29 0.40 75.69 1.47 76.87 1.18
Average 79.52 2.51 68.16 2.14 86.09 1.04 71.76 1.37 72.43 2.33
Table 10.6 Average results and standard deviations of test accuracy obtained
Data set    Ant Miner (Mean, SD)    CORE (Mean, SD)    HIDER (Mean, SD)    SGERD (Mean, SD)    TARGET (Mean, SD)
HAB 72.55 5.27 72.87 4.16 75.15 4.45 74.16 2.48 71.50 2.52
IRI 96.00 3.27 92.67 4.67 96.67 3.33 96.67 3.33 92.93 4.33
BAL 70.24 6.21 70.08 7.11 69.60 3.77 75.19 6.27 75.62 7.27
NTH 90.76 6.85 90.76 5.00 90.28 7.30 88.44 6.83 86.79 5.83
MAM 81.48 7.38 77.33 3.55 82.30 6.50 74.11 5.11 79.65 2.11
BUP 57.25 7.71 61.97 4.77 65.83 10.04 57.89 3.41 65.97 1.41
MON 97.27 2.65 88.32 8.60 97.27 2.65 80.65 4.15 96.79 5.15
CAR 77.26 2.59 79.40 3.04 70.02 0.16 67.19 0.70 77.71 2.70
ECO 58.58 9.13 64.58 4.28 75.88 6.33 72.08 7.29 65.49 4.29
LED 55.32 4.13 27.40 4.00 68.20 3.28 40.00 6.75 32.64 6.75
PIM 66.28 4.26 73.06 6.03 73.18 6.19 73.71 3.61 73.02 6.61
GLA 53.74 12.92 45.74 9.36 64.35 12.20 48.33 5.37 44.11 5.37
WIS 90.41 2.56 92.38 2.31 96.05 2.76 92.71 3.82 95.75 0.82
TAE 64.61 5.63 70.35 3.77 69.93 4.73 69.93 4.73 69.50 2.73
WIN 92.06 6.37 94.87 4.79 82.61 6.25 87.09 6.57 82.24 7.57
CLE 57.45 5.19 53.59 7.06 55.86 5.52 44.15 4.84 52.99 1.84
HOU 93.56 3.69 97.02 3.59 97.02 3.59 97.02 3.59 96.99 0.59
LYM 73.06 10.98 65.07 15.38 72.45 10.70 72.96 13.59 75.17 10.59
VEH 53.07 4.60 36.41 3.37 63.12 4.48 51.19 4.85 49.81 5.85
BAN 59.18 6.58 64.23 4.23 62.15 8.51 62.71 4.17 67.32 6.17
GER 66.90 3.96 69.30 1.55 70.40 4.29 66.70 1.49 70.00 0.49
AUT 53.74 7.79 32.91 6.10 62.59 13.84 50.67 10.27 42.82 13.27
DER 81.16 7.78 31.03 1.78 87.45 3.26 69.52 4.25 66.15 4.25
SON 71.28 5.67 53.38 1.62 52.90 2.37 73.45 7.34 74.56 8.34
Average 72.22 5.97 66.86 5.01 75.05 5.69 70.27 5.20 71.06 4.87
Table 10.8 Results of the Friedman and Iman-Davenport tests
Friedman value    p-value    Iman-Davenport value    p-value
8.408    0.0777    2.208    0.0742
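For reference, the two statistics in Table 10.8 follow the standard definitions below, where k is the number of algorithms, N the number of data sets and R_j the average rank of the j-th algorithm:

\[ \chi_F^2 = \frac{12N}{k(k+1)}\Bigl[\sum_{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4}\Bigr], \qquad F_F = \frac{(N-1)\,\chi_F^2}{N(k-1)-\chi_F^2} \]

With k = 5 and N = 24, the reported Friedman value of 8.408 yields F_F = 23 · 8.408 / (24 · 4 − 8.408) ≈ 2.208, matching the Iman-Davenport value; the former is compared against a chi-square distribution with k − 1 = 4 degrees of freedom and the latter against an F distribution with k − 1 and (k − 1)(N − 1) = 92 degrees of freedom.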
• Finally, in Table 10.9 the adjusted p-values are shown considering the best method
(Hider) as the control algorithm and using the three post-hoc procedures explained
above. The following analysis can be made:
– The procedure of Holm verifies that Hider is the best method with α = 0.10,
but it only outperforms CORE considering α = 0.05.
– The procedure of Hochberg checks the supremacy of Hider with α = 0.05. In
this case study, we can see that the Hochberg method is the one with the highest
power.
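For completeness, the decision rules behind these two post-hoc procedures can be stated as follows (standard formulations; m is the number of hypotheses compared against the control method and p_(1) ≤ · · · ≤ p_(m) are the ordered p-values):

\[ \text{Holm (step-down): reject } H_{(i)} \text{ sequentially while } p_{(i)} \le \frac{\alpha}{m-i+1}; \qquad \text{Hochberg (step-up): reject all } H_{(j)},\ j \le i^{*},\ \text{with } i^{*} = \max\Bigl\{i : p_{(i)} \le \frac{\alpha}{m-i+1}\Bigr\} \]

Since Hochberg rejects H_(j) as soon as some i ≥ j satisfies the condition, whereas Holm requires it for every i ≤ j, Hochberg rejects at least as many hypotheses, which is consistent with the higher power observed in this case study.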
10.6 Summarizing Comments
In this chapter we have introduced a series of non-commercial Java software tools and focused on a particular one, named KEEL, which provides a platform for the analysis of ML methods applied to DM problems. This tool relieves researchers of much
technical work and allows them to focus on the analysis of their new learning models
in comparison with the existing ones. Moreover, the tool enables researchers with
little knowledge of evolutionary computation methods to apply evolutionary learning
algorithms to their work.
We have shown the main features of this software tool, distinguishing three main parts: a module for data management, a module for designing experiments with evolutionary learning algorithms, and a module with educational goals. We have also shown some case studies to illustrate the functionalities and the experiment set-up process.
Apart from the presentation of the main software tool, three other complementary aspects of KEEL have also been described:
• KEEL-dataset, a data set repository that includes the data set partitions in the
KEEL format and shows some results obtained in these data sets. This repository
can free researchers from merely “technical work” and facilitate the comparison
of their models with the existing ones.
• Some basic guidelines that the developer may take into account to facilitate the
implementation and integration of new approaches within the KEEL software tool.
We have shown the simplicity of adding a simple algorithm (SGERD in this case)
into the KEEL software with the aid of a Java template specifically designed for
this purpose. In this manner, the developer only has to focus on the inner functions
of their algorithm itself and not on the specific requirements of the KEEL tool.
• A module of statistical procedures which lets researchers contrast the results obtained in any experimental study using statistical tests. This task, which may not be trivial, has become necessary to confirm whether a newly proposed method offers a significant improvement over the existing methods for a given problem.
References
1. Han, J., Kamber, M., Pei, J.: Data mining: Concepts and techniques, second edition (The
Morgan Kaufmann series in data management systems). Morgan Kaufmann, San Francisco
(2006)
2. Witten, I.H., Frank, E.: Data mining: practical machine learning tools and techniques, second
edition (Morgan Kaufmann series in data management systems). Morgan Kaufmann Publishers
Inc., San Francisco (2005)
3. Demšar, J., Curk, T., Erjavec, A., Gorup, Črt, Hočevar, T., Milutinovič, M., Možina, M., Pola-
jnar, M., Toplak, M., Starič, A., Štajdohar, M., Umek, L., Žagar, L., Žbontar, J., Žitnik, M.,
Zupan, B.: Orange: Data mining toolbox in python. J. Mach. Learn. Res. 14, 2349–2353 (2013)
4. Abeel, T., de Peer, Y.V., Saeys, Y.: Java-ML: A machine learning library. J. Mach. Learn. Res.
10, 931–934 (2009)
5. Hofmann, M., Klinkenberg, R.: RapidMiner: Data mining use cases and business analytics
applications. Chapman and Hall/CRC, Florida (2013)
6. Williams, G.J.: Data mining with rattle and R: The art of excavating data for knowledge
discovery. Use R!. Springer, New York (2011)
7. Sonnenburg, S., Braun, M., Ong, C., Bengio, S., Bottou, L., Holmes, G., LeCun, Y., Müller,
K.R., Pereira, F., Rasmussen, C., Rätsch, G., Schölkopf, B., Smola, A., Vincent, P., Weston, J.,
Williamson, R.: The need for open source software in machine learning. J. Mach. Learn. Res.
8, 2443–2466 (2007)
8. Alcalá-Fdez, J., Sánchez, L., García, S., del Jesus, M., Ventura, S., Garrell, J., Otero, J., Romero,
C., Bacardit, J., Rivas, V., Fernández, J., Herrera, F.: KEEL: A software tool to assess evolu-
tionary algorithms to data mining problems. Soft Comput. 13(3), 307–318 (2009)
9. Derrac, J., García, S., Herrera, F.: A survey on evolutionary instance selection and generation.
Int. J. Appl. Metaheuristic Comput. 1(1), 60–92 (2010)
10. Kudo, M., Sklansky, J.: Comparison of algorithms that select features for pattern classifiers.
Pattern Recognit. 33(1), 25–41 (2000)
11. Quinlan, J.R.: C4.5: programs for machine learning. Morgan Kaufmann Publishers, San Fran-
cisco (1993)
12. Schölkopf, B., Smola, A.J.: Learning with kernels : support vector machines, regularization,
optimization, and beyond. Adaptive computation and machine learning. MIT Press, Cambridge
(2002)
13. Frenay, B., Verleysen, M.: Classification in the presence of label noise: A survey. Neural Netw.
Learn. Syst., IEEE Trans. 25(5), 845–869 (2014)
14. Garcia, E.K., Feldman, S., Gupta, M.R., Srivastava, S.: Completely lazy learning. IEEE Trans.
Knowl. Data Eng. 22(9), 1274–1285 (2010)
15. Alcalá, R., Alcalá-Fdez, J., Casillas, J., Cordón, O., Herrera, F.: Hybrid learning models to get
the interpretability-accuracy trade-off in fuzzy modeling. Soft Comput. 10(9), 717–734 (2006)
16. Rivas, A.J.R., Rojas, I., Ortega, J., del Jesús, M.J.: A new hybrid methodology for cooperative-
coevolutionary optimization of radial basis function networks. Soft Comput. 11(7), 655–668
(2007)
17. Bernadó-Mansilla, E., Ho, T.K.: Domain of competence of XCS classifier system in complexity
measurement space. IEEE Trans. Evol. Comput. 9(1), 82–104 (2005)
18. Ventura, S., Romero, C., Zafra, A., Delgado, J.A., Hervas, C.: Jclec: A java framework for
evolutionary computation. Soft Comput. 12(4), 381–392 (2007)
19. Pyle, D.: Data preparation for data mining. Morgan Kaufmann Publishers Inc., San Francisco
(1999)
20. Zhang, S., Zhang, C., Yang, Q.: Data preparation for data mining. Appl. Artif. Intel. 17(5–6),
375–381 (2003)
21. Luke, S., Panait, L., Balan, G., Paus, S., Skolicki, Z., Bassett, J., Hubley, R., Chircop, A.: ECJ:
A Java based evolutionary computation research system. http://cs.gmu.edu/eclab/projects/ecj
22. Meyer, M., Hufschlag, K.: A generic approach to an object-oriented learning classifier system
library. J. Artif. Soc. Soc. Simul. 9(3) (2006) http://jasss.soc.surrey.ac.uk/9/3/9.html
23. Llorá, X.: E2k: Evolution to knowledge. SIGEVOlution 1(3), 10–17 (2006)
24. Kohavi, R.: A study of cross-validation and bootstrap for accuracy estimation and model
selection. In: Proceedings of the 14th International Joint Conference on Artificial Intelligence.
IJCAI’95, vol. 2, pp. 1137–1143. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA
(1995)
25. Dietterich, T.G.: Approximate statistical tests for comparing supervised classification learning
algorithms. Neural Comput. 10(7), 1895–1923 (1998)
26. Ortega, M., Bravo, J. (eds.): Computers and education in the 21st century. Kluwer, Dordrecht
(2000)
27. Mierswa, I., Wurst, M., Klinkenberg, R., Scholz, M., Euler, T.: Yale: Rapid prototyping for
complex data mining tasks. In: Ungar, L., Craven, M., Gunopulos, D., Eliassi-Rad, T. (eds.)
KDD ’06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, pp. 935–940. NY, USA, New York (2006)
28. Rakotomalala, R.: Tanagra : un logiciel gratuit pour l'enseignement et la recherche. In: Pinson, S., Vincent, N. (eds.) EGC, Revue des Nouvelles Technologies de l'Information, pp. 697–702. Cépaduès-Éditions (2005)
29. Batista, G.E.A.P.A., Prati, R.C., Monard, M.C.: A study of the behaviour of several methods
for balancing machine learning training data. SIGKDD Explor. 6(1), 20–29 (2004)
30. He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21(9),
1263–1284 (2009)
31. Sun, Y., Wong, A.K.C., Kamel, M.S.: Classification of imbalanced data: A review. Int. J. Pattern
Recognit. Artif. Intel. 23(4), 687–719 (2009)
32. Dietterich, T., Lathrop, R., Lozano-Perez, T.: Solving the multiple instance problem with axis-
parallel rectangles. Artif. Intell. 89(1–2), 31–71 (1997)
33. Sánchez, L., Couso, I.: Advocating the use of imprecisely observed data in genetic fuzzy
systems. IEEE Trans. Fuzzy Syst. 15(4), 551–562 (2007)
34. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res.
7, 1–30 (2006)
35. García, S., Fernández, A., Luengo, J., Herrera, F.: Advanced nonparametric tests for multi-
ple comparisons in the design of experiments in computational intelligence and data mining:
Experimental analysis of power. Inf. Sci. 180(10), 2044–2064 (2010)
36. García, S., Herrera, F.: An extension on statistical comparisons of classifiers over multiple data
sets for all pairwise comparisons. J. Mach. Learn. Res. 9, 2579–2596 (2008)
37. Fisher, R.A.: Statistical methods and scientific inference (2nd edition). Hafner Publishing, New
York (1959)
38. García, S., Fernández, A., Luengo, J., Herrera, F.: A study of statistical techniques and per-
formance measures for genetics-based machine learning: Accuracy and interpretability. Soft
Comput. 13(10), 959–977 (2009)
39. García, S., Molina, D., Lozano, M., Herrera, F.: A study on the use of non-parametric tests
for analyzing the evolutionary algorithms’ behaviour: A case study on the CEC 2005 special
session on real parameter optimization. J. Heuristics 15, 617–644 (2009)
40. Luengo, J., García, S., Herrera, F.: A study on the use of statistical tests for experimentation
with neural networks: Analysis of parametric test conditions and non-parametric tests. Expert
Syst. with Appl. 36, 7798–7808 (2009)
41. Cox, D., Hinkley, D.: Theoretical statistics. Chapman and Hall, London (1974)
42. Snedecor, G.W., Cochran, W.G.: Statistical methods. Iowa State University Press, Ames (1989)
43. Shapiro, S.S., Wilk, M.B.: An analysis of variance test for normality (complete samples). Biometrika 52(3–4), 591–611 (1965)
44. Mann, H.B., Whitney, D.R.: On a test of whether one of two random variables is stochastically
larger than the other. Ann. Math. Stat 18, 50–60 (1947)
45. Wilcoxon, F.: Individual comparisons by ranking methods. Biometrics 1, 80–83 (1945)
46. Friedman, M.: The use of ranks to avoid the assumption of normality implicit in the analysis
of variance. J. Am. Stat. Assoc. 32(200), 675–701 (1937)
47. Iman, R., Davenport, J.: Approximations of the critical region of the Friedman statistic. Com-
mun. Stat. 9, 571–595 (1980)
48. Sheskin, D.: Handbook of parametric and nonparametric statistical procedures. Chapman and
Hall/CRC, Boca Raton (2006)
49. Holm, S.: A simple sequentially rejective multiple test procedure. Scand. J. Stat. 6, 65–70
(1979)
50. Hochberg, Y.: A sharper bonferroni procedure for multiple tests of significance. Biometrika
75, 800–803 (1988)
51. Nemenyi, P.B.: Distribution-free multiple comparisons. Ph.D. thesis, Princeton University (1963)
52. Bergmann, G., Hommel, G.: Improvements of general multiple test procedures for redundant
systems of hypotheses. In: Bauer, G.H.P., Sonnemann, E. (eds.) Multiple hypotheses testing,
pp. 100–115. Springer, Berlin (1988)
53. Parpinelli, R., Lopes, H., Freitas, A.: Data mining with an ant colony optimization algorithm.
IEEE Trans. Evol. Comput. 6(4), 321–332 (2002)
54. Tan, K.C., Yu, Q., Ang, J.H.: A coevolutionary algorithm for rules discovery in data mining.
Int. J. Syst. Sci. 37(12), 835–864 (2006)
55. Aguilar-Ruiz, J.S., Giráldez, R., Riquelme, J.C.: Natural encoding for evolutionary supervised
learning. IEEE Trans. Evol. Comput. 11(4), 466–479 (2007)
56. Mansoori, E., Zolghadri, M., Katebi, S.: SGERD: A steady-state genetic algorithm for extract-
ing fuzzy classification rules from data. IEEE Trans. Fuzzy Syst. 16(4), 1061–1071 (2008)
57. Gray, J.B., Fan, G.: Classification tree analysis using TARGET. Comput. Stat. Data Anal. 52(3),
1362–1372 (2008)