A Comparative Study Between Feature Selection Algorithms
1 Introduction
Usually, databases contain millions of tuples and thousands of attributes, presenting
dependencies among attributes [1]. The essential purpose of data preprocessing is to
manipulate and transform each dataset, making the information contained within it
more accessible and coherent [2, 3]. Data preprocessing involves choosing an outcome
measure to evaluate, identifying potential influencer variables, cleansing the data,
creating features, and generating data sets to feed into automated
analysis. Data preprocessing is an important step in the knowledge discovery process,
because quality decisions must be based on quality data. The main tasks of data
preprocessing are: data cleaning (including the handling of missing values and noisy
data), data integration, data transformation and feature selection [37].
A feature selection algorithm is a computational solution that is motivated by a
certain definition of relevance. However, the relevance of a feature as seen from the
inductive learning perspective may have several definitions depending on the objective
being pursued. An irrelevant feature is not useful for induction, but not all relevant
features are necessarily useful for induction [34]. Existing selection algorithms focus
mainly on finding relevant features [4]. Feature selection is thus a process in which the
most relevant characteristics are selected, improving knowledge discovery in databases.
This paper is structured as follows: problem, methodology used, feature selection
algorithms, results, and conclusions.
2 Problem
Feature selection, applied as a data preprocessing stage to data mining, proves to be
valuable in that it eliminates the irrelevant features that make algorithms ineffective.
Sometimes the percentage of instances correctly classified is higher if a previous feature
selection is applied, since the data to be mined will be noise-free [5]. This is usually
attributed to the “curse of dimensionality” or to the fact that irrelevant features decrease
the signal to noise ratio. In addition, many algorithms become computationally
intractable when the dimensionality is high [33].
The feature selection task is divided into four stages [6]: the first determines the
candidate set of attributes used to represent the problem; the subset of attributes
generated in the first stage is then evaluated; subsequently, it is examined whether the
selected subset satisfies the stopping criterion of the search; finally, the selected subset
is validated. Feature selection processes can be classified differently depending on the
stage on which we focus, yielding three categories [7]: filters, wrappers [8, 9] and
hybrids [10].
In filter methods, the selection procedure is performed independently of the
evaluation function of the learning algorithm. Four different evaluation measures can
be distinguished: distance, information, dependence and consistency; respective
examples of each of these measures can be found in [11 - 13]. Wrapper methods
combine the search in the attribute space with the machine learning algorithm, which
evaluates each set of attributes and chooses the most appropriate one. Hybrid models
combine the advantages that filter and wrapper models provide. Since feature selection
is applicable to dissimilar real situations, it is difficult to reach a consensus as to which
is the best possible choice; this is why multiple algorithms of this type exist [14].
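As an illustration of the filter/wrapper distinction (not taken from the paper, which relies on tools such as WEKA and Orange elsewhere), the following minimal Python sketch contrasts a filter and a wrapper selector using scikit-learn on synthetic data; the learner, the score function and all parameter values are arbitrary assumptions.

# Minimal sketch (illustrative only): a filter vs. a wrapper selector, assuming scikit-learn.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=0)

# Filter: scores each feature independently of any learner (an information measure).
filter_sel = SelectKBest(score_func=mutual_info_classif, k=5).fit(X, y)
print("Filter keeps features:", sorted(filter_sel.get_support(indices=True)))

# Wrapper: searches feature subsets guided by a learning algorithm's performance.
wrapper_sel = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)
print("Wrapper keeps features:", sorted(wrapper_sel.get_support(indices=True)))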
3 Methodology
This section describes the type of scientific research used in the article along with the
research method and the development methodology. The type of scientific research
applied in this paper is descriptive-exploratory with an experimental approach.
According to the formal research process, a hypothetical-deductive method was used:
a hypothesis was formulated and then validated empirically through deductive
reasoning. Based on the experimentation, a mechanism for weighting the algorithm
evaluation indicators was established, in such a way that this mechanism could be
evaluated when changing the dataset.
The following tasks are defined to obtain the results, after applying the appropriate
algorithms in feature selection:
Collection, integration, and data preprocessing. This phase covers the collection and
integration of different datasets, data transformation where required, and cleaning in
order to eliminate noise.
Definition and application of tests of the algorithms used for feature selection.
Based on the tests performed with the synthetic data, the algorithms were applied and
their evaluation indicators were computed.
Review of test results. The data obtained by applying the algorithms were analyzed.
Likewise, the complexity of the algorithms was calculated in order to determine the
feasibility of their implementation.
The development of this paper is based on the analysis and choice of algorithms for
feature selection. In this stage, the case studies, the conclusions of this research and the
machine learning algorithms used in feature selection are of vital importance.
4 Feature selection algorithms
4.1 Decision trees
A decision tree is a representation in which each set of possible conclusions is
implicitly established by a list of samples of known class [15]. A decision tree has a
simple form that efficiently classifies new data [16, 17].
These trees are considered as an important tool for data mining; compared to other
algorithms, decision trees are faster and more accurate [18]. Learning with decision
trees is a method to approximate a discrete-valued objective function, in which the
learned function is represented by a tree. These learning methods are among the most
popular inductive inference algorithms and they have thriving applications in various
machine learning tasks [19 - 21]. Information theory provides a mathematical model
(equation 1) to measure the total disorder in a database.
$is_{Aver} = \sum_{b=1} \frac{n_b}{n_t} \cdot \sum_{c=1} \left( -\frac{n_{bc}}{n_b} \log\left(\frac{n_{bc}}{n_b}\right) \right)$   (1)
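As a minimal illustration of equation 1 as reconstructed above, the following Python sketch computes the average information of a split from per-branch class counts; the interpretation of n_t, n_b and n_bc as the total number of samples, the samples in branch b and the samples of class c in branch b is an assumption, as is the base-2 logarithm.

import math

def average_information(branch_class_counts):
    """Average disorder after a split (equation 1, as reconstructed):
    sum over branches b of (n_b / n_t) times the class entropy inside branch b."""
    n_t = sum(sum(counts) for counts in branch_class_counts)
    is_aver = 0.0
    for counts in branch_class_counts:        # one entry per branch b
        n_b = sum(counts)
        for n_bc in counts:                   # one entry per class c
            if n_bc:
                is_aver += (n_b / n_t) * (-(n_bc / n_b) * math.log2(n_bc / n_b))
    return is_aver

# Example: a binary split producing branches with class counts [8, 2] and [3, 7].
print(average_information([[8, 2], [3, 7]]))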
4.2 Entropy measure for ranking features
This algorithm ranks features by measuring how much the entropy of a dataset changes
when a feature is removed [33]. The similarity between two numeric instances Xi and
Xj is computed as
$S_{ij} = e^{-\alpha D_{ij}}$   (2)
where Dij is the distance between the instances Xi and Xj, and α is a parameter
expressed in mathematical terms (equation 3).
$\alpha = \frac{-\ln(0.5)}{D}$   (3)
D is the average distance between the samples in the dataset; in practice, α is close to
0.5. The Euclidean distance is calculated as follows (equation 4).
$D_{ij} = \sqrt{\sum_{k=1}^{n} \left( \frac{X_{ik} - X_{jk}}{\max_k - \min_k} \right)^2}$   (4)
Where n is the number of attributes, and maxk and mink are the maximum and
minimum values used to normalize the k-th attribute. When the attributes are
categorical, the Hamming distance is used to compute the similarity (equation 5).
$S_{ij} = \frac{\sum_{k=1}^{n} |X_{ik} = X_{jk}|}{n}$   (5)
Where |Xik = Xjk| is 1 if Xik = Xjk and 0 otherwise. The distribution of all similarities
for a given data set is a characteristic of the organization and order of the data in an
n-dimensional space. This organization may be more or less ordered. Changes in the
level of order in a data set are the main criterion for inclusion or exclusion of a feature
from the feature set; these changes may be measured by entropy [33]. The algorithm in
question compares the entropy of a data set before and after deleting attributes. For a
data set of N instances, the entropy measure is (equation 6).
$E = -\sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \left( S_{ij} \log S_{ij} + (1 - S_{ij}) \log(1 - S_{ij}) \right)$   (6)
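A minimal Python sketch of equations 4 - 6 follows; the use of the natural logarithm and the convention that pairs with similarity exactly 0 or 1 contribute zero to the entropy are assumptions.

import numpy as np

def euclidean_distance(Xi, Xj, mins, maxs):
    """Normalized Euclidean distance between two numeric instances (equation 4)."""
    return np.sqrt(np.sum(((Xi - Xj) / (maxs - mins)) ** 2))

def hamming_similarity(Xi, Xj):
    """Fraction of matching categorical attributes (equation 5)."""
    return np.mean(Xi == Xj)

def entropy_measure(S):
    """Entropy of a pairwise similarity matrix S (equation 6), summed over i < j."""
    E = 0.0
    n = len(S)
    for i in range(n - 1):
        for j in range(i + 1, n):
            s = S[i, j]
            if 0.0 < s < 1.0:                 # terms with s in {0, 1} contribute 0
                E -= s * np.log(s) + (1.0 - s) * np.log(1.0 - s)
    return E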
The steps of the algorithm (Table 1) are based on sequential backward ranking, and
they have been successfully tested on several real-world applications.
Table 1. Algorithm steps [33]
1: Start with the initial full set of features F.
2: For each feature f ∈ F, remove the feature f from F and obtain a subset F_f. Find the
difference between the entropy for F and the entropy for each F_f.
3: Let f_k be the feature such that the difference between the entropy for F and the entropy
for F_fk is minimum.
4: Update the set of features: F = F − {f_k}, where "−" is the set-difference operation.
5: Repeat steps 2 - 4 until there is only one feature in F.
The ranking process may be stopped at any iteration and transformed into a feature
selection process by using an additional criterion in step 4: the difference between the
entropy for F and the entropy for F_f should be less than an approved threshold value
in order to remove feature f_k from the set F. High computational complexity is the
basic disadvantage of this algorithm; a parallel implementation could overcome the
problems of working sequentially with large data sets and large numbers of features.
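The following self-contained Python sketch implements the sequential backward ranking of Table 1 for categorical data, using the Hamming similarity of equation 5; taking the absolute value of the entropy difference is an assumption.

import numpy as np

def dataset_entropy(X):
    """Entropy of a categorical dataset X (rows = instances), per equations 5 and 6."""
    n = len(X)
    E = 0.0
    for i in range(n - 1):
        for j in range(i + 1, n):
            s = np.mean(X[i] == X[j])         # Hamming similarity (equation 5)
            if 0.0 < s < 1.0:
                E -= s * np.log(s) + (1.0 - s) * np.log(1.0 - s)
    return E

def entropy_ranking(X):
    """Sequential backward ranking (Table 1): repeatedly drop the feature whose
    removal changes the entropy the least; return features in removal order."""
    remaining = list(range(X.shape[1]))
    ranking = []
    while len(remaining) > 1:
        E_full = dataset_entropy(X[:, remaining])
        diffs = []
        for f in remaining:
            subset = [g for g in remaining if g != f]
            diffs.append(abs(E_full - dataset_entropy(X[:, subset])))
        k = remaining[int(np.argmin(diffs))]  # least relevant remaining feature
        ranking.append(k)
        remaining.remove(k)
    return ranking + remaining                # last element is the most relevant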
4.3 Estimation of distribution algorithms (EDAs)
An EDA is a population-based stochastic search technique that uses a probability
distribution model to explore candidate solutions (instances) in a search space [25, 26].
EDAs have been recognized as strong optimization algorithms. They have shown better
performance than evolutionary algorithms on problems where the latter have not
produced satisfactory results. This is mainly because the relations or dependencies
between the most important variables of a particular problem are modeled explicitly,
estimated through probability distributions [27, 28]. Table 2 shows the EDA algorithm.
First, an initial population of individuals is generated. These individuals are evaluated
according to an objective or fitness function, which measures how appropriate each
individual is as a solution to the problem. Based on this evaluation, a subset of the best
individuals is selected, and from this subset a probability distribution is learned and
used to sample another population [27].
Table 2. EDA Algorithm pseudo-code [29]
Requires: candidate pool size n, the number of variables l and the cost function f(·).
1: θ ← initialize (l)
2: repeat
3: D ← sample (P(X; θ), n)
4: C ← select (D, f(·))
5: θ ← estimate (θ, C)
6: until P(X; θ) has converged
Outputs: Probability distribution P(X; θ)
This kind of algorithm is a stochastic search technique that evolves a probability
distribution model from a pool of solution candidates, rather than evolving the pool
itself. The distribution is adjusted iteratively with the most promising (sub-optimal)
solutions until convergence; hence, these methods are known as Estimation of
Distribution Algorithms. The generic estimation procedure is shown in Table 2.
Step (1) initializes the model parameters θ. Step (2) is the loop that updates the
parameters θ until convergence. Step (3) samples a pool D of n candidates from the
model. Step (4) ranks the pool according to the cost function f(·) and chooses the top-
ranked candidates into C. Step (5) re-estimates the parameters θ from this subset of
promising solutions [32]. An EDA [30, 31] approximately optimizes a cost function by
building a probabilistic model of a pool of promising sub-optimal solutions over a given
search space. For very high-dimensional search spaces, storing and updating a large
population of candidates may imply a computational burden in both time and memory.
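As an illustration of the loop in Table 2, the following minimal univariate EDA sketch in Python solves OneMax; it is not the Goldenberry implementation used later in Section 5.3, and the smoothing factor and parameter values are assumptions.

import numpy as np

def onemax(bits):
    return bits.sum(axis=1)                        # cost to maximize: number of ones

def univariate_eda(l=100, n=30, selected=10, iters=200, seed=0):
    """Univariate EDA following the loop in Table 2: sample from P(X; theta),
    select the best candidates, and re-estimate theta from their bitwise frequencies."""
    rng = np.random.default_rng(seed)
    theta = np.full(l, 0.5)                        # step 1: initialize model parameters
    best, best_cost = None, -1
    for _ in range(iters):                         # step 2: repeat until the budget is spent
        D = (rng.random((n, l)) < theta).astype(int)    # step 3: sample n candidates
        costs = onemax(D)
        top = D[np.argsort(costs)[::-1][:selected]]     # step 4: select the best candidates
        theta = 0.9 * theta + 0.1 * top.mean(axis=0)    # step 5: re-estimate (smoothed)
        if costs.max() > best_cost:
            best_cost, best = costs.max(), D[np.argmax(costs)]
    return best, best_cost

solution, cost = univariate_eda()
print("Best OneMax cost found:", cost)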
4.4 Bootstrapping algorithm
One option for suitable feature selection is to perform an exhaustive search, but this
entails a high complexity that makes it impractical. Feature selection methods must
search among the candidate subsets of attributes; this search may be complete,
sequential or random. The complete search, also known as exhaustive search, finds the
optimal result according to the evaluation criteria used; the problem is its high
complexity (i.e. O(2^n)), which makes it inappropriate when the number of attributes (n)
is high. The random search does not ensure an optimal result; it starts with a subset of
randomly selected attributes and proceeds from it, either with a sequential search or by
randomly generating the rest of the candidate subsets.
The bootstrapping algorithm replicates the classification experiments a large number
of times and estimates the solution from the set of these experiments. Because this
process is based on random selection, some data from the original set may not be used
while other data may be involved in more than one subset. The bootstrapping algorithm
is divided into four stages: generate subsets of attributes, evaluate each subset, update
the weights of each attribute, and order the attributes by their weight [35].
In the first stage, the subsets of attributes are generated; each subset contains up to a
maximum number of attributes selected randomly from the original dataset (an attribute
cannot appear twice in the same subset). Each subset is generated independently of the
previous ones (so similar subsets may occur). Both the number of generated subsets and
the maximum number of attributes they contain are established by the user.
The evaluation phase consists of classifying the original dataset with each of the
subsets of attributes generated in the previous stage. As a result, each subset is assigned
a goodness value, which is the success percentage of the corresponding classification.
In the update stage, the weights of the attributes of the original data set are assigned.
The weight of an attribute is the average goodness of the subsets that contain that
attribute.
The last step is to sort the attributes in descending order according to their weight.
This phase generates a list, which is the ranking returned by the process.
The difference between the bootstrapping algorithm and the exhaustive method lies
in the first stage. For example, the exhaustive method with K=1 generates a subset for
each feature of the original set; with K=2, a subset is generated for each possible pair
of features. Therefore, this first step of the exhaustive method depends on the value of
K, whereas in the random method it depends on the number of experiments and the
maximum number of features. Table 3 shows the bootstrapping algorithm, which is
divided into two functions. The first, RankingGenerate, is the main function and has as
input parameters: the classification method U used to evaluate each chosen subset of
features; the set of features to be treated X; the number of experiments to be performed
Ne; and the maximum number of features Na that may intervene in each experiment.
This function produces a single output parameter L, which corresponds to the feature
ranking obtained by applying the feature selection algorithm. For this purpose, the
RankingGenerate function sorts the attributes according to their weight. To calculate
this value, for each feature ai, the average of the success percentages obtained by
applying the classification method U on those subsets that contain the feature ai is
computed. The second function, SubsetGenerate, is used to obtain these subsets of
attributes; each subset is chosen randomly with an equally random size (greater than 1
and less than or equal to Na), so that there is no duplicated attribute in the same subset.
Table 3. Bootstrapping algorithm pseudo-code
Requires: Evaluation criterion U, features X, number of experiments Ne, maximum number of
features per subset Na, feature list L.
Function RankingGenerate
1: S ← SubsetGenerate (X, Ne, Na)
2: for each feature subset Si ∈ S
3: Ci ← Evaluate (Si, U)
4: Update the weight of each feature in Si
End for
5: L ← Sort (X)
End RankingGenerate Function
Function SubsetGenerate
1: for i ← 1 until Ne
2: n ← GenerateRandomNumber (1, Na)
3: Si ← ChooseFeatures (X, n)
4: S ← S + Si
End for
End SubsetGenerate Function
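A minimal Python sketch of the ranking procedure in Table 3 follows, assuming scikit-learn for the evaluation step; the choice of a decision tree classifier with 3-fold cross-validation as the evaluation criterion U, and the default values of Ne and Na, are assumptions rather than the paper's setting.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def subset_generate(n_features, Ne, Na, rng):
    """SubsetGenerate: Ne random subsets, each of a random size in [1, Na],
    with no duplicated feature inside a subset."""
    subsets = []
    for _ in range(Ne):
        size = rng.integers(1, Na + 1)
        subsets.append(rng.choice(n_features, size=size, replace=False))
    return subsets

def ranking_generate(X, y, Ne=200, Na=5, seed=0):
    """RankingGenerate: evaluate each subset with a classifier U and weight every
    feature by the average accuracy of the subsets that contain it."""
    rng = np.random.default_rng(seed)
    subsets = subset_generate(X.shape[1], Ne, Na, rng)
    scores = {f: [] for f in range(X.shape[1])}
    for S in subsets:
        acc = cross_val_score(DecisionTreeClassifier(random_state=0),
                              X[:, S], y, cv=3).mean()      # goodness of the subset
        for f in S:
            scores[f].append(acc)
    weights = {f: np.mean(v) if v else 0.0 for f, v in scores.items()}
    return sorted(weights, key=weights.get, reverse=True)   # descending feature ranking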
In [36], motivated by the fact that the variability introduced by randomization can
lead to inaccurate outputs, the authors propose a deterministic approach. First, they
establish several computational complexity results for the exact bootstrap method in
the case of the sample mean. Second, they present the first efficient deterministic
approximation algorithm (an FPTAS) for producing exact bootstrap confidence
intervals which, unlike traditional methods, has guaranteed bounds on the
approximation error. Third, they develop a simple exact algorithm for exact bootstrap
confidence intervals based on polynomial multiplication. Last, they provide empirical
evidence, involving several hundred (and in some cases over one thousand) data points,
that the proposed deterministic algorithms can quickly produce confidence intervals
that are substantially more accurate than those from randomized methods, and are thus
practical alternatives in applications such as clinical trials.
5 Results
5.1 Decision trees
The two datasets evaluated, Soybean and Chess, show totally different behaviors:
Soybean dataset: it shows a behavior in which only a low percentage of features is
selected; even at the highest selection percentage, the number of selected attributes does
not exceed 70%. This problem has been called the curse of dimensionality.
Chess dataset: the behavior of these data is the opposite of the Soybean dataset, since
the selection covers the largest part of the attribute space. This means that most
attributes are relevant and only begin to be removed at a high feature selection
percentage.
5.2 Entropy measure for ranking features
To test this algorithm, we used a dataset of four features (X1, X2, X3 and X4) with
one thousand instances. The features contain categorical data; therefore, we used the
Hamming distance to compute the similarity between the instances (equation 5) and
then computed the entropy (equation 6). The result shows that feature X3 is the least
relevant, since the difference between the total entropy and the entropy without the
third feature is the one closest to 0. Therefore, X3 should be removed from the dataset.
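The dataset used in this experiment is not included in the paper, so the following sketch builds a hypothetical categorical dataset of four features and one thousand instances in which X3 is pure noise, and applies equations 5 and 6 to check which feature's removal changes the entropy the least.

import numpy as np

def dataset_entropy(X):
    """Equation 6 over the Hamming similarities (equation 5) of all instance pairs."""
    S = (X[:, None, :] == X[None, :, :]).mean(axis=2)      # pairwise similarity matrix
    s = S[np.triu_indices(len(X), k=1)]
    s = s[(s > 0.0) & (s < 1.0)]                           # s in {0, 1} contributes 0
    return -np.sum(s * np.log(s) + (1 - s) * np.log(1 - s))

# Hypothetical stand-in for the paper's dataset: X1, X2 and X4 share structure through
# a hidden class, while X3 is pure noise and is expected to rank as least relevant.
rng = np.random.default_rng(0)
cls = rng.integers(0, 3, size=1000)
X = np.column_stack([cls, (cls + rng.integers(0, 2, 1000)) % 3,
                     rng.integers(0, 3, 1000), cls % 2])

E_full = dataset_entropy(X)
diffs = {f"X{f + 1}": abs(E_full - dataset_entropy(np.delete(X, f, axis=1)))
         for f in range(X.shape[1])}
print("Entropy differences:", diffs)
print("Least relevant feature:", min(diffs, key=diffs.get))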
5.3 Estimation of Distribution Algorithms (EDAs)
We solved the classical OneMax problem with 100 variables, using a population size
of 30, a maximum of 1000 evaluations, and 30 candidates per iteration. For the
execution of the EDA algorithm we used the Orange suite with the Goldenberry widget
[38]. The results are shown in Table 4. OneMax is a traditional test problem for
evolutionary computation: it involves binary bitstrings of fixed length, an initial
population is given, and the objective is to evolve the bitstrings to match a prespecified
bitstring [39].
Table 4. EDA applied result
Best:{1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,1,1,1,
1,0,0,0,0,1,1,0,1,1,1,1,1,1,1,0,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,
1,1,1,1,1}.
Cost: 89, Evals: 464, Argmin: 22, Argmax: 459, Min val: 44, Max val: 89, Mean:
68.4590517241, Stdev: 10.5445710397
6 Conclusions
Decision trees used as a feature selection algorithm are a good option. However, it
must be taken into account that the data set must be categorical, that the method only
applies to predictive problems, which limits the field of application, and that if the
dataset is incomplete the selection is not considered good.
Feature selection based on the entropy measure for ranking features is applicable only
to descriptive tasks, limiting the field of application just as decision trees do. Its main
weakness lies in its high complexity, since it compares every pair of instances. As for
EDAs, they enjoy a good reputation among feature selection algorithms; however, they
have some weaknesses, such as redundancy in generating the dependency trees, the fact
that open-source implementations have only been produced in interpreted languages,
and the limited evidence of their use with multivariate data.
The randomness used by the bootstrapping algorithm for feature selection implies
that the algorithm may present unexpected results at some point, in contrast to its speed
of execution. Ideally, algorithms that indicate whether the generated result is acceptable
should be used. It is emphasized that the bootstrapping algorithm behaves well with
large volumes of data.
References
1. D. Larose, Data Mining: Methods and Models. (USA), Wiley-Interscience (2006) 1- 3.
2. D. Pyle, Data Preparation for Data Mining. (USA), Morgan Kaufmann Publisher (1999) 15
- 19.
3. P. Bradley, O. Mangasarian, Feature selection via concave minimization and support vector
machine. (USA), Journal Machine learning ICML (1998) 82 - 90.
4. Y. Lei, L. Huan, Efficient Feature Selection via Analysis of Relevance and Redundancy.
(USA), Journal Machine Learning Research 5 (2004) 1205 - 1224.
5. I. Guyon, A. Elissee, An introduction to variable and feature selection. (USA), Journal
Machine learning research 3 (2003) 1157 - 1182.
6. M. Dash, H. Liu, Feature Selection for Classification. (USA), Journal Intelligent Data
Analysis 1 (3) (1996) 131 - 156.
7. H. Liu, Y. Lei, Toward Integrating Feature Selection Algorithms for Classification and
Clustering. (USA), IEEE Trans. on Knowledge and Data Engineering 17 (4) (2005) 491 -
502.
8. R. Kohavi, G. John, Wrappers for Feature Subset Selection. (USA), Artificial Intelligence 97
(12) (1997) 273 - 324.
9. G. Jennifer, Feature Selection for Unsupervised Learning. (USA), J. Mach. Learn. Res. 5
(2004) 845 - 889.
10. S. Das, Filters, Wrappers and a Boosting-Based Hybrid for Feature Selection. (USA), Proc.
18th Intl Conf. Machine Learning (2001) 74 - 81.
11. C. Cardie, Using Decision Trees to Improve Case-Based Learning. (USA), Proc. 10th Intll
Conf. Machine Learning, P. Utgo, ed. (1993) 25 - 32.
12. A. Mucciardi, E. Gose, A comparison of Seven Techniques for Choosing Subsets of Pattern
Recognition. (USA), IEEE Trans. Computer 20 (1971) 1023 - 1031.
13. R. Ruiz, J. Riquelme, J. Aguilar-Ruiz, Projection-based measure for efficient feature
selection. (USA), Journal of Intelligent and Fuzzy System 12 (2003) 175 - 183.
14. I. Pérez, R. Sánchez, Adaptación del método de reducción no lineal LLE para la selección
de atributos en WEKA. (Cuba), III Conferencia Internacional en Ciencias Computacionales
e Informáticas (2016) 1 - 7.
15. P. Winston, Inteligencia artificial. (USA), Addison Wesley (1994) 455 - 460.
16. S. Chourasia, Survey paper on improved methods of ID3 decision tree classification.
(USA), International Journal of Scientific and Research Publications (2013) 1 - 4.
17. J. Rodríguez, Fundamentos de minería de datos. (Colombia), Fondo de publicaciones de la
Universidad Distrital Francisco José de Caldas (2010) 63 - 64.
18. R. Changala, A. Gummadi, G. Yedukondalu, U. N. P. G. Raju, Classification by decision
tree induction algorithm to learn decision trees from the class-labeled training tuples.
(USA), International Journal of Advanced Research in Computer Science and Software
Engineering 2 (4) (2012) 427 - 434.
19. T. Michell, Machine learning. (USA), McGraw Hill (1997) 50 - 56.
20. J. Han, M. Kamber. Data mining: concepts and techniques. (USA), McGraw Hill 3 (2012)
331 - 336.
21. S. Russell, P. Norvig, Artificial Intelligence: A Modern Approach. (USA), Prentice Hall
(2012) 531 - 540.
22. M. Kantardzic. Data mining: concepts, models, methods and algorithms. (USA), IEEE
Press Wiley-Interscience (2003) 46 - 48.
23. H. Liu, H. Motoda, Feature selection for knowledge discovery and data mining. (USA),
Kluwer Academic Publisher 2 (2000) 30 - 35.
24. H. Liu, H. Motoda, Feature extraction, construction and selection. A data mining
perspective. (USA), Kluwer Academic Publisher (2000) 20 - 28.
25. P. Larrañaga, J. Lozano, Estimation of Distribution Algorithms. A New Tool for
Evolutionary Computation. (USA), Kluwer Academic Publishers (2002) 1 - 2.
26. M. Pelikan, K. Sastry, Initial-population bias in the univariate estimation of distribution
algorithm. (USA), Proceedings of the 11th Annual Conference on Genetic and
Evolutionary Computation, GECCO '09 11 (2002) 429 - 436.
27. R. Pérez, A. Hernández, Un algoritmo de estimación de distribuciones para el problema de
secuencia-miento en configuración jobshop. (Mexico), Communication Del CIMAT 1
(2015) 1 - 4.
28. H. Mühlenbein, G. Paaß, From recombination of genes to the estimation of distributions I.
Binary parameters. (USA), Parallel Problem Solving from Nature - PPSN IV (1996) 178 - 187.
29. N. Rodríguez, Feature Relevance Estimation by Evolving Probabilistic Dependency
Networks and Weighted Kernel Machine. (Colombia), A thesis submitted to the District
University Francisco José de Caldas in fulfillment of the requirements for the degree of
Master of Science in Information and Communications (2013) 3 – 4.
30. E. Bengoetxea, P. Larrañaga, I. Bloch, and A. Perchant, Estimation of distribution
algorithms: A new evolutionary computation approach for graph matching problems. In
Energy Minimization Methods in Computer Vision and Pattern Recognition, (Germany),
volume 2134 of Lecture Notes in Computer Science (2001) 454 - 469.
31. M. Pelikan, K. Sastry, and E. Cantú-Paz, Scalable Optimization via Probabilistic Modeling:
From Algorithms to Applications. (USA), Springer-Verlag (2006).
32. N. Rodríguez and S. Rojas–Galeano. Discovering feature relevancy and dependency by
kernel-guided probabilistic model-building evolution. (USA). BioData Mining (2017)
10:12 DOI 10.1186/s13040-017-0131-y
33. M. Kantardzic, Data mining: concepts, models, methods and algorithms. Second edition.
(USA), IEEE Press Wiley-Interscience (2003) 68 - 70.
34. R. A. Caruana and D. Freitag. How useful is Relevance? Technical report, fall’94 AAAI
Symposium on Relevance, New Orleans, (1994).
35. J. Aguilar, and N. Díaz, Selección de atributos relevantes basada en bootstrapping.
(España), Actas del III Taller de Minería de Datos y Aprendizaje (TAMIDA’2005). (2005),
21 – 30.
36. D. Bertsimas and B. Sturt, Computation of exact bootstrap confidence intervals:
complexity and deterministic algorithms. Optimization Online an e-print site for the
optimization community. (2017).
37. H. Barrera, J. Correa and J. Rodríguez, Prototipo de software para el preprocesamiento de
datos - UDClear. IV Simposio Internacional de Sistemas de Información e Ingeniería de
Software en la Sociedad del Conocimiento, libro de actas volumen 1, ISBN 84-690-0258-
9. (2006).
38. N. Rodríguez and S. Rojas, Goldenberry: EDA Visual Programming in Orange. Proceedings
of the fifteenth annual conference companion on Genetic and evolutionary computation
conference companion (GECCO). (2013) pp. 1325 - 1332.
39. D. H. Wood, J. Chen, E. Antipov, B. Lemieux and W. Cedeño, A design for DNA
computation of the OneMax problem. Springer-Verlag, Soft Computing, (2001) Volume
5, Issue 1, 19–24.