Complete Search For Feature Selection in Decision Trees
Abstract
The search space for the feature selection problem in decision tree learning is the lattice of
subsets of the available features. We design an exact enumeration procedure of the subsets
of features that lead to all and only the distinct decision trees built by a greedy top-down
decision tree induction algorithm. The procedure stores, in the worst case, a number of
trees linear in the number of features. By exploiting a further pruning of the search space,
we design a complete procedure for finding δ-acceptable feature subsets, which depart by
at most δ from the best estimated error over any feature subset. Feature subsets with the
best estimated error are called best feature subsets. Our results apply to any error
estimator function, but experiments are mainly conducted under the wrapper model, in which
the misclassification error over a search set is used as an estimator. The approach is also
adapted to the design of a computational optimization of the sequential backward
elimination heuristic, extending its applicability to large dimensional datasets. The procedures
of this paper are implemented in a multi-core data parallel C++ system. We
investigate experimentally the properties and limitations of the procedures on a collection of 20
benchmark datasets, showing that oversearching increases both overfitting and instability.
Keywords: Feature Selection, Decision Trees, Wrapper models, Complete Search
1. Introduction
Feature selection is essential for optimizing the accuracy of classifiers, for reducing the data
collection effort, for enhancing model interpretability, and for speeding up prediction time
(Guyon et al., 2006b). In this paper, we will consider decision tree classifiers DT (S) built on
a subset S of available features. Our results will hold for any top-down tree induction
algorithm that greedily selects a split attribute at every node by maximizing a quality measure.
The well-known C4.5 (Quinlan, 1993) and CART systems (Breiman et al., 1984) belong to
this class of algorithms. Although more advanced learning models achieve better predictive
performance (Caruana and Niculescu-Mizil, 2006; Delgado et al., 2014), decision trees are
worth investigation, because they are building-blocks of the advanced models, e.g., random
forests, or because they represent a good trade-off between accuracy and interpretability
(Guidotti et al., 2018; Huysmans et al., 2011).
© 2019 Salvatore Ruggieri.
License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided
at http://jmlr.org/papers/v20/18-035.html.
In this paper, we will consider the search for a best feature subset S, i.e., such that
the estimated error err (DT (S)) on DT (S) is minimum among all possible feature subsets.
The size of the lattice of feature subsets is exponential in the number of available features.
The complete search of the lattice is known to be an NP-hard problem (Amaldi and Kann,
1998). For this reason, heuristic searches are typically adopted in practice. For instance,
the sequential backward elimination (SBE) heuristic starts with all features and
repeatedly eliminates one feature at a time as long as the estimated error does not increase. However,
complete strategies do not have to be exhaustive. In particular, feature subsets that lead
to duplicate decision trees can be pruned from the search space. A naïve approach that
stores all distinct trees found during the search is, however, infeasible, since there may be
an exponential number of such trees. Our first contribution is a non-trivial enumeration
algorithm DTdistinct of all distinct decision trees built using subsets of the available
features. The procedure requires the storage of a linear number of decision trees in the worst
case. The starting point is a recursive procedure for the visit of the lattice of all subsets of
features. The key idea is that a subset of features is denoted by the union R ∪ S of two sets,
where elements in R must necessarily be used as split attributes, and elements in S may
be used or not. Pruning of the search space is driven by the observation that if a feature
a ∈ S is not used as split attribute by a decision tree built on R ∪ S, then the feature subset
R ∪ S \ {a} leads to the same decision tree. Duplicate decision trees that still pass such a
(necessary but not sufficient) pruning condition can be identified through a test on whether
or not they use all features in R. An intriguing contribution of this paper consists in a
specific order of visit of the search space, for which a negligible fraction of trees actually
built are duplicates.
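As a point of reference for the heuristic mentioned above, sequential backward elimination can be sketched in a few lines. This is a minimal illustration and not the implementation evaluated in this paper: the function name `sbe`, the choice of always removing the feature whose removal gives the lowest estimated error, and the toy estimator `toy_err` are all assumptions made for the example. In the wrapper model, `err` would be the misclassification error of DT(S) on the search set.

```python
def sbe(features, err):
    """Sequential backward elimination over a set of feature names.

    err maps a frozenset of features to an estimated error (any estimator).
    """
    current = frozenset(features)
    current_err = err(current)
    while current:
        # Try removing each feature; keep the removal with the lowest error.
        candidates = [current - {a} for a in sorted(current)]
        best = min(candidates, key=err)
        best_err = err(best)
        if best_err > current_err:
            # Every possible removal increases the estimated error: stop.
            break
        current, current_err = best, best_err
    return current, current_err


def toy_err(S):
    """Hypothetical estimator: prefers two features, one of them being 'a1'."""
    return 2 * abs(len(S) - 2) + (0 if "a1" in S else 1)


selected, error = sbe({"a1", "a2", "a3", "a4"}, toy_err)
print(sorted(selected), error)
```

The sketch visits only a quadratic number of subsets, which is why it offers no guarantee of reaching a best feature subset.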
Enumeration of distinct decision trees can be used for finding the best feature subsets
with reference to an error estimation function. Our results will hold for any error estimation
function err (DT (S)). In experiments, we mainly adhere to the wrapper model (John et al.,
1994; Kohavi and John, 1997), and consider the misclassification error on a search set that
is not used for building the decision tree. The wrapper model for feature selection has
shown superior performance in many contexts (Doak, 1992; Bolón-Canedo et al., 2013).
We introduce the notion of a δ-acceptable feature subset, which leads to a decision tree
with an estimated error that departs by at most δ from the minimum estimated error over
any feature subset. Our second contribution is a complete search procedure DTacceptδ of
δ-acceptable and best (for δ = 0) feature subsets. The search builds on the enumeration of
distinct decision trees. It relies on a key pruning condition that is a conservative extension
of the condition above. If, for a feature a ∈ S, we have that err (DT (R ∪ S)) ≤ δ +
err (DT (R ∪ S \ {a})), then R ∪ S \ {a} can be pruned from the search with the guarantee
of only missing decision trees whose error is at most δ from the best error of visited trees.
Hence, visited feature subsets include acceptable ones.
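The two notions introduced above can be stated compactly. The following is only a restatement of the informal definitions of this section, writing F for the set of all available features (the symbol F is used with this meaning later in the paper).

```latex
\[
S \subseteq F \text{ is } \delta\text{-acceptable}
\iff
err(DT(S)) \;\le\; \delta + \min_{S' \subseteq F} err(DT(S')),
\]
\[
S \text{ is a best feature subset}
\iff
S \text{ is } 0\text{-acceptable}.
\]
```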
Both DTacceptδ and DTsbe are implemented in a multi-core data parallel C++
system, which is made publicly available. We report experiments on 20 benchmark datasets of
small-to-large dimensionality. Results confirm previous studies that oversearching increases
overfitting. In addition, they highlight that oversearching increases instability, namely
variability of the subset of selected features due to perturbation of the training set.
Moreover, we show that sequential backward elimination can improve the generalization error
of random forests for medium to large dimensional datasets. Such an experiment is made
possible only thanks to the computational speedup of DTsbe over SBE.
This paper is organized as follows. First, we recall related work in Section 2. The visit
of the lattice of feature subsets is based on a generalization of binary counting enumeration
of subsets devised in Section 3. Next, Section 4 introduces a procedure for the enumeration
of distinct decision trees as a pruning of the feature subset lattice. Complete search of
best and acceptable feature subsets is then presented in Section 5. Optimization of the
sequential backward elimination heuristic is discussed in Section 6. Experimental results
are presented in Section 7, with additional tables reported in Appendix A. Finally, we
summarize the contribution of the paper in the conclusions.
2. Related Work
Blum and Langley (1997); Dash and Liu (1997); Guyon and Elisseeff (2003); Liu and Yu
(2005); Bolón-Canedo et al. (2013) provide a categorization of approaches of feature subset
selection along the orthogonal axes of the evaluation criteria, the search strategies, and
the machine learning tasks. Common evaluation criteria include filter models, embedded
approaches, and wrapper approaches. Filters are pre-processing algorithms that select a
subset of features by looking at the data distribution, independently from the induction
algorithm (Cover, 1977). Embedded approaches perform feature selection in the process of
training and are specific to the learning algorithm (Lal et al., 2006). Wrapper approaches
optimize the performance of the induction algorithm as part of feature selection (Kohavi and John,
1997). In particular, training data is split into a building set and a search set, and the space
of feature subsets is explored. For each feature subset considered, the building set is used
to train a classifier, which is then evaluated on the search set. Search space exploration
strategies include (Doak, 1992): hill-climbing search (forward selection, backward
elimination, bidirectional selection, beam search, genetic search), random search (random start
hill-climbing, simulated annealing, Las Vegas), and complete search. The aim of complete
search is to find a feature subset that optimizes an evaluation metric. Typical objectives
include minimizing the size of the feature subset provided that the classifier built from it
has an accuracy greater than or equal to a given threshold (dimensionality reduction), or
minimizing the empirical misclassification error of the classifier on the search set (performance
maximization). Finally, feature subset selection has been considered for classification,
regression, and clustering tasks. Machine learning models and algorithms can be either treated
as black-boxes or, instead, feature selection methods can be specific to the model and/or
algorithm at hand (white-box). White-box approaches are less general, but can exploit
assumptions on the model or algorithm to direct and speed up the search. For instance, the
best k-subset problem for linear regression (Miller, 2002) smoothly generalizes the linear
regression problem to find out the subset of up to k features that best predict an independent
variable.
Only complete space exploration can provide the guarantee of finding best feature
subsets with respect to a given error estimation function. Several estimators have been proposed
in the literature, including: the empirical misclassification error on the training set or on the
search set; estimators adopted for tree simplification (Breslow and Aha, 1997; Esposito
et al., 1997); bootstrap and cross-validation (Kohavi, 1995; Stone, 1997); and the recent
jeff method (Fan, 2016), which is specific to decision tree models. Heuristic search
approaches can lead to results arbitrarily worse than the best feature subset (Murthy, 1998).
Complete search is known to be NP-hard (Amaldi and Kann, 1998). However, complete
strategies do not need to be exhaustive in order to find a best feature subset. For instance,
filter models can rely on monotonic evaluation metrics to support Branch & Bound search
(Liu et al., 1998). Regarding wrapper approaches, the empirical misclassification error lacks
the monotonicity property that would allow for pruning the search space in a complete
search. Approximate Monotonicity with Branch & Bound (AMB&B) (Foroutan and
Sklansky, 1987) tries to tackle this limitation, but it provides no formal guarantee that a
best feature subset is found. Another form of search space pruning in wrapper approaches
for decision trees has been pointed out by Caruana and Freitag (1994), who examine five
hill-climbing procedures. They adopt a caching approach to prevent re-building duplicate
decision trees. The basic property they observe is reported in a generalized form in this
paper as Remark 6. While caching improves on the efficiency of a limited search, in the
case of a complete search, it requires an exponential number of decision trees to be stored
in cache, while our approach requires a linear number of them. We will also observe that
Remark 6 may still leave duplicate trees in the search space, i.e., it is a necessary but not
sufficient condition for enumerating distinct decision trees, while we will provide an exact
enumeration and, in addition, a further pruning of trees that cannot lead to best/acceptable
feature subsets.
A problem related to the focus of this paper regards the construction of optimal decision
trees using non-greedy algorithms. In such a problem, the structure of a decision tree and
the split attributes are determined at once as a global optimization problem. Bertsimas
and Dunn (2017) and Menickelly et al. (2016); Verwer and Zhang (2017) formulate tree
induction as a mixed-integer optimization problem and as an integer programming problem
respectively. The optimization function is the misclassification error on the training set,
possibly regularized with respect to decision tree size. Other approaches, e.g. Narodytska
et al. (2018), encode tree construction as a constraint solving problem, with the aim of
minimizing tree size. The search space of the optimal decision tree problem is larger than
in the best feature subset problem. The former is exponential in the product of the number
of features and the maximal tree depth, while the latter is exponential only in the number
of features. An optimal decision tree may not be obtainable from a fixed greedy algorithm
on any feature subset.
This paper significantly extends the preliminary results that appeared in Ruggieri (2017)
in several directions. First, it improves on the enumeration procedure of distinct decision
trees. The new ordering of visits in the search space has a clear theoretical justification, and
an overhead (duplicated trees built) close to zero for all experimental datasets. Second, the
paper introduces the notion of δ-acceptable feature subsets, which depart from best feature
subsets by at most δ in estimated error, and a novel algorithm that further prunes the
enumeration of distinct decision trees to find out δ-acceptable feature subsets. Moreover,
our approach applies to any error estimation function. Third, the experimental section
(and an appendix with additional tables) now includes a comprehensive set of results on
a larger collection of benchmark datasets. Fourth, the implementation of all proposed
algorithms is now multi-core parallel, and it is publicly available. It reaches run-time
efficiency improvements of up to 7× on an 8-core computer.
3. Enumerating Subsets
Let S = {a1, . . . , an} be a set of n elements, with n ≥ 0. The powerset of S is the set of
its subsets: Pow (S) = {S′ | S′ ⊆ S}. There are 2^n subsets of S, and, for 0 ≤ k ≤ n, there
are $\binom{n}{k}$ subsets of size k. Figure 1 (left) shows the lattice (w.r.t. set inclusion) of subsets
for n = 3. The order of visit of the lattice, or, equivalently, the order of enumeration of
elements in Pow (S), can be of primary importance for problems that explore the lattice as a
search space. Well-known algorithms for subset generation produce lexicographic ordering,
Gray code ordering, or binary counting ordering (Skiena, 2008). Binary counting maps each
subset into a binary number with n bits by setting the ith bit to 1 iff ai belongs to the subset,
and generating subsets by counting from 0 to 2^n − 1. Subsets for n = 3 are generated as
{}, {a3}, {a2}, {a2, a3}, {a1}, {a1, a3}, {a1, a2}, {a1, a2, a3}. In this section, we introduce a
recursive algorithm for a generalization of reverse binary counting (namely, counting from
2^n − 1 down to 0) that will be the building block for solving the problem of generating
distinct decision trees. Let us start by introducing the notation R ⋈ P = ∪_{S′∈P} {R ∪ S′} to
denote sets obtained by the union of R with elements of P. In particular:
\[ R \bowtie \mathit{Pow}(S) = \bigcup_{S' \subseteq S} \{ R \cup S' \} \]
For n = 3, the powerset can be decomposed as:
\[ \mathit{Pow}(\{a_1, a_2, a_3\}) = (\{a_1, a_2, a_3\} \bowtie \mathit{Pow}(\emptyset)) \cup (\{a_1, a_2\} \bowtie \mathit{Pow}(\emptyset)) \cup (\{a_1\} \bowtie \mathit{Pow}(\{a_3\})) \cup (\emptyset \bowtie \mathit{Pow}(\{a_2, a_3\})) \]
Since R ⋈ Pow (∅) = {R}, the leftmost set in the above union is {{a1, a2, a3}}. In general,
the following recurrence relation holds.
Lemma 1 Let S = {a1, . . . , an}. We have:
\[ R \bowtie \mathit{Pow}(S) = \{R \cup S\} \cup \bigcup_{i=n,\ldots,1} (R \cup \{a_1, \ldots, a_{i-1}\}) \bowtie \mathit{Pow}(\{a_{i+1}, \ldots, a_n\}) \]
[Figure 1: (left) lattice of subsets for n = 3; (right) search space of Algorithm 1 for n = 3, with root node ∅ ⋈ Pow({a1, a2, a3}).]
Proof The proof is by induction on n. The base case n = 0 is trivial: R ⋈ Pow (∅) = {R} by
definition. Consider now n > 0. Since Pow (S) = ({a1} ⋈ Pow (S \ {a1})) ∪ Pow (S \ {a1}),
we have: R ⋈ Pow (S) = R ⋈ (({a1} ⋈ Pow ({a2, . . . , an})) ∪ (∅ ⋈ Pow ({a2, . . . , an}))).
Since the ⋈ operator satisfies:
\[ R \bowtie (P_1 \cup P_2) = (R \bowtie P_1) \cup (R \bowtie P_2) \quad \text{and} \quad R_1 \bowtie (R_2 \bowtie P) = (R_1 \cup R_2) \bowtie P \]
we have: R ⋈ Pow (S) = ((R ∪ {a1}) ⋈ Pow ({a2, . . . , an})) ∪ (R ⋈ Pow ({a2, . . . , an})). By
induction hypothesis on the leftmost occurrence of ⋈:
\[
\begin{aligned}
R \bowtie \mathit{Pow}(S) ={} & \{R \cup \{a_1\} \cup \{a_2, \ldots, a_n\}\} \;\cup \\
& \bigcup_{i=n,\ldots,2} (R \cup \{a_1\} \cup \{a_2, \ldots, a_{i-1}\}) \bowtie \mathit{Pow}(\{a_{i+1}, \ldots, a_n\}) \;\cup \\
& R \bowtie \mathit{Pow}(\{a_2, \ldots, a_n\}) \\
={} & \{R \cup S\} \cup \bigcup_{i=n,\ldots,1} (R \cup \{a_1, \ldots, a_{i-1}\}) \bowtie \mathit{Pow}(\{a_{i+1}, \ldots, a_n\})
\end{aligned}
\]
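As a small worked instance of the recurrence of Lemma 1, consider S = {a1, a2} and an arbitrary R. The unrolling below applies the lemma twice (once for n = 2 and once for the inner term with n = 1) and uses only the identity R ⋈ Pow(∅) = {R}.

```latex
\[
\begin{aligned}
R \bowtie \mathit{Pow}(\{a_1, a_2\})
 &= \{R \cup \{a_1, a_2\}\}
    \;\cup\; \underbrace{(R \cup \{a_1\}) \bowtie \mathit{Pow}(\emptyset)}_{i=2}
    \;\cup\; \underbrace{R \bowtie \mathit{Pow}(\{a_2\})}_{i=1} \\
 &= \{R \cup \{a_1, a_2\}\}
    \;\cup\; \{R \cup \{a_1\}\}
    \;\cup\; \bigl( \{R \cup \{a_2\}\} \cup R \bowtie \mathit{Pow}(\emptyset) \bigr) \\
 &= \{\, R \cup \{a_1, a_2\},\; R \cup \{a_1\},\; R \cup \{a_2\},\; R \,\}.
\end{aligned}
\]
```

Reading the sets from left to right gives the reverse counting order in which they are enumerated.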
This result can be readily translated into a procedure subset(R, S) for the enumeration
of elements in R ⋈ Pow(S). In particular, since ∅ ⋈ Pow(S) = Pow(S), subset(∅, S)
generates all subsets of S. The procedure is shown as Algorithm 1. The search space of the
procedure is the tree of the recursive calls of the procedure. The search space for n = 3 is
reported in Figure 1 (right). According to line 1 of Algorithm 1, the subset outputted at a
node labelled as R ⋈ Pow(S) is R ∪ S. Hence, the output for n = 3 is the reverse counting
ordering: {a1 , a2 , a3 }, {a1 , a2 }, {a1 , a3 }, {a1 }, {a2 , a3 }, {a2 }, {a3 }, {}. Two key properties
of Algorithm 1 will be relevant for the rest of the paper.
Remark 2 A set R′ ∪ S′ generated at a non-root node of the search tree of Algorithm 1
is obtained by removing an element from the set R ∪ S generated at its father node. In
particular, R′ ∪ S′ = R ∪ S \ {a} for some a ∈ S.
The invariant |R′ ∪ S′| = |R ∪ S| readily holds for the loop at lines 4–8 of Algorithm 1.
Before the recursive call at line 6, an element of S is removed from R′, hence the set
R′ ∪ S′ outputted at a child node has one element less than the set R ∪ S outputted at its
father node.
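The enumeration procedure described above can be sketched in Python directly from the recurrence of Lemma 1. This is an illustrative sketch, not the listing of Algorithm 1: the generator form, the 0-based indexing, and the helper `reverse_binary_counting` (added only to compare the two orderings) are assumptions of this example.

```python
def subset(R, S):
    """Yield the sets in R ⋈ Pow(S), in the order induced by Lemma 1."""
    R, S = list(R), list(S)
    yield frozenset(R + S)                    # the node itself outputs R ∪ S
    # i = n, ..., 1 in the 1-based notation of Lemma 1 (here 0-based).
    for i in range(len(S) - 1, -1, -1):
        # Recurse on (R ∪ {a_1..a_{i-1}}) ⋈ Pow({a_{i+1}..a_n}): a_i is dropped.
        yield from subset(R + S[:i], S[i + 1:])


def reverse_binary_counting(S):
    """Subsets of S by counting from 2^n - 1 down to 0 (a_n is the lowest bit)."""
    n = len(S)
    for k in range(2 ** n - 1, -1, -1):
        yield frozenset(a for j, a in enumerate(S) if (k >> (n - 1 - j)) & 1)


for s in subset([], ["a1", "a2", "a3"]):
    print(sorted(s))
```

For three elements the loop prints the subsets in the reverse counting order listed above, from {a1, a2, a3} down to the empty set.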
While the assumption regards univariate splits, it can be restated for bi-variate or
multi-variate split conditions, and the theoretical results in this paper can be adapted to such
general cases. However, since we build on software that deals with univariate splits only
(see Section 7), experiments are restricted to such a case. Moreover, the results will hold for
any quality measure f () as long as the split attributes are chosen as the ones that maximize
f (). Examples of quality measures used in this way include Information Gain (IG), Gain
Ratio¹ (GR), and the Gini index, which are adopted in the C4.5 (Quinlan, 1993) and in the
CART systems (Breiman et al., 1984). A second assumption regards the stopping criterion
in top-down decision tree construction. Let stop(S, C) be the boolean result of the stopping
criterion at a node with cases C and predictive features S.
Assumption 5 If stop(S, C) = true then stop(S′, C) = true for every S′ ⊆ S.
The assumption states that either: (1) the stopping criterion does not depend on S; or,
if it does, then (2) stopping is monotonic with regard to the set of predictive features. (1)
is a fairly general assumption, since typical stopping criteria are based on the size of cases
C at a node and/or on the purity of the class attribute in C. We will later on consider the
stopping criterion of C4.5 which halts tree construction if the number of cases of the training
set reaching the current node is lower than a minimum threshold m (formally, stop(S, C) is
true iff |C| < m). Another widely used stopping criterion satisfying (1) consists of setting a
maximum depth of the decision tree. (2) applies to criteria which require minimum quality
of features for splitting a node. For example, the C4.5 additional criterion of stopping if IG of
all features is below a minimum threshold satisfies the assumption. The following remark,
which is part of the decision tree folklore (see e.g., Caruana and Freitag (1994)), states a
useful consequence of Assumptions 4 and 5. Removing any feature not used in a decision
tree from the initial set of features does not affect the result of tree building.
Lemma 6 Let features(T ) denote the set of split attributes in a decision tree T = DT (S).
For every S′ such that S ⊇ S′ ⊇ features(T ), DT (S′) = T .
Proof If a decision tree T built from S uses only features from U = features(T ) ⊆ S, then
at any decision node of T it must be true that argmax_{a∈S} f (a, C) = argmax_{a∈U} f (a, C).
Hence, removing any unused attribute in S \ U will not change the result of maximizing
the quality measure and then, by Assumption 4, the split attribute at a decision node.
Moreover, by Assumption 5, a leaf node in T will remain a leaf node for any subset of S.
1. Gain Ratio normalizes Information Gain over the Split Information (SI) of an attribute, i.e., GR =
IG/SI. This definition does not work well for attributes which are (almost) constants over the cases C,
i.e., when SI ≈ 0. Quinlan (1986) proposed the heuristic of restricting the evaluation of GR only to
attributes with above average IG. The heuristic is implemented in the C4.5 system (Quinlan, 1993). It
clearly breaks Assumption 4, making the selection of the split attribute dependent on the set S. A
heuristic that satisfies Assumption 4 consists of restricting the evaluation of GR only for attributes with
IG higher than a minimum threshold.
The simplified recurrence relation prunes from the search space feature subsets that
lead to duplicated decision trees. However, we will show in Example 1 that such a pruning
alone is not sufficient to generate distinct decision trees only, i.e., duplicates may still exist.
Algorithm 2 provides an enumeration of all and only the distinct decision trees. It builds
on the subset generation procedure. Line 1 constructs a tree T from features R ∪ S. Features
in the set S \ U of unused features in T are not iterated over in the loop at lines 8–12, since
those iterations would yield the same tree as T . This is formally justified by the modified
recurrence relation above. The tree T is outputted at line 4 only if R ⊆ U , namely features
required to be used (i.e., R) are actually used in decision node splits. We will show that
such a test characterizes a uniqueness condition for all feature subsets that lead to the same
decision tree. Hence, it prevents outputting more than once a decision tree that can be
obtained from multiple paths of the search tree.
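The enumeration just described can be sketched on a toy learner. The code below is an illustrative sketch and not the listing of Algorithm 2 nor the C++ system of this paper: the tiny Information Gain learner `build`, its tie-breaking by feature name, the nested-tuple tree encoding, the hypothetical dataset `rows`, and the iteration over S ∩ U in alphabetical order (instead of the ranking adopted by DTdistinct) are all assumptions of this example. The learner is written so that the split choice and the stopping rule do not depend on unused features, in the spirit of Assumptions 4 and 5.

```python
from collections import Counter
from itertools import combinations
from math import log2


def entropy(rows):
    """Shannon entropy of the class labels in rows = [(x, y), ...]."""
    counts = Counter(y for _, y in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())


def partition(rows, a):
    """Group rows by their value of feature a."""
    parts = {}
    for x, y in rows:
        parts.setdefault(x[a], []).append((x, y))
    return parts


def gain(rows, a):
    """Information gain of splitting on a, rounded so that ties are exact."""
    rest = sum(len(p) / len(rows) * entropy(p) for p in partition(rows, a).values())
    return round(entropy(rows) - rest, 9)


def build(rows, feats, m=2):
    """Toy greedy tree. A leaf is a class label; a decision node is
    (feature, ((value, subtree), ...)). Ties go to the smallest feature name."""
    majority = Counter(y for _, y in rows).most_common(1)[0][0]
    if len(rows) < m or not feats:
        return majority
    best = max(sorted(feats), key=lambda a: gain(rows, a))
    if gain(rows, best) <= 0:
        return majority
    kids = sorted(partition(rows, best).items())
    return (best, tuple((v, build(p, feats, m)) for v, p in kids))


def features(tree):
    """Set of split attributes used in a tree."""
    if not isinstance(tree, tuple):
        return frozenset()
    a, kids = tree
    return frozenset({a}).union(*(features(t) for _, t in kids))


def dt_distinct(rows, R, S):
    """Yield the distinct trees built on feature sets in R ⋈ Pow(S)."""
    T = build(rows, R | S)
    U = features(T)
    if R <= U:                        # output only if all required features are used
        yield T
    R2, S2 = R | (S & U), S - U       # unused features are never iterated over
    for a in sorted(S & U):           # any order is correct; this one is arbitrary
        R2 = R2 - {a}
        yield from dt_distinct(rows, R2, S2)
        S2 = S2 | {a}


# Hypothetical dataset: (a1, a2, a3, class).
data = [(0, 0, 0, 0), (0, 1, 0, 0), (1, 0, 0, 1), (1, 1, 0, 1), (0, 0, 1, 1),
        (1, 0, 1, 0), (1, 1, 1, 0), (1, 0, 1, 0), (1, 1, 1, 0)]
rows = [({"a1": a1, "a2": a2, "a3": a3}, y) for a1, a2, a3, y in data]
F = frozenset({"a1", "a2", "a3"})

trees = list(dt_distinct(rows, frozenset(), F))
brute = {build(rows, frozenset(c))
         for k in range(len(F) + 1) for c in combinations(sorted(F), k)}
print(len(trees), "trees enumerated;", len(brute), "distinct trees by brute force")
```

The brute-force set is computed only to cross-check the sketch; the point of the enumeration is precisely to avoid storing all distinct trees.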
Example 1 Let F = {a1, a2, a3}. Assume that a1 has no discriminatory power unless data
has been split by a3. More formally, DT (S) = DT (S \ {a1}) if a3 ∉ S. The visit of feature
subsets of Figure 1 (right) gives rise to the trees built by DTdistinct(∅, F ) as shown in
Figure 2 (left). For instance, the subset {a1, a2} visited at the node labelled {a1, a2} ⋈ ∅
in Figure 1 (right), produces the decision tree DT ({a1, a2}). By assumption, such a tree
is equal to DT ({a2}), which is a duplicate tree produced in another node – underlined in
Figure 2 (left) – corresponding to the feature set visited at the node labelled {a2} ⋈ ∅.
Another example regarding DT ({a1}) = DT (∅) is shown in Figure 2 (left), together with its
underlined duplicated tree. Unique trees for two or more duplicates are characterized by the
fact that features appearing to the left of ⋈ must necessarily be used as split features by the
constructed decision tree. In the two previous example cases, the nodes underlined output
their decision trees, while the other duplicates do not pass the test at line 3 of Algorithm 2.
[Figure 2: trees built by DTdistinct(∅, F ) for Example 1; duplicated trees are underlined in the left panel.]
Theorem 7 DTdistinct(R, S) outputs the distinct decision trees built on sets of features
in R ⋈ Pow(S).
Proof The search space of DTdistinct is a pruning of the search space of subset. Every
tree built at a node and outputted is then constructed from a subset in R ⋈ Pow(S). By
Remark 3, the order of selection of ai ∈ S ∩ U at line 8 is irrelevant, since any order will
lead to the same space R ⋈ Pow(S).
Let us first show that decision trees in output are all distinct. The key observation
here is that, by line 4, all features in R are used as split features in the outputted
decision tree. The proof proceeds by induction on the size of S. If |S| = 0, then there
is at most one decision tree in output, hence the conclusion. Assume now |S| > 0, and
let S = {a1, . . . , an}. By Lemma 1, any two recursive calls at line 10 have parameters
(R ∪ {a1, . . . , ai−1}, {ai+1, . . . , an}) and (R ∪ {a1, . . . , aj−1}, {aj+1, . . . , an}), for some i < j.
Observe that ai is missing as a predictive attribute in the trees in output from the first
call, while by inductive hypothesis it must be a split attribute in the trees in output by the
second call. Hence, the trees in output from recursive calls are all distinct among them.
Moreover, they are all different from T = DT (R ∪ S), because recursive calls do not include
some feature ai ∈ S ∩ U that is by definition used in T .
Let us now show that trees pruned at line 8 or at line 4 are already outputted elsewhere,
which implies that every distinct decision tree is outputted at least once. First,
by Lemma 6, the trees that would have been outputted in the pruned iterations at line 8
(i.e., for ai ∈ S \ U ) are equal to the tree T = DT (R ∪ S). Second, if the tree T is not
outputted at line 4, because R ⊈ U , we have that it is outputted at another node of the
search tree. The proof is by induction on |R|. For |R| = 0 it is trivial, because the
premise R ⊈ U does not hold. Let R = {a1, . . . , an}, with n > 0, and let a1, . . . , an be
in the order they have been added by recursive calls. Fix R′ = {a1, . . . , ai−1} such that
ai ∉ U and R′ ⊆ U . There is a sibling node or a sibling of an ancestor node in the search
tree corresponding to a call with parameters R′ and S′ ⊇ {ai+1, . . . , an} ∪ S. By inductive
hypothesis on |R′| < |R|, the distinct decision trees with features in R′ ⋈ Pow(S′) are
all outputted, including T because T has split features in R ∪ S \ {ai} which belongs to
R′ ⋈ Pow(S′).
The proof of Theorem 7 does not assume any specific order at line 8 of Algorithm 2.
Any order would produce the enumeration of distinct decision trees – this is a consequence
of Remark 3. However, the order may affect the size of the search space.
Algorithm 2 adopts a specific order intended to be effective in pruning the search space.
In particular, our objective is to minimize the number of duplicated trees built. In fact,
even though duplicates are not outputted, building them has a computational cost that
should be minimized. Duplicated decision trees are detected through the test R ⊆ U at line
4 of Algorithm 2. Thus, we want to minimize the number of recursive calls DTdistinct(R′,
S′) where attributes in R′ have lower chances of being used. Since required attributes are
removed from R′ one at a time at line 9 (and added to the set of possibly used attributes
S′ at line 11), this means ranking attributes in S ∩ U by increasing chances of being used
in decision trees built in recursive calls. How do we estimate such chances?
Example 3 Consider the sample decision tree at Figure 3(a). It is built on the set of
features F = {a1, a2, a3, a4}. Which attributes have the highest chance of being used if
included in a subset of F ? Whenever included in the subset, a1 will be certainly used at the
root node. In fact, it already maximizes the quality measure on F , hence by Assumption 4 it
will also maximize the quality measure over any subset of F . Assume now a1 is selected for
inclusion. Both a2 and a3 will be certainly selected as split attributes, for the same reason
as above. Which one should be preferred first? Let us look at the sub-trees rooted at a2 and
a3. If a2 is selected, then the sub-tree rooted at a2 uses no further attribute. In fact, a1
cannot be counted as a further attribute because it is known to be already used. Conversely,
if a3 is selected, the sub-tree rooted at a3 ensures that attribute a4 can be used. Therefore,
having a3 gives more chances of using further attributes in decision tree building, and
a3 should be selected for inclusion before a2. Finally, sub-trees rooted at a2 and a4 use no
further attributes, and we break the tie by selecting a2 before a4. In summary, the rank of
attributes in F with increasing chances of being used is a4, a2, a3, a1.
[Figure 3, panels (a), (b), (c)]
Figure 3: A sample decision tree, and two subtrees replaced with an oracle ∆. Internal
nodes are labelled with split attributes, and leaves are labelled with class value.
Let us formalize the intuition of this example. We define an a-frontier node of a decision
tree T w.r.t. a set R′ of features, as a decision node that uses a ∉ R′ for the first time in
a path from the root to a node. A frontier node is any a-frontier node, for any feature a.
Based on the previous example, we should count the number of attributes that are used in
sub-trees rooted at frontier nodes.
Definition 8 We define frontier (T, R′, a) as the number of distinct features not in R′ that
are used in sub-trees of T rooted at a-frontier nodes of T w.r.t. R′.
Notice that attributes in R′ are excluded from the counting in frontier (). The idea is
that R′ will include attributes that must already appear somewhere in a decision tree (in
between the root and frontier nodes), and thus their presence in the sub-tree rooted at a
does not imply further usability of such attributes. We are now in the position to introduce
the ranking rk_frontier based on ascending frontier ().
The definition of the ranking iterates from the last to the first position, and at each step
selects the feature which maximizes the frontier () measure. Iteration is necessary due to the
fact that the frontier nodes depend on the features selected at the previous step. Intuitively,
the ordering tries to keep as much as possible sub-trees with large sets of not yet ranked
attributes. This is in line with the objective of ranking features based on increasing chances
of being used in decision trees of recursive calls.
[Figure 4: Distinct decision trees and overhead of DTdistinct on the Adult dataset (IG split
criterion). (a) Distribution of distinct decision trees w.r.t. subset size, for m = 8, m = 32,
and m = 128, compared with the binomial distribution. (b) Ratio duplicate/distinct decision
trees w.r.t. m; series: DTdistinct, reverse, fixed, exhaustive.]
Example 4 Reconsider Example 3 and the decision tree T = DT(F) in Figure 3(a). Assume
that R = ∅ and S = F = {a1, a2, a3, a4}. The set of used features is U = F.
The ranking of attributes starts by defining r4 = argmax_{a ∈ {a1, a2, a3, a4}} frontier(T, ∅, a),
which is trivially the split attribute a1 at the root node of T, the only frontier node of T
w.r.t. ∅. Next, r3 = argmax_{a ∈ {a2, a3, a4}} frontier(T, {a1}, a) is defined by looking at the
frontier nodes, which are those using a2 and a3. As discussed in Example 3,
frontier(T, {a1}, a2) = 1 and frontier(T, {a1}, a3) = 2. Thus, we have r3 = a3. At the third
step, r2 = argmax_{a ∈ {a2, a4}} frontier(T, {a1, a3}, a) is defined by looking at the frontier
nodes using a2 and a4. Both have a frontier of 1, so we fix r2 = a2. Finally, r1 must
necessarily be a4. Summarizing, the ordering provided by rk_frontier(T) is a4, a2, a3, a1,
as stated in Example 3.
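The last-to-first construction of the ranking can be sketched in a few lines of Python. This is only an illustration, not the implementation used in the paper: the function name `rank_by_frontier` and the callback `frontier(selected, a)` are hypothetical stand-ins for frontier(T, ·, a), and the lookup table reproduces only the frontier values quoted in Example 4; the root value and the zero default are assumptions made so that the sketch runs.

```python
def rank_by_frontier(features, frontier):
    """Order features as r_1, ..., r_N, filling positions from last to first.

    `frontier(selected, a)` is a caller-supplied stand-in for frontier(T, selected, a).
    """
    selected = []                  # r_N, r_{N-1}, ... in order of selection
    remaining = list(features)
    while remaining:
        # max() returns the first maximal candidate, which fixes the tie-break.
        best = max(remaining, key=lambda a: frontier(frozenset(selected), a))
        selected.append(best)
        remaining.remove(best)
    return selected[::-1]


# Frontier values for the tree of Example 4. Only the values 1, 2, 1, 1 are quoted in
# the text; the root value (4) and the default of 0 are illustrative assumptions.
EXAMPLE_4 = {
    (frozenset(), "a1"): 4,
    (frozenset({"a1"}), "a2"): 1,
    (frozenset({"a1"}), "a3"): 2,
    (frozenset({"a1", "a3"}), "a2"): 1,
    (frozenset({"a1", "a3"}), "a4"): 1,
}


def example_frontier(selected, a):
    return EXAMPLE_4.get((selected, a), 0)


ranking = rank_by_frontier(["a1", "a2", "a3", "a4"], example_frontier)
print(ranking)  # ['a4', 'a2', 'a3', 'a1']
```

The tie between a2 and a4 at the third step is resolved here by candidate order, which matches the choice r2 = a2 made in the example; any other deterministic tie-break would serve equally well.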
Property 1: linear space complexity. Consider the set F of all features, with N
elements. DTdistinct(∅, F) is linear in space in N, where space is measured in number of
trees built. In fact, there are at most N nested calls, since the size of the second parameter
decreases at each call. Hence, at most N decision trees are built and stored across the
nested calls. An exhaustive search would instead keep in memory all the distinct decision
trees built, in order to check whether a new decision tree is a duplicate. The same holds for
applications based on complete search that exploit duplicate pruning through caching of
duplicates (Caruana and Freitag, 1994). Those approaches require exponential space, since
the number of distinct trees can be exponential, as shown in the next example.
Example 5 Let us consider the well-known Adult dataset (Lichman, 2013), consisting of
48,842 cases, with N = 14 predictive features and a binary class (see Section 7 for the
experimental settings). Figure 4(a) shows, for the IG split criterion, the distribution of the
number of distinct decision trees w.r.t. the size of the feature subset. The distributions are
plotted for various values of the stopping parameter m (formally, stop(S, C) is true iff
|C| < m). For low values of m, the distribution approaches the binomial; hence, the number
of distinct decision trees approaches 2^N.
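To make the binomial reference curve concrete, the following short Python computation counts the feature subsets of each size for N = 14, i.e., the upper bound that the number of distinct trees approaches for low values of m. It is plain arithmetic and does not involve any data from the experiments.

```python
from math import comb

N = 14  # predictive features in the Adult dataset

# Number of feature subsets of each size k = 0, ..., N (the binomial curve).
subsets_by_size = [comb(N, k) for k in range(N + 1)]
total_subsets = sum(subsets_by_size)

print(subsets_by_size[7])  # 3432 subsets of size 7, the peak of the curve
print(total_subsets)       # 16384, i.e. 2**14
```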
[Figure 5: (a) Average elapsed time for tree building. (b) Total elapsed time.]
Example 6 (Ctd.) Figure 4(b) shows the overhead at the variation of m for three possible
orderings of selection at line 8 of Algorithm 2. The first is the ordering stated by DTdistinct,
based on rk_frontier(). The second is the reversed order, namely an, ..., a1 when rk_frontier()
yields a1, ..., an. The third is based on assigning a static index i ∈ [1, N] to each feature
ai, and then ordering by i. The rk_frontier() ordering used by DTdistinct is impressively
effective, with a ratio of almost 1 everywhere.
The effectiveness of the rk_frontier() ordering will be confirmed in the experimental section.
Figure 4(b) also reports the ratio of the number of trees in an exhaustive search (which
are 2^N for N features) over the number of distinct trees. Smaller values of m lead to a
smaller ratio, because the trees built are larger and hence there are more distinct ones.
Thus, for small values of m, pruning duplicate trees alone does not guarantee a considerably
more efficient enumeration than exhaustive search. The next property will help in such cases.
Example 7 (Ctd.) Figure 5(a) shows the average elapsed time required to build a decision
tree w.r.t. the size of the feature subset, for the fixed parameter m = 8. The exhaustive
search requires an average time linear in the size of the feature subset. Due to incremental
tree building, DTdistinct requires instead a sub-linear time.
Example 8 (Ctd.) Figure 5(b) contrasts the total elapsed times of exhaustive search and
DTdistinct. For small values of m, the number of trees built by exhaustive search
approaches the number of distinct decision trees (see Figure 4(b)). Nevertheless, the running
time of DTdistinct is consistently better than that of the exhaustive search. This
improvement is due to the incremental building of decision trees. The computational
efficiency in terms of absolute elapsed times is, in addition, due to the effectiveness of the
parallel implementation, which in the example at hand runs on a low-cost 8-core machine
reaching a 7× speedup.
Definition 10 Let F be a set of features. We define err_bst = min_{S ⊆ F} err(DT(S)), and
call it the best estimated error w.r.t. feature subsets. For δ ≥ 0, a feature subset S ⊆ F is
δ-acceptable if:
    err(DT(S)) ≤ δ + err_bst.
When δ = 0, we call S a best feature subset. Finally, DT(S) is called a δ-acceptable decision
tree.
The δ-acceptable feature subset problem consists of finding a δ-acceptable feature subset.
In particular, for δ = 0, it consists of finding a best feature subset.
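Definition 10 can be illustrated by brute force on a toy table of estimated errors. The sketch below assumes that err(DT(S)) has already been computed for every subset S and stored in a dictionary; the error values and the helper names `best_error` and `acceptable_subsets` are invented for illustration and are not taken from the experiments.

```python
def best_error(errors):
    """err_bst: the minimum estimated error over all feature subsets."""
    return min(errors.values())


def acceptable_subsets(errors, delta):
    """All subsets S with err(DT(S)) <= delta + err_bst."""
    err_bst = best_error(errors)
    return {S for S, err in errors.items() if err <= delta + err_bst}


# Hypothetical estimated errors err(DT(S)) for the subsets of {a1, a2}.
errors = {
    frozenset(): 0.30,
    frozenset({"a1"}): 0.20,
    frozenset({"a2"}): 0.25,
    frozenset({"a1", "a2"}): 0.21,
}

best = acceptable_subsets(errors, delta=0.0)    # the best feature subsets
close = acceptable_subsets(errors, delta=0.02)  # the 0.02-acceptable feature subsets
print(sorted(sorted(S) for S in best))   # [['a1']]
print(sorted(sorted(S) for S in close))  # [['a1'], ['a1', 'a2']]
```

This enumeration is exponential in the number of features, which is exactly what the pruning strategies of the paper are designed to avoid.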
We make no assumption on the error estimation function err(T). In experiments, unless
otherwise stated, we adhere to the wrapper model (John et al., 1994; Kohavi and John,
1997), by assuming that the available training set is split into a building set, used to build
the decision trees T = DT(S) on a subset S of features F, and a search set. Error is
estimated as the empirical misclassification error on the search set, and it is computed
using C4.5's distribution imputation method³.
Algorithm 4 greedyδ(U, R, S)
1: W ← U
2: for ai ∈ S ∩ U order by rk_frontier(T) do
3:   Ŵ ← W \ {ai}
4:   if eacc ≤ lberr(R, S ∩ Ŵ) + δ then
5:     W ← Ŵ
6:   end if
7: end for
8: return W
Algorithm 3 builds on the procedure for the enumeration of distinct decision trees by
implementing a further pruning of the search space. In particular, a call DTacceptδ(R, S)
searches for a δ-acceptable feature subset Sacc among all subsets in R ⋈ Pow(S). The
global variable eacc stores the best error estimate found so far, and it is initialized to ∞
outside the call. The structure of DTacceptδ follows that of Algorithm 2, from which it
differs in two main points.
The first difference regards lines 4-8 which, instead of just outputting the feature subset,
update the best error estimate found so far in case the estimated error of T is lower than
or equal⁴ to it. The set of features Sacc is also updated.
3. The prediction for an instance with no missing values follows a path from the decision tree root to a
leaf node. For instances with a missing value of the attribute tested at a decision node, several options
are available (Saar-Tsechansky and Provost, 2007; Twala, 2009). In C4.5, all branches of the decision
node are followed, and the prediction of a leaf in a branch contributes in proportion to the weight of the
branch's child node (fraction of cases reaching the decision node that satisfy the test outcome of the
child). The class value predicted for the instance is the one with the largest total contribution.
4. We break ties in favor of smaller feature subsets.
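The distribution imputation scheme described in footnote 3 can be sketched as follows. This is a simplified illustration rather than C4.5's actual code: the nested-dictionary tree layout, the function names, and the toy tree with its branch fractions are all assumptions, and each leaf contributes a single class label, as in the footnote's description.

```python
from collections import defaultdict


def class_weights(node, instance, weight=1.0, totals=None):
    """Accumulate, per class, the weight reaching each leaf for `instance`."""
    if totals is None:
        totals = defaultdict(float)
    if "label" in node:                       # leaf: contribute the incoming weight
        totals[node["label"]] += weight
        return totals
    value = instance.get(node["feature"])     # None means the value is missing
    if value is None:
        # Follow every branch, scaling by the fraction of cases that took it.
        for fraction, child in node["branches"].values():
            class_weights(child, instance, weight * fraction, totals)
    else:
        _, child = node["branches"][value]
        class_weights(child, instance, weight, totals)
    return totals


def predict(tree, instance):
    """Return the class with the largest total contribution."""
    totals = class_weights(tree, instance)
    return max(totals, key=totals.get)


# Hypothetical tree: each branch stores (fraction of cases, sub-tree).
TREE = {"feature": "outlook", "branches": {
    "sunny": (0.4, {"label": "no"}),
    "rainy": (0.6, {"feature": "windy", "branches": {
        True: (0.5, {"label": "no"}),
        False: (0.5, {"label": "yes"}),
    }}),
}}

print(predict(TREE, {"outlook": "rainy", "windy": False}))  # yes
print(predict(TREE, {"windy": False}))                       # yes (0.6 vs 0.4)
print(predict(TREE, {}))                                     # no  (0.7 vs 0.3)
```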
The second difference regards the set U of used features which, at line 8, is possibly
pruned by the call greedyδ(U, R, S). This function tries to relax the pruning of feature
subsets in formula (2) in order to include additional attributes. Let S = {a1, ..., an}. In
particular, we aim at finding a minimal set W = {a1, ..., ak} ⊆ S ∩ U such that:
    ∀ V ∈ (R ∪ {a1, ..., ak}) ⋈ Pow(S \ {a1, ..., ak}) : eacc ≤ err(DT(V)) + δ.    (3)
If such a condition holds⁵, we can prune from the search space the feature subsets in the
quantifier of the condition, because even in the case that a feature subset with the global
best error is pruned this way, the best error found so far eacc is within the δ bound from
the error of the pruned decision tree. Practically, this means that we can continue using W
in the place of U in the rest of the search; in fact, lines 9-15 of DTacceptδ coincide
with lines 6-12 of DTdistinct. The search space is pruned if W ⊂ U, because the loop
at lines 11-15 iterates over a smaller set of features. The approach that greedyδ adopts
for determining {a1, ..., ak} is a greedy one, which tries to remove one candidate feature
from S ∩ U at a time. The order of removal is by⁶ rk_frontier(T), as in the main loop of
DTdistinct. The function greedyδ relies on a lower-bound function lberr() for which the
test condition⁷ eacc ≤ lberr(R, {a1, ..., ak}) + δ is required to imply that (3) holds. This is
obviously true when:
    lberr(R, {a1, ..., ak}) ≤ err(DT(V)) for every V quantified over in (3),    (4)
hence the adjective "lower-bound" for the function lberr(). Our lower-bound function is
defined as follows for a candidate {a1, ..., ak}. We start from the decision tree
T = DT(R ∪ S) = DT(R ∪ (S ∩ U)) and remove the sub-trees in T rooted at frontier nodes
with split features in (S ∩ U) \ {a1, ..., ak}. Let T′ be the partial tree obtained, and call
the removed frontier nodes the "to-be-expanded" nodes. The features used in T′ belong to
R ∪ {a1, ..., ak}, which is included in any V quantified over in (4). Due to Assumptions 4-5,
any tree DT(V) may differ from T only at the to-be-expanded nodes and their sub-trees,
i.e., T′ is a sub-tree of DT(V). At best, the error of the sub-trees in DT(V) that expand T′
will be zero⁸. Thus, we have err(T′) ≤ err(DT(V)), where err(T′) adds 0 as estimated error
at the to-be-expanded nodes. Since T′ is defined starting only from T, and not from any
specific V, we then define lberr(R, {a1, ..., ak}) = err(T′) and have that (4) holds. In
summary, we have the following result.
5. A direct extension of (2) is to require err(DT(R ∪ S)) ≤ err(DT(V)) + δ, i.e., that the error of the last
built tree is within the δ bound from the error of any tree over quantified attributes V. The relaxed
condition (3) allows for a better pruning, namely that the best error found so far is within the δ bound.
6. The rationale is to try to remove first the features that lead to minimal changes in the decision tree and,
consequently, to minimal differences in its misclassification error. This is a direct generalization of the
approach of DTdistinct of removing unused features, which lead to no change in misclassification.
7. Cf. line 4 of Algorithm 4, where {a1, ..., ak} is S ∩ Ŵ.
8. Consider the wrapper model, namely error is estimated as the empirical misclassification error over a
search set. Since decision tree error is never lower than the Bayes error (Devroye et al., 1996), a
better lower bound can be obtained at each to-be-expanded node as: 1 − (1/|I|) Σ_{x∈I} g(x). Here, I
includes the instances in the search set reaching the to-be-expanded node, and g(x) = max_c k_xc / n_x,
where n_x is the number of instances in I whose predictive attribute values are the same as for x, and
k_xc is the number of such instances that also have class value equal to c. Unfortunately, the
computational cost of calculating this lower bound function makes it impractical.
[Figure 6: DTacceptδ elapsed times and estimated errors on the Adult dataset (IG split
criterion). (a) Elapsed time (secs). (b) Estimated error.]
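The lower-bound function lberr() described above can be sketched under the wrapper model as the misclassification rate on the search set of a partial tree in which every to-be-expanded node counts zero errors. The code below is a minimal illustration, not the paper's C++ implementation: the dictionary-based tree, the toy search set, and the convention that `kept` stands for R ∪ {a1, ..., ak} are assumptions, and an instance is treated as reaching a to-be-expanded node as soon as its path meets a split feature outside `kept`.

```python
def lberr(tree, kept, search_set):
    """Error of the partial tree T' with zero error at to-be-expanded nodes.

    `kept` plays the role of R ∪ {a1, ..., ak}; nodes splitting on any other
    feature are treated as to-be-expanded and contribute no error.
    """
    errors = 0
    for x, y in search_set:
        node = tree
        to_be_expanded = False
        while "feature" in node:
            if node["feature"] not in kept:
                to_be_expanded = True     # optimistic bound: assume zero error here
                break
            node = node["children"][x[node["feature"]]]
        if not to_be_expanded and node["label"] != y:
            errors += 1
    return errors / len(search_set)


def leaf(label):
    return {"label": label}


# Hypothetical tree: a1 at the root, a2 and a3 below it, a4 below a3.
TREE = {"feature": "a1", "children": {
    0: {"feature": "a2", "children": {0: leaf("n"), 1: leaf("p")}},
    1: {"feature": "a3", "children": {
        0: leaf("n"),
        1: {"feature": "a4", "children": {0: leaf("p"), 1: leaf("n")}},
    }},
}}

# Hypothetical search set of (instance, class) pairs.
SEARCH_SET = [
    ({"a1": 0, "a2": 0, "a3": 0, "a4": 0}, "p"),
    ({"a1": 0, "a2": 1, "a3": 0, "a4": 0}, "p"),
    ({"a1": 1, "a2": 0, "a3": 0, "a4": 0}, "p"),
    ({"a1": 1, "a2": 0, "a3": 1, "a4": 0}, "n"),
    ({"a1": 1, "a2": 0, "a3": 1, "a4": 1}, "n"),
]

print(lberr(TREE, {"a1", "a2", "a3", "a4"}, SEARCH_SET))  # 0.6, the error of the full tree
print(lberr(TREE, {"a1", "a2", "a3"}, SEARCH_SET))        # 0.4, sub-tree at a4 removed
print(lberr(TREE, {"a1", "a2"}, SEARCH_SET))              # 0.2, sub-tree at a3 removed
```

Shrinking the kept set can only turn counted errors into zeros, so the bound is non-increasing as more sub-trees are removed, which is the behaviour the greedy test relies on.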
Example 9 Consider again the sample decision tree T in Figure 3(a). Assume R = ∅ and
S = S ∩ U = {a1, a2, a3, a4}. Also, let eacc be the best error estimate found so far. greedyδ
will consider removing attributes in the order provided by rk_frontier(T), which is a4, a2, a3, a1
(see Example 4). At the first step, the sub-tree rooted at a4 is tentatively removed, and
replaced with an oracle with zero error estimate, as shown in Figure 3(b). If the estimated
error lb = lberr(∅, {a4}) of such a tree is such that eacc ≤ lb + δ, we can commit the removal
of a4. Assume this is the case. In the next step, the sub-tree rooted at a2 is also tentatively
removed and replaced with an oracle. Again, if the estimated error lb = lberr(∅, {a4, a2}) of
such a tree is such that eacc ≤ lb + δ, we can commit the removal of a2. Assume this is not
the case, and a2 is not removed. In the third step, the sub-tree rooted at a3 is tentatively
removed and replaced with an oracle, as shown in Figure 3(c). If the estimated error
lb = lberr(∅, {a4, a3}) of such a tree is such that eacc ≤ lb + δ, we can commit the removal
of a3. Assume this is the case. In the last step, we try to remove the sub-tree rooted at a1,
and replace it with an oracle. The estimated error of such a tree is
lb = lberr(∅, {a4, a3, a1}) = 0. Condition eacc ≤ lb + δ does not hold (otherwise, it would
have been satisfied at the second step as well). In summary, greedyδ returns {a2, a1}, whilst
{a4, a3} can be safely not iterated over at line 11 of Algorithm 3.
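The greedy loop of Algorithm 4 can be transcribed almost literally into Python. In the sketch below the lower bound is passed in as a callback keyed, as at line 4 of Algorithm 4, by the set S ∩ Ŵ of candidate features that are still retained; the numeric values in the table, as well as eacc = 0.15 and δ = 0.02, are invented so that the accept/reject pattern matches the one assumed in Example 9.

```python
def greedy_delta(U, R, S, order, lberr, e_acc, delta):
    """Algorithm 4: greedily drop features from U while the bound test holds."""
    W = set(U)
    for a in order:                       # order given by rk_frontier(T)
        if a not in S or a not in U:      # only features in S ∩ U are candidates
            continue
        W_hat = W - {a}
        if e_acc <= lberr(R, frozenset(S & W_hat)) + delta:
            W = W_hat                     # commit the removal of a
    return W


# Hypothetical lower bounds lberr(R, S ∩ Ŵ), indexed by the retained features.
LB_TABLE = {
    frozenset({"a1", "a2", "a3"}): 0.14,  # a4 tentatively removed
    frozenset({"a1", "a3"}): 0.10,        # a4 and a2 removed
    frozenset({"a1", "a2"}): 0.14,        # a4 and a3 removed
    frozenset({"a2"}): 0.00,              # a4, a3 and a1 removed (root is an oracle)
}


def table_lberr(R, retained):
    return LB_TABLE[retained]


features = {"a1", "a2", "a3", "a4"}
W = greedy_delta(U=features, R=set(), S=features,
                 order=["a4", "a2", "a3", "a1"],
                 lberr=table_lberr, e_acc=0.15, delta=0.02)
print(sorted(W))  # ['a1', 'a2']
```

With these values a4 and a3 are dropped while a2 and a1 are kept, so the caller would iterate only over {a2, a1}, as in the example.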
Example 10 Reconsider Example 5. Figure 6(a) shows the elapsed running times of
DTacceptδ(∅, F), where F is the set of all features of the Adult dataset. Results are
shown for several values of the parameter δ. It is worth noting that, for δ = 0, the elapsed
time is smaller than that of the enumeration of distinct trees by DTdistinct. In fact, when
we search for a best feature subset, DTaccept0 prunes from the search space those (distinct)
decision trees for which the lower bound on the estimated error is higher than the best error
estimate found during the search. Figure 6(b) shows the estimated error (misclassification
error on the search set) of the decision tree built on the features returned by DTacceptδ and
on all features F. The difference between the estimated error of DT(F) and the estimated
error of a best feature subset (δ = 0) provides the range of error estimates that may be
returned by heuristic feature selection approaches adhering to the wrapper model. Notice
that the estimated error of δ-acceptable feature subsets for δ = 0.02 and δ = 0.04 is very
close to the estimated error of a best feature subset (only +2% and +4% respectively).
[Figure 7: SBE and DTsbe elapsed times and estimated errors on the Adult dataset (IG split
criterion), w.r.t. m. Series: all features, SBE, DTsbe, DTaccept0.]
The example shows that the pruning strategy of DTacceptδ is effective in the specific
case of the Adult dataset. In the worst case, however, the search space remains the one of
distinct decision trees, which, for low m values, is exponential in the number of features. In
Section 7, we will test performances on datasets of larger dimensionalities.
As a final note, we observe that our approach can be easily adapted to other variants of
the feature selection problem. One variant consists of regularizing the error estimation with
a penalty ε for every feature in a subset. In such a case, the only changes in Algorithm 3
would be: the test at line 4 becomes err(T) + |U| · ε ≤ eacc; the assignment at line 5 becomes
eacc ← err(T) + |U| · ε. Regarding the lower bound function lberr(R, S ∩ Ŵ) called at line
4 of greedyδ, by adding the penalty |S ∩ Ŵ| · ε we obtain a lower bound on the regularized
error estimation of trees built from subsets V pruned in (4).
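A small numeric illustration of the regularized variant follows. The per-feature penalty is called `eps` here, and the two candidate trees with their estimated errors are hypothetical; the point is only that a slightly less accurate tree using fewer features can win once the penalty is added.

```python
def regularized_error(err, n_features, eps):
    """Estimated error plus a penalty of eps for every feature used."""
    return err + n_features * eps


# Hypothetical candidates: set of used features -> estimated error.
candidates = {
    frozenset({"a1", "a2", "a3", "a4", "a5", "a6"}): 0.150,
    frozenset({"a1", "a2", "a3"}): 0.152,
}


def best_subset(candidates, eps):
    return min(candidates,
               key=lambda U: regularized_error(candidates[U], len(U), eps))


print(sorted(best_subset(candidates, eps=0.0)))    # the six-feature subset wins
print(sorted(best_subset(candidates, eps=0.001)))  # ['a1', 'a2', 'a3']
```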
fier T using the set S of all features, i.e., S = F . For every a ∈ S, a classifier Ta is built using
features in S \ {a}. If no Ta has an estimated error lower than or equal to that of T, the
algorithm stops, returning S as the subset of selected features. Otherwise, the procedure is
repeated after removing â from S, where Tâ is the classifier with the smallest estimated
error. In summary, features are eliminated one at a time in a greedy way. SBE is a black-box
approach: the procedure applies to any type of classifier. A white-box optimization can be
devised for decision tree classifiers that satisfy the assumptions of Section 4.1. Let
U = features(T) be the set of features used in the current decision tree T = DT(S). By
Lemma 6, for a non-used feature a ∈ S \ U, it turns out that Ta = DT(S \ {a}) = DT(S) = T.
Thus, only the trees Ta for a ∈ U need to be built and evaluated at each step, saving the
construction of |S \ U| decision trees. A further pruning can be devised. Let b̂ be the feature
removed at the current step, and let a ∈ U = features(T) be such that b̂ ∉ features(Ta),
where features(Ta) is known from the previous step. By Lemma 6, it turns out that
(Tb̂)a = DT(S \ {b̂, a}) = DT(S \ {a}) = Ta. In summary, the construction of (Tb̂)a can be
skipped if a ∉ features(Tb̂) or if b̂ ∉ features(Ta). The optimizations of incremental tree
building and parallelization discussed for DTdistinct readily apply to this heuristic as
well. We call the resulting white-box optimized algorithm DTsbe.
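The difference between black-box SBE and the two white-box prunings can be sketched as follows. This is an illustrative Python rendition, not the paper's implementation: `build(S)` is an assumed callback returning the estimated error and the set of used features of DT(S), and `toy_build` is an artificial stand-in for a tree inducer (it "uses" the two highest-priority available features) chosen only because it satisfies the property stated by Lemma 6.

```python
def sbe(features, build):
    """Black-box SBE: rebuild a classifier for every candidate removal."""
    S = frozenset(features)
    err, _ = build(S)
    n_built = 1
    while S:
        candidates = []
        for a in sorted(S):
            e, _ = build(S - {a})
            n_built += 1
            candidates.append((e, a))
        best_err, best_a = min(candidates)
        if best_err > err:                # no T_a is at least as good as T: stop
            break
        S, err = S - {best_a}, best_err
    return S, err, n_built


def dtsbe(features, build):
    """White-box variant: skip builds whose result is known by Lemma 6."""
    S = frozenset(features)
    err, used = build(S)
    n_built = 1
    prev, removed = {}, None              # prev[a] = (error, used) of T_a at the last step
    while S:
        current = {}
        for a in sorted(S):
            if a not in used:
                current[a] = (err, used)              # T_a coincides with T
            elif a in prev and removed not in prev[a][1]:
                current[a] = prev[a]                  # (T_b)_a coincides with T_a
            else:
                current[a] = build(S - {a})
                n_built += 1
        best_a = min(current, key=lambda a: (current[a][0], a))
        best_err, best_used = current[best_a]
        if best_err > err:
            break
        S, err, used = S - {best_a}, best_err, best_used
        prev, removed = current, best_a
    return S, err, n_built


# Toy inducer (assumption): uses the two highest-priority available features;
# the error is an integer count that depends only on the features used.
PRIORITY = ["e", "d", "c", "b", "a"]
GAIN = {"e": 20, "d": 15, "c": 5, "b": 10, "a": 0}


def toy_build(S):
    used = frozenset([f for f in PRIORITY if f in S][:2])
    return 50 - sum(GAIN[f] for f in used), used


print(sbe("abcde", toy_build))    # selects {d, e} with error 15 after 15 builds
print(dtsbe("abcde", toy_build))  # same subset and error after only 5 builds
```

Both procedures use the same tie-break (smallest error, then feature name), so they follow the same elimination path; only the number of trees actually built differs.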
Example 11 Figure 7(a) contrasts the elapsed running times of SBE and DTsbe on the
Adult dataset. The efficiency improvement of DTsbe over SBE is consistently in the order
of 2×. Figure 7(b) shows instead their estimated errors (misclassification error on the search
set), which are necessarily the same. The estimates are very close to the estimated error of
a best feature subset provided by DTaccept0.
7. Experiments
7.1. Datasets and Experimental Settings
We perform experiments on 20 small and large dimensional benchmark datasets publicly
available from the UCI ML repository (Lichman, 2013). Some of the datasets have been
used in the NIPS 2003 challenge on feature selection, and are described in detail by Guyon
et al. (2006a). Table 1 reports the number of instances and features of the datasets.
Following Kohavi (1995) and Reunanen (2003), the generalization error of a classifier is
estimated by repeated stratified 10-fold cross validation⁹. Cross-validation is repeated 10
times. At each repetition, the available dataset is split into 10 folds, using stratified random
sampling. Each fold is used to compute the misclassification error of the classifier built on
the remaining 9 folds, which are used as training set. The generalization error is then the
average misclassification error over the 100 classification models (10 models times 10
repetitions). The following classification models will be considered:
DTaccept0: the decision tree built on the feature subset selected by DTaccept0 (i.e., a
best feature subset);
RF: a random forest¹⁰ of 100 decision trees;
DTsbe+RF: a random forest of 100 decision trees, where the available set of features is
restricted to those selected by DTsbe.
9. Cross-validation is a nearly unbiased estimator (Kohavi, 1995), yet highly variable for small datasets.
Kohavi's recommendation is to adopt a stratified version of it. Variability of the estimator is accounted
for by adopting repetitions (Kim, 2009).
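The repeated stratified cross-validation protocol described above can be sketched in plain Python as follows. The helper names, the round-robin way of dealing shuffled instances of each class into folds, and the majority-class toy classifier are assumptions made for illustration; the experimental system of the paper is not reproduced here.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k, rng):
    """Split instance indices into k folds with similar class proportions."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    position = 0
    for y in sorted(by_class):
        indices = by_class[y]
        rng.shuffle(indices)
        for i in indices:                 # deal round-robin to keep folds balanced
            folds[position % k].append(i)
            position += 1
    return folds


def repeated_cv_error(X, y, fit, predict, k=10, repetitions=10, seed=0):
    """Average misclassification error over k * repetitions models."""
    rng = random.Random(seed)
    fold_errors = []
    for _ in range(repetitions):
        for test_idx in stratified_folds(y, k, rng):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(y)) if i not in held_out]
            model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
            wrong = sum(predict(model, X[i]) != y[i] for i in test_idx)
            fold_errors.append(wrong / len(test_idx))
    return sum(fold_errors) / len(fold_errors), len(fold_errors)


# Toy usage with a majority-class classifier on 60 "n" and 40 "p" instances.
def fit_majority(X_train, y_train):
    return Counter(y_train).most_common(1)[0][0]


def predict_majority(model, x):
    return model


X = [[i] for i in range(100)]
y = ["n"] * 60 + ["p"] * 40
error, n_models = repeated_cv_error(X, y, fit_majority, predict_majority)
print(n_models, round(error, 6))  # 100 0.4
```

In the toy run every fold contains 6 "n" and 4 "p" instances, so the majority classifier errs on 4 out of 10 test cases in each of the 100 models.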
Table 1: Experimental datasets, and average elapsed running times (seconds). IG and m=2.
2× - 100×. For large dimensional datasets, the black-box SBE does not even terminate
within a time-out of 1h on the first fold of the first iteration of repeated cross-validation.
This is a relevant result for machine learning practitioners, extending the applicability of
the SBE heuristic, a key reference in feature selection strategies.
Table 1 also reports the elapsed times for DTaccept0. It makes it possible to search for
best feature subsets in a reasonable amount of time for datasets with up to 60 features. In
general, the elapsed time depends on: (1) the size of the search space, which is bounded by
the number of distinct decision trees; and (2) the size of the dataset and the stopping
parameter m, which affect the time to build single decision trees. In turn, (1) depends on
the number of available features and on (2) itself; in fact, decision trees for small datasets or
for large values of m are necessarily small in size. In summary, efficiency of DTaccept0 can
be achieved for datasets with a limited number of features and with either a limited number
of instances or large values of the stopping parameter m. For example, time-out bounds in
Table 1 are reached for a medium-to-large number of features (>60) or for a medium-to-large
number of instances (e.g., the Kr-vs-kp, Census, and Spambase datasets). Notice that while
large values of m can speed up the search, they negatively affect the predictive accuracy of
decision trees (see next subsection).
Finally, Table 1 also includes the elapsed times of building random forests. They are very
low compared to all the other times. In fact, the number of trees in a forest (set to 100) is
much smaller than the number of trees built by sequential backward elimination (quadratic
in the number of features, in the worst case) and by complete search (exponential, in the
worst case).
[Figure 8: four panels, titled "IG", "Ionosphere, IG", "Adult, IG", and "Adult, IG"; axis
labels: "Ratio built/distinct trees" and "Generalization error (%)".]
Best feature subsets found by complete search minimize the estimated error (misclassifica-
tion on the search set), but there is no guarantee that this extends to unseen cases, i.e., to
the generalization error evaluated with cross-validation.
Figures 8(c) and 8(d) show generalization errors on the Adult dataset for decision trees
constructed on all available features (baseline), and on features selected by DTaccept0
(best feature subset), DTaccept0.02 and DTaccept0.04 (0.02 and 0.04-acceptable feature
subsets), and SBE (same as DTsbe). For such a dataset, the best feature subsets have the
best generalization error, the 0.02- and 0.04-acceptable feature subsets are very close to it,
and SBE has a similar performance except for low m values.
Figure 8(b) highlights a different result. For the Ionosphere dataset and large values of m,
the best generalization error is for the baseline, then for SBE, and finally for the best feature
subsets. Ionosphere is a small dataset, hence using all instances for training (particularly
for large values of m) is a better strategy than splitting the training set into building and
search sets for feature selection.
Table 2 reports the estimated errors and the generalization errors for all datasets. Here
and in the following tables, the best method is shown in bold. Other methods are labelled
with “? ” if the null hypothesis in a paired t-test with the best method at 95% significance
level cannot be rejected. Three conclusions can be made from the table.
First, heuristic search performs well in comparison to complete search as far as estimated
error is concerned. In fact, the estimated error of DTsbe (or equivalently, of SBE) is
close to the estimated error of best feature subsets in 3 cases out of 7, and is always better
than the baseline. Only for Ionosphere and Sonar does it depart from the estimated error of
the best feature subsets.
Second, heuristic search generalizes to unseen cases better than baseline and complete
search. Globally, DTsbe wins in 12 cases and it is not statistically different from the winner
in another case. The larger the dimensionality of datasets, the clearer the advantage
of DTsbe over the baseline. Comparing DTsbe and DTaccept0 directly, DTsbe
wins in 2 cases (Ionosphere and Anneal), loses in 2 cases, and is not statistically different
in 3 cases (we count here Hypo, where the baseline wins). Thus we cannot conclude that
complete search leads to better generalization errors than heuristic search.
Third, the difference between estimated and generalization error is greater for heuristic
search than for the baseline, and for complete search compared to heuristic search. In the
case of Sonar, for instance, the differences are considerably large. This implies that
oversearching may produce feature subsets that overfit the data, thus reinforcing the
conclusions of Doak (1992), Quinlan and Cameron-Jones (1995), and Reunanen (2003) that
oversearching increases overfitting.
dom forests are more robust to large dimensionality effects than single decision trees due
to the (random) selection of a logarithmic number of features to be considered in splits at
decision nodes. However, when the number of available features increases, the logarithmic
reduction is not sufficient anymore. It is worth noting that such an experimental analysis
can only be made due to the efficiency improvements of DTsbe over SBE. In fact, from
Table 1, we know that SBE does not terminate within a reasonable time-out.
Table 5: Elapsed running times (seconds) and generalization (cross-validation) errors (mean
± stdev). IG, m=2, max depth = 5, and misclassification on training set as error
estimator in DTsbe. (1) one-hot encoding of discrete features required by OCT.
out of 7. While the differences with DTsbe are statistically significant, they are not large
– about 1-2 features.
Table 4 reports on the right hand side the mean and standard deviation of the sample
Pearson's correlation coefficient over feature subsets used by decision trees¹² built during
cross-validation. The mean value corresponds to the Φ̂_Pearson measure of stability
introduced by Nogueira and Brown (2016). Differently from other measures of stability, it
is unbiased w.r.t. the dimensionality of the dataset. From the table, we can conclude that
the baseline method has the highest stability of selected features (also with the smallest
variance), where the value 1 means that any two distinct folds in cross-validation produce
decision trees that use the same subset of features. However, such stability is obtained
at the expense of a greater number of used features. Feature selection strategies have a
considerably lower stability, in some cases half of the value of the baseline. DTsbe wins
over DTaccept0 in 4 cases out of 7 and is equivalent in another. DTaccept0 has a better
stability only for the 2 lower dimensional datasets. In summary, we can conclude that, in
the case of decision tree classifiers, oversearching increases instability as well.
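As a rough illustration of the stability measure discussed above, the sketch below encodes each feature subset as a 0/1 indicator vector and reports the mean and sample standard deviation of the pairwise Pearson correlations. It is a simplified reading of the measure, not the code used for Table 4: the function names and the toy subsets are assumptions, at least three subsets are needed for the standard deviation, and every subset is assumed to be non-empty and not to contain all features, since the correlation is otherwise undefined.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, stdev


def pearson(u, v):
    """Sample Pearson correlation between two equal-length numeric vectors."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sqrt(sum((a - mu) ** 2 for a in u))
    sv = sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)


def stability(subsets, features):
    """Mean and standard deviation of pairwise correlations of indicator vectors."""
    vectors = [[1 if f in s else 0 for f in features] for s in subsets]
    correlations = [pearson(u, v) for u, v in combinations(vectors, 2)]
    return mean(correlations), stdev(correlations)


features = ["a", "b", "c", "d"]

# Identical subsets in every fold give the maximum stability of 1.
print(stability([{"a", "b"}, {"a", "b"}, {"a", "b"}], features))  # (1.0, 0.0)

# One fold using a different subset lowers the mean correlation to 1/3.
mean_corr, sd_corr = stability([{"a", "b"}, {"a", "b"}, {"a", "c"}], features)
print(round(mean_corr, 3))  # 0.333
```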
12. Stability is calculated on the subset of features used by decision trees, not on the feature subsets selected
by a strategy. This allows for measuring the variability of the baseline approach, which otherwise would
have zero variability, and for contrasting it with the feature selection strategies.
13. OCT parameters: 8 core processes in Julia, max depth = 5, other parameters set to default (min bucket
= 1, local search = true, cp = 0.01). No focusing of parameters performed. Non-binary discrete features
of experimental datasets are processed with one-hot encoding as required by OCT.
8. Conclusions
We have introduced an original pruning algorithm of the search space of feature subsets
which allows for enumerating all and only the distinct decision trees. The lattice of feature
subsets is explored by distinguishing features that must necessarily be used from those that
may possibly be used. The order of the visit is impressively effective, and it relies on an
estimation of the chances of generating distinct trees. Based on the enumeration of distinct
decision trees, we introduced an algorithm for finding δ-acceptable feature subsets,
which depart by at most δ from the best estimated error of decision trees built from any
feature subset. The framework is stated in general terms for any top-down greedy decision
tree induction algorithm, any quality measure used to select split attributes, and any error
estimation function. Coupled with a few computational optimizations and a multi-core
parallel implementation, this makes it possible to investigate properties of complete search
for datasets of up to 60 features. Beyond such a limit, we exploited ideas and optimizations
proposed in the paper to obtain a white-box computational improvement of the
sequential backward elimination heuristic, which extends its practical applicability to large
dimensional datasets. Experimental results reinforce, in the case of decision trees, previous
findings that oversearching increases overfitting and, in addition, they highlight that
oversearching also increases instability. Sequential backward elimination performs better
than complete search over the feature subsets, and also better than the OCT optimal
(non-greedy) decision tree algorithm. It also improves the generalization error of random
forest models for medium-to-large dimensional datasets.
References
M. Aldinucci, S. Ruggieri, and M. Torquati. Decision tree building on multi-core using
FastFlow. Concurrency and Computation: Practice and Experience, 26(3):800–820, 2014.
E. Amaldi and V. Kann. On the approximation of minimizing non zero variables or
unsatisfied relations in linear systems. Theoretical Computer Science, 209:237–260, 1998.
D. Bertsimas and J. Dunn. Optimal classification trees. Machine Learning, 106(7):1039–
1082, 2017.
A. Blum and P. Langley. Selection of relevant features and examples in machine learning.
Artificial Intelligence, 97(1-2):245–271, 1997.
V. Bolón-Canedo, N. Sánchez-Maroño, and A. Alonso-Betanzos. A review of feature
selection methods on synthetic data. Knowledge and Information Systems, 34(3):483–519,
2013.
L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees.
Wadsworth Publishing Company, 1984.
L. A. Breslow and D. W. Aha. Simplifying decision trees: A survey. The Knowledge
Engineering Review, 12:1–40, 1997.
R. Caruana and D. Freitag. Greedy attribute selection. In Proc. of the Int. Conf. on
Machine Learning (ICML 1994), pages 28–36. Morgan Kaufmann, 1994.
R. Caruana and A. Niculescu-Mizil. An empirical comparison of supervised learning
algorithms. In Proc. of the Int. Conf. on Machine Learning (ICML 2006), volume 148, pages
161–168. ACM, 2006.
T. M. Cover. On the possible ordering on the measurement selection problem. Trans.
Systems, Man, and Cybernetics, 9:657–661, 1977.
M. Dash and H. Liu. Feature selection for classification. Intelligent Data Analysis, 1(1-4):
131–156, 1997.
M. Fernández Delgado, E. Cernadas, S. Barro, and D. Gomes Amorim. Do we need hundreds
of classifiers to solve real world classification problems? Journal of Machine Learning
Research, 15(1):3133–3181, 2014.
L. Fan. Accurate robust and efficient error estimation for decision trees. In Proc. of the Int.
Conf. on Machine Learning (ICML 2016), volume 48 of JMLR Workshop and Conference
Proceedings, pages 239–247, 2016.
I. Guyon, S. Gunn, A. Ben-Hur, and G. Dror. Design and analysis of the NIPS2003
challenge. In I. Guyon, M. Nikravesh, S. Gunn, and L. A. Zadeh, editors, Feature Extraction:
Foundations and Applications, pages 237–263. Springer, 2006a.
G. H. John, R. Kohavi, and K. Pfleger. Irrelevant features and the subset selection problem.
In Proc. of the Int. Conf. on Machine Learning (ICML 1994), pages 121–129. Morgan
Kaufmann, 1994.
J.-H. Kim. Estimating classification error rate: Repeated cross-validation, repeated
hold-out and bootstrap. Computational Statistics & Data Analysis, 53(11):3735–3745, 2009.
R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model
selection. In Proc. Int. Joint Conf. on Artificial Intelligence (IJCAI 1995), pages 1137–
1145. Morgan Kaufmann, 1995.
R. Kohavi and G. H. John. Wrappers for feature subset selection. Artificial Intelligence, 97
(1-2):273–324, 1997.
H. Liu and L. Yu. Toward integrating feature selection algorithms for classification and
clustering. IEEE Transactions on Knowledge and Data Engineering, 17(4):491–502, 2005.
H. Liu, H. Motoda, and M. Dash. A monotonic measure for optimal feature selection. In
Proc. of the European Conf. on Machine Learning (ECML 1998), volume 1398 of Lecture
Notes in Computer Science, pages 101–106. Springer, 1998.
S. Nogueira and G. Brown. Measuring the stability of feature selection. In Proc. of Machine
Learning and Knowledge Discovery in Databases (ECML-PKDD 2016) Part II, volume
9852 of LNCS, pages 442–457, 2016.
J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA,
1993.
S. Ruggieri. Efficient C4.5. IEEE Transactions on Knowledge and Data Engineering, 14:
438–444, 2002.
S. Ruggieri. YaDT: Yet another Decision tree Builder. In Proc. of Int. Conf. on Tools with
Artificial Intelligence (ICTAI 2004), pages 260–265. IEEE, 2004.
S. Ruggieri. Subtree replacement in decision tree simplification. In Proc. of the SIAM Conf.
on Data Mining (SDM 2012), pages 379–390. SIAM, 2012.
S. Ruggieri. Enumerating distinct decision trees. In Proc. of the Int. Conf. on Machine
Learning (ICML 2017), number 70 in JMLR Workshop and Conference Proceedings,
pages 2960–2968, 2017.
B. Twala. An empirical comparison of techniques for handling incomplete data using
decision trees. Applied Artificial Intelligence, 23(5):373–405, 2009.
S. Verwer and Y. Zhang. Learning decision trees with flexible constraints and objectives
using integer optimization. In Proc. of Int. Conf. on Integration of AI and OR Techniques
in Constraint Programming (CPAIOR 2017), volume 10335 of Lecture Notes in Computer
Science, pages 94–103. Springer, 2017.