One of the main advantages of this type of constraint (see Figure
5), according to Piltaver et al. [
2016], is that it directly influences the understandability of the decision tree. In what follows, we present approaches developed to enforce structure-level constraints in the learning algorithm and their impact on trees. Section
3.1.1 analyses methods that enforce the size of trees as structural constraints, Section
3.1.2 discusses methods related to the depth constraint, and Section
3.1.3 details methods used to restrict the number of leaves.
3.1.1 Size of the Tree.
As defined in Section
2.1, the size of a decision tree is the number of nodes
\( |V| \) of the tree and is related to the readability of the entire tree [Piltaver et al.
2016]. Several works in the literature, presented hereafter, attempt to impose the size constraint on decision trees. We classify them into three types of methods (see Section
4 for a detailed discussion about optimisation methods): top-down greedy, safe enumeration, and linear (and constraint) programming methods.
Top-down Greedy Algorithms. Top-down greedy algorithms aim to optimise a local heuristic. Here, we focus on approaches that try to learn, in a top-down fashion, trees that do not exceed a maximum number of nodes. Quinlan and Rivest [
1989] propose one of the first works in this direction by introducing the minimum description length (MDL) principle. The MDL principle comes from information theory and consists in adding a prior to the optimal tree formulation so as to obtain the maximum a posteriori (MAP) tree. This prior represents a belief about the encoding length of the tree. Since learning the exact MAP tree is NP-complete, they propose an approximation based on a greedy top-down method. One might naively think that it is sufficient to learn a complex (accurate) tree and then prune it to satisfy the size constraint. In contrast to this approach, Garofalakis et al. [
2000] introduce a tree algorithm that
pushes the size constraint into the building phase of the tree. The algorithm estimates a lower bound on the inaccuracy when deciding, in a top-down fashion, whether to split a node. Interestingly, this algorithm can find an optimal tree given a maximum number of nodes, or the other way around: given a target accuracy, find the smallest tree. The minimum accuracy or the maximum size has to be specified by the user.
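To make the idea of pushing a size constraint into the construction phase concrete, the following minimal Python sketch builds a binary tree greedily while charging every created node against a global node budget. It is only an illustration of the general principle, not the algorithm of Garofalakis et al.: in particular, it expands nodes depth-first and does not use their inaccuracy lower bounds, and all names (Node, gini, best_split, fit_tree) are ours.

import numpy as np
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    label: int                      # majority class stored at every node
    feature: Optional[int] = None   # split feature (None for a leaf)
    threshold: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    # Exhaustively pick the (feature, threshold) pair with the lowest weighted Gini.
    best = None
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f])[:-1]:
            mask = X[:, f] <= t
            score = (mask.sum() * gini(y[mask]) + (~mask).sum() * gini(y[~mask])) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def grow(X, y, budget):
    # budget is a one-element list used as a mutable counter of nodes still allowed.
    node = Node(label=int(np.bincount(y).argmax()))
    if gini(y) == 0.0 or budget[0] < 2:   # pure node, or no room left for two children
        return node
    split = best_split(X, y)
    if split is None:
        return node
    budget[0] -= 2                        # reserve the two children before recursing
    _, node.feature, node.threshold = split
    mask = X[:, node.feature] <= node.threshold
    node.left = grow(X[mask], y[mask], budget)
    node.right = grow(X[~mask], y[~mask], budget)
    return node

def fit_tree(X, y, max_nodes):
    return grow(X, y, budget=[max_nodes - 1])   # the root consumes one node up front

Calling fit_tree(X_train, y_train, max_nodes=15) then returns a tree with at most 15 nodes, which is exactly the behaviour the size constraint asks for; real algorithms differ mainly in how they decide which node deserves the remaining budget.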
Several works have also proposed to enforce the size constraint on decision trees used as proxy models. Proxy models [Gilpin et al.
2018] are machine learning models that are used for approximating and explaining predictions of black-box models (for instance, an SVM, a neural network, or a random forest). An early work is TREPAN [Craven and Shavlik
1995], which tries to approximate a neural network by a decision tree. To control the comprehensibility of the tree, TREPAN can accept a constraint on the number of internal nodes. Also, the learned rules are
\( m \)-of-\( n \) rules, which are chosen to maximise the information gain ratio of C4.5. Boz [
2002] uses genetic algorithms to find interesting inputs of the neural network considered as a black-box classifier, and thereafter learns decision trees via a C4.5-like algorithm. The size of the final tree, controlled by the user, is enforced through post-pruning. Yang et al. [
2018] recently proposed
GIRP (global model interpretation via recursive partitioning). GIRP also uses a CART-like algorithm to learn binary decision trees from a contribution matrix that represents the contributions of the input variables [Choi et al. 2016; Ribeiro et al. 2016b] of a black-box machine learning model. To control the size of the tree, the authors use a pruning mechanism that adds the size of the tree as a penalising term to the average gain of the tree.
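Although the exact criterion differs from one work to the next, such size penalties all follow the same cost-complexity pattern; a generic (not GIRP-specific) form of the pruned-tree selection is

\[ T^{*} = \operatorname*{arg\,min}_{T' \subseteq T} \; \mathrm{err}(T') + \alpha \, |V_{T'}| , \]

where \( T' \) ranges over the pruned subtrees of the learned tree \( T \), \( \mathrm{err} \) is the training (or validation) error, \( |V_{T'}| \) is the number of nodes, and \( \alpha \ge 0 \) controls how strongly the size is penalised (GIRP applies its penalty to the average gain rather than to the error).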
In summary, top-down greedy algorithms that aim to learn decision trees under a size constraint have the advantage of leveraging well-known pruning methods and ultimately learn decision trees that meet the size constraint (even if they may grow large at first). Proxy-model approaches often apply these top-down greedy methods to control the size of the tree and thus obtain clearer explanations of a black-box machine learning model. However, despite the ease of designing a top-down algorithm, these works suffer from the potential sub-optimality of the solution, which is inherently due to the use of a (local) heuristic.
Safe Enumeration Methods. Safe enumeration methods are designed to enumerate all (or a subset of) possible trees by identifying possible splitting rules with specific attention to complexity. This allows the exploration of richer trees in terms of accuracy or constraint satisfaction.
In this direction, Bennett and Blue [
1996] propose an algorithm, called global tree optimisation, which uses multivariate splits and models the decision tree encoding as disjunctive inequalities over a fixed structure. After showing that various types of objective functions can be used, they present a search method (called extreme point tabu search) based on tabu search [Glover and Laguna 1998] (a heuristic method that performs a local search over the search space by examining the immediate neighbours of a solution) to heuristically find a good solution. They also show that it is possible to solve the problem with the Frank-Wolfe algorithm and the simplex method. While the former has the disadvantage of getting stuck in local optima, the latter is costly in terms of computation.
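As a reminder of the mechanism underlying this family of methods, the sketch below shows generic tabu search, where a short memory of recently visited solutions prevents the local search from cycling; it is a textbook version, not the extreme point variant of Bennett and Blue, and neighbours and score are caller-supplied placeholders.

from collections import deque

def tabu_search(initial, neighbours, score, n_iter=1000, tabu_size=50):
    best = current = initial
    tabu = deque([initial], maxlen=tabu_size)   # short-term memory of visited solutions
    for _ in range(n_iter):
        candidates = [s for s in neighbours(current) if s not in tabu]
        if not candidates:
            break
        current = min(candidates, key=score)    # move to the best non-tabu neighbour,
        tabu.append(current)                    # even if it is worse than the current one
        if score(current) < score(best):
            best = current
    return best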
Garofalakis et al. [
2003] propose another approach, based on a branch-and-bound algorithm, to add knowledge constraints (size of the tree, inaccuracy cost) into the learning phase. The authors also use a dynamic programming algorithm whose goal is to prune a large and accurate tree so that it satisfies the constraints (here, the size constraint). Struyf and Džeroski [
2006] extend previous pruning methods to learn multi-objective regression trees with size and accuracy constraints. Fromont et al. [
2007] use the analogy with itemset mining to learn decision trees with constraints (size of the tree, errors of the tree, syntactic constraints, i.e., the way attributes are ordered). They propose two methods: CLUS, which is a greedy method, and CLUS-EX, which is based on an exhaustive search that enumerates possibilities, expecting that the user constraints are restrictive enough to limit the search space. Nijssen and Fromont [
2007] present an algorithm called DL8 for optimal decision trees using dynamic programming. A more general framework [Nijssen and Fromont
2010] is later given by the same authors; it uses an itemset mining approach to learn decision trees under various types of structure-level constraints (size of the tree, number of leaf nodes, etc.) and data-level constraints (see Section 3.3). Also based on dynamic programming, this extended version of DL8 requires enough memory to store the entire lattice of itemsets.
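The dynamic programming at the core of DL8 can be summarised by the following simplified recurrence (our notation, ignoring the additional constraints the framework supports): writing \( I \) for the itemset of tests on the path to a node and \( \mathrm{best}(I) \) for the lowest error achievable below that node,

\[ \mathrm{best}(I) = \min\Big( \mathrm{leaf\_error}(I), \; \min_{f \notin I} \; \mathrm{best}(I \cup \{f\}) + \mathrm{best}(I \cup \{\lnot f\}) \Big). \]

Because \( \mathrm{best} \) is memoised on itemsets, identical sub-problems reached through different paths are solved only once, which explains both the efficiency of the approach and its memory requirement.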
This section focused on methods that aim at enumerating all (or a subset of) possible trees to enforce the size constraint. While these methods have the advantage of finding an optimal solution in certain circumstances, they can become costly to use. In particular, when the space of possible trees is large and/or complex (e.g., when the number of features is large), these methods can fail to provide a solution within a reasonable time.
Linear, SAT, and Constraint Programming Formulations. Instead of safely enumerating consistent trees, other methods prefer to formalise the tree encoding in terms of variables and constraints and use a solver to get an optimal tree that satisfies fixed or bounded structures in terms of their characteristics (for example, fixed number of nodes and/or leaves). Bessiere et al. [
2009] propose a method to find decision trees with minimum size using constraint programming and
integer linear programming (ILP). By presenting a SAT-based encoding of a decision tree, they express all the constraints that a decision tree must satisfy. Translating the problem into linear constraints over integer variables makes it possible to use ILP to explore richer and smaller trees. However, the computational time remains high. To speed up the search, Narodytska et al. [
2018] propose another SAT-based encoding for optimal decision trees based on tree size. The heart of their method is to consider a perfect (error-free) binary classification, where the features selected at nodes and the validity of the tree topology are modelled with SAT formulae. They search for the smallest decision tree that perfectly classifies the examples in the dataset.
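In practice, such SAT-based methods are typically driven by an outer loop that asks a decision question ("is there a perfect tree with \( k \) nodes?") for increasing \( k \); the skeleton below illustrates this driver only, with encode_perfect_tree, sat_solve, and decode_tree as hypothetical placeholders rather than the actual encoding of Narodytska et al.

def smallest_perfect_tree(data, max_nodes=31):
    # Binary trees in which every split has two children have an odd number of nodes.
    for k in range(1, max_nodes + 1, 2):
        formula = encode_perfect_tree(data, num_nodes=k)   # hypothetical encoder
        model = sat_solve(formula)                         # hypothetical SAT oracle
        if model is not None:                              # satisfiable: a tree of size k exists
            return decode_tree(model)                      # hypothetical decoder
    return None                                            # no perfect tree within the bound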
Again, the above methods seek optimal solutions; however, depending on the chosen formalisation, different types of solvers have to be used. SAT solvers may be very efficient but are limited to propositional formulae, while CP solvers can handle more complex problems but have difficulty scaling to large search spaces.
Summary about the Tree Size Constraint. To conclude, constraining the size of trees helps to control their complexity, making them more easily understandable and readable. The problem of retrieving decision trees optimised with respect to their size and accuracy has been extensively explored using local search through heuristics, enumeration, and constraint programming approaches. To effectively control the size of the tree, the last two approaches generally assume trees to be binary, even for non-binary categorical variables. This assumption greatly reduces the search space. Top-down greedy approaches, however, do not need this assumption.
3.1.2 Depth of the Tree.
The depth constraint is important for controlling overfitting as well as the comprehensibility of decision trees. It usually takes the form of learning a decision tree with a given maximum depth.
Top-down Greedy Methods. Diverse algorithms have been proposed to learn interpretable and proxy decision trees under depth constraints. Trying to approximate a neural network, Zilke et al. [2016] propose a constrained and more elaborate version of CRED [Sato and Tsukimoto 2001] (a method that learns decision trees to interpret the predictions of a neural network decomposed into hidden units). This version extracts rules for each hidden unit and approximates these local decisions by a decision tree with a depth
\( K \le 10 \) using a modified version of C4.5.
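The depth cap is also the structural constraint most readily available in off-the-shelf libraries; as a simple illustration of how a depth-constrained proxy tree can be fitted on a black box's predictions (not the procedure of Zilke et al.; black_box and X_train are placeholders for the user's model and data), one can write:

from sklearn.tree import DecisionTreeClassifier

# Surrogate labels: the black-box model's own predictions on the training inputs.
surrogate_labels = black_box.predict(X_train)          # placeholder black-box model
proxy = DecisionTreeClassifier(max_depth=10)           # mirrors the K <= 10 bound above
proxy.fit(X_train, surrogate_labels)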
Safe Enumeration Methods. Enumeration-based methods have also been proposed to learn optimal trees (in terms of classification error). For example, the T2 [Auer et al.
1995] and T3 [Tjortjis and Keane
2002] algorithms find optimal trees with a maximum depth of 2 and 3, respectively, using a careful exhaustive search based on agnostic learning. The authors of T2 are among the few who propose a constraint-based tree learning algorithm and theoretically analyse the computational time complexity and the guarantees of the learning algorithm. T3C [Tzirakis and Tjortjis 2017] is an improved version, which changes the way T3 splits continuous attributes into four decision rules.
To enforce the depth constraint in DL8 (presented in Section
3.1.1), Aglin et al. [
2020a] introduce DL8.5. This algorithm uses a branch-and-bound search with caching to safely enumerate trees under the depth constraint. However, unlike DL8, DL8.5 cannot enforce test cost constraints [Nijssen and Fromont
2010].
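A stripped-down Python sketch of this style of search is given below: a depth-limited exhaustive recursion over binary features, memoised on the subset of examples reaching a node. It captures the caching idea but omits the branch-and-bound lower bounds that make DL8.5 fast; the function names and the caching key are ours.

def best_error(X, y, examples, depth, cache):
    """Smallest number of misclassifications achievable on `examples`
    (a frozenset of row indices) with a tree of depth at most `depth`."""
    key = (examples, depth)
    if key in cache:
        return cache[key]
    counts = {}
    for i in examples:
        counts[y[i]] = counts.get(y[i], 0) + 1
    error = len(examples) - max(counts.values(), default=0)   # error of a majority-class leaf
    if depth > 0:
        for f in range(len(X[0])):                            # try every binary feature
            left = frozenset(i for i in examples if X[i][f] == 0)
            right = examples - left
            if not left or not right:                         # useless split
                continue
            split_error = (best_error(X, y, left, depth - 1, cache)
                           + best_error(X, y, right, depth - 1, cache))
            error = min(error, split_error)
    cache[key] = error
    return error

# Usage: best_error(X, y, frozenset(range(len(y))), depth=3, cache={})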
Linear, SAT, and Constraint Programming Methods. Seeking richer and more accurate trees, Verwer and Zhang [2017] present a formulation of the optimal decision tree with a given depth (i.e., a depth constraint) as an integer linear program. By creating variables that link training instances to their leaf nodes, the authors are able to formalise, in terms of linear constraints, notions that include, but are not limited to, the selection of features at internal nodes and the splitting rules. Furthermore, using this formulation, a tree learned by existing algorithms such as CART can be used as a starting solution for the mixed integer programming (MIP) problem. The authors use CPLEX as the MIP solver. In a more general framework, Bertsimas and Dunn [
2017] give a new formulation of optimal classification decision trees (called OCT) as a MIP problem. The authors also propose an adaptation to handle multivariate splits. They define ancestors of nodes and divide them into two categories: left and right ones. Considering only the left ancestors helps the authors formalise decision rules and break a symmetry. They specify all the tree consistency constraints as linear constraints that can be pushed to the CPLEX solver to get the optimal tree. The authors claim to outperform greedy top-down methods, yet the method needs to start from an “acceptably good” solution to reduce computation time. To overcome the computational time problem of OCT, Firat et al. [
2020] propose an ILP formulation based on paths for trees with a fixed depth. While using column generation methods or variable pricing with CART to speed up the computations, their formulation allows defining flexible objectives coupled with a regularisation term based on the number of leaf nodes. Aghaei et al. [
2021] translate the OCT model into a maximum flow problem, which is optimised with a MIP solver. Although this maximum flow formulation accelerates the optimisation, it only works with binary features and classes, unlike the general OCT.
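To give a flavour of these MIP formulations (a simplified sketch in our notation, not the exact OCT or flow model): the depth-\( D \) tree skeleton is fixed in advance, a binary variable \( d_t \) says whether internal node \( t \) actually splits, a binary variable \( z_{it} \) assigns training example \( i \) to leaf \( t \), routing constraints force \( z_{it}=1 \) only if \( x_i \) satisfies every split on the path to \( t \), and the objective trades misclassification against the number of splits:

\[ \min \; \frac{1}{n}\sum_{t \in \mathrm{leaves}} L_t \;+\; \alpha \sum_{t \in \mathrm{internal}} d_t \qquad \text{s.t.} \quad \sum_{t \in \mathrm{leaves}} z_{it} = 1 \;\; \forall i, \]

where \( L_t \) counts the examples assigned to leaf \( t \) that do not carry its predicted label.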
Instead of making computations faster by looking for a warm-start solution, Verwer and Zhang [
2019] propose an algorithm that finds an encoding whose number of decision variables is independent of the size of the dataset. This allows them to introduce a new binary linear program to find optimal decision trees given a fixed depth. Verhaeghe et al. [
2019] instead prefer a constraint programming (CP) model inspired by the itemset mining view underlying DL8 [Nijssen and Fromont
2010].
After noting that the SAT-based encoding of Narodytska et al. [
2018] (see Section
3.1.1) was only applicable to the size constraint, Avellaneda [2020] proposes a novel SAT-based encoding of the depth-constrained optimal decision tree that minimises not only the classification error but also the depth for error-free classification. In addition, this improved version incrementally adds data-related literals and clauses to reduce the computational time and memory requirements. Later, Hu et al. [
2020] introduce a MaxSAT version of Narodytska et al. [
2018] that integrates depth and size constraints to speed up computations using the state-of-the-art MaxSAT solver Loandra.
Summary about the Depth Constraint. In conclusion, methods to learn decision trees under depth constraints are generally based on enumeration, linear programming, and constraint programming. Their goal is primarily to learn more accurate trees and, most importantly, to accelerate computations so as to reach optimal depth-constrained decision trees. Because these methods control the depth of decision trees, they prefer to learn shallow trees (trees with small depth) [Bertsimas and Dunn 2017; Firat et al. 2020] to improve both accuracy and interpretability. Nevertheless, particular attention must be paid to the scalability of these methods and to the possible over-simplicity of the learned trees, which we discuss later in this article (see Section
6).
3.1.3 Number of Leaf Nodes.
The number of leaf nodes is an important factor for summarising the decision rules and also for limiting the growth of the tree, which in turn can be important when trying to understand how the model predicts a particular class [Piltaver et al. 2016]. In the case of binary trees, the number of leaf nodes \( L \) is linked to the size of the tree \( |V| \) by the formula \( |V|=2L-1 \): since every internal node has exactly two children, a binary tree with \( L \) leaves has \( L-1 \) internal nodes, hence \( |V|=(L-1)+L \). In the general (non-binary) case, there is no such explicit formula. However, even though they are linked by this equality, \( L \) and \( |V| \) do not relate to the same aspects of the explainability of decision trees. The size of the tree is related to the readability of the tree, while the number of leaf nodes is essential for the comprehensibility of the predictions of a given class. Very few studies have focused on constraining the number of leaf nodes.
By drawing inspiration from the work of Angelino et al. [
2018] on optimal rule lists, Hu et al. [
2019] propose
optimal sparse decision trees (OSDT), which use a branch-and-bound search for binary classification. Analytic bounds are used to prune the search space, while the number of leaves is constrained using a regularised loss function that balances accuracy and the number of leaves. Thanks to the use of a structural empirical minimisation scheme and analytic bounds, OSDT and its extended version called GOSDT [Lin et al. 2020] efficiently learn trees whose structures are very sparse and therefore likely to generalise well.
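Concretely, the regularised objective optimised by OSDT-style methods has the form

\[ R(T) \;=\; \ell(T, \mathbf{X}, \mathbf{y}) \;+\; \lambda \, H_T , \]

where \( \ell \) is the misclassification loss on the training data, \( H_T \) is the number of leaves of \( T \), and \( \lambda \) is the per-leaf penalty (the notation is ours and slightly simplified); the analytic bounds discard any partially built tree whose objective can provably not drop below that of the best tree found so far.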
Just as linear and constraint programming formulations are used to learn decision trees under size and depth constraints, they can also be used to learn decision trees with a given maximum number of leaf nodes. Namely, to overcome the problem of computation time of MIP solvers in Bertsimas and Dunn [
2017], the work of Menickelly et al. [
2016] proposes another formulation of decision trees for binary classification as an integer programming problem with a predefined size adjusted by the number of leaf nodes. After creating variables that are directly linked to internal nodes, leaf nodes, and attributes, they encode the tree by imposing constraints on these variables. The authors push these constraints to the CPLEX solver with a predefined topology (structure) and choose the best topology through cross-validation. Their method finds the optimal solution but is limited to binary trees with categorical variables only.
In a completely different way, the work of Nijssen [
2008] extends previous work on the DL8 algorithm [Nijssen and Fromont
2007] in a Bayesian setting. It proposes a MAP formulation of the optimal decision tree problem under soft constraints (i.e., constraints that might be violated) on the maximum number of leaf nodes. Based on the link between itemsets and decision trees, the algorithm can find predictions using a single optimal tree or Bayesian predictions for several trees. To incorporate the constraints on the number of leaf nodes (although other types of constraints may be targeted), Chipman et al. [
1998] present a Bayesian approach to learn decision trees. Their main contribution is to propose a prior over the structure of the tree and a stochastic search of the posterior to explore the search space and thereby find some “acceptably” good trees. The search consists in building a Markov chain of trees with the Metropolis-Hastings algorithm, considering the transitions:
grow (i.e., split a terminal node),
prune (i.e., choose a parent of terminal node and prune it),
change (i.e., change the splitting rule of an internal node), and
swap (i.e., swap splitting rules of parent-child pair), all randomly. To circumvent the local optima of the previous Bayesian formulation, Angelopoulos and Cussens [
2005a] exploit
stochastic logic programming (SLP) to integrate informative priors (that try to penalise unlikely transitions). They also extend this to a
tempered version, which improves the convergence and the predictive accuracy [Angelopoulos and Cussens 2005b]. However, even though their posterior predictive usually performs well, the single tree selected as the mode of their Bayesian formulation is less accurate than one learned by greedy algorithms. To interpret Bayesian tree ensembles, inspired by the Bayesian formulation of Chipman et al. [
1998], Schetinin et al. [
2007] propose a new probabilistic interpretation of Bayesian decision trees, whose goal is to find the most confident tree within the ensemble of Bayesian trees.
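The stochastic search shared by these Bayesian approaches can be summarised by the Metropolis-Hastings loop sketched below; propose, log_posterior, and the tree representation are placeholders standing in for a prior-plus-likelihood model in the spirit of Chipman et al., not an implementation of any of the cited works.

import math
import random

MOVES = ["grow", "prune", "change", "swap"]   # the four transition types described above

def sample_trees(initial_tree, data, n_iter=10_000):
    tree = initial_tree
    current_lp = log_posterior(tree, data)             # hypothetical: log prior + log likelihood
    samples = []
    for _ in range(n_iter):
        move = random.choice(MOVES)
        candidate, log_q_ratio = propose(tree, move)    # hypothetical proposal kernel
        candidate_lp = log_posterior(candidate, data)
        # Metropolis-Hastings acceptance: posterior ratio corrected by the proposal ratio.
        if math.log(random.random()) < candidate_lp - current_lp + log_q_ratio:
            tree, current_lp = candidate, candidate_lp
        samples.append(tree)
    return samples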
In conclusion, the works that deal with constraints on the number of leaf nodes are generally based on a probabilistic formulation of decision trees. Bayesian learning seems to be suitable for this kind of constraint. However, the challenge is to effectively model the learning of the structure and the selection of the rules of the decision tree. The aforementioned works [Chipman et al.
1998; Schetinin et al.
2007] learn trees using stochastic search. While having the advantage of integrating constraints in a rigorous and clear mathematical manner with priors, Bayesian formulations of decision trees are also computationally expensive when implemented (see Section
4.4 for more details).
3.1.4 Summary and Discussion about Structure-level Constraints.
To sum up, the presented works focus on making decision trees more readable by constraining trees to be small, or by limiting the number of decision rules or the number of attributes to take into account in the decision tree, since these properties are related to the abstraction capabilities of human beings. Also, since the structural characteristics of a tree are linked to each other, setting one aspect constrains the others, but they all have different impacts on the interpretability of decision trees. While the size controls the readability of the tree, the depth defines how easily a prediction can be interpreted, and the number of leaf nodes gives an idea of how understandable the predictions for a particular class are. Some of the presented works try to find optimal solutions, while others leverage heuristic-based approaches to either explore richer trees or learn more accurate ones compared to traditional methods. Besides, works in the direction of proxy models for black-box classifiers consider structural constraints to make sure that the resulting tree will be easily understandable.
Nevertheless, the tree balance constraint (which also depends on the branching factor of the tree) is understudied in the literature, despite its impact on the readability and therefore the comprehensibility of decision trees. Also, the majority of the works on structure-level constraints assume that decision trees are binary to better handle the number of leaf nodes and the size. This assumption severely lacks flexibility. In some cases, for the sake of interpretability, it would be useful to have nodes with more than two child nodes. For example, if a categorical variable like the number of doors of a car has three values (namely, 2, 4, and 6), then it may be important, for comprehensibility purposes, not to transform this variable into two binary variables, so that the knowledge “number of doors” appears only once in a branch of the tree. Another common assumption is that shallow trees (i.e., trees with small depth) and small trees (i.e., of small size) enhance interpretability and comprehensibility. This question requires further study because, in certain domains such as health care, a decision tree with a maximum depth of two such as in Bertsimas and Dunn [
2017] and Tjortjis and Keane [
2002] may not be comprehensible to experts [Freitas
2014; López-Vallverdú et al.
2007]. Indeed, rules may be too simple to fit domain-expert requirements. To overcome this problem, it may be necessary to add human and expert knowledge as constraints when learning decision trees.