Ensembles of Learning Machines

Giorgio Valentini (1,2) and Francesco Masulli (1,3)

(1) INFM, Istituto Nazionale per la Fisica della Materia, 16146 Genova, Italy
(2) DISI, Università di Genova, 16146 Genova, Italy, valenti@disi.unige.it
(3) Dipartimento di Informatica, Università di Pisa, 56125 Pisa, Italy, masulli@di.unipi.it

Abstract. Ensembles of learning machines constitute one of the main current directions in machine learning research, and have been applied to a wide range of real problems. Despite the absence of a unified theory of ensembles, there are many theoretical reasons for combining multiple learners, and there is empirical evidence of the effectiveness of this approach. In this paper we present a brief overview of ensemble methods, explaining the main reasons why they are able to outperform any single classifier within the ensemble, and proposing a taxonomy based on the main ways base classifiers can be generated or combined together.

1 Introduction

Ensembles are sets of learning machines whose decisions are combined to improve the performance of the overall system. Over the last decade, methods for constructing ensembles of learning machines have been one of the main research areas in machine learning. Although in the literature [86, 129, 130, 69, 61, 23, 33, 12, 7, 37] a plethora of terms, such as committee, classifier fusion, combination, aggregation and others, are used to indicate sets of learning machines that work together to solve a machine learning problem, in this paper we shall use the term ensemble in its widest meaning, in order to include the whole range of combining methods. This variety of terms and specifications reflects the absence of a unified theory of ensemble methods and the youth of this research area. However, the great effort of researchers, reflected by the amount of literature [118, 70, 71] dedicated to this emerging discipline, has achieved meaningful and encouraging results.
Empirical studies have shown that, in both classification and regression problems, ensembles are often much more accurate than the individual base learners that make them up [8, 29, 40], and recently different theoretical explanations have been proposed to justify the effectiveness of some commonly used ensemble methods [69, 112, 75, 3]. The interest in this research area is also motivated by the availability of very fast computers and networks of workstations at a relatively low cost, which allow complex ensemble methods to be implemented and tested on off-the-shelf computer platforms. However, as explained in Sect. 2 of this paper, there are deeper reasons to use ensembles of learning machines, motivated by the intrinsic characteristics of the ensemble methods. This work presents a brief overview of the main areas of research, without claiming to be exhaustive or to explain the detailed characteristics of each ensemble method. The paper is organized as follows. In the next section the main reasons for combining multiple learners are discussed. Sect. 3 presents an overview of the main ensemble methods reported in the literature, distinguishing between generative and non-generative methods, while Sect. 4 outlines some open problems not covered in this paper.

2 Reasons for Combining Multiple Learners

Both empirical observations and specific machine learning applications confirm that a given learning algorithm may outperform all others for a specific problem or for a specific subset of the input data, but it is unusual to find a single expert achieving the best results on the overall problem domain. As a consequence, multiple learner systems try to exploit the locally different behavior of the base learners to enhance the accuracy and the reliability of the overall inductive learning system. There is also the hope that, if some learner fails, the overall system can recover from the error.
The use of multiple learners can derive from the application context, such as when multiple sensor data are available, inducing a natural decomposition of the problem. In more general cases we may have different training sets available, collected at different times and possibly having different features, and we can use a different specialized learning machine for each of them. However, there are deeper reasons why ensembles can improve performance with respect to a single learning machine. As an example, consider the following one given by Tom Dietterich in [28]. If we have a dichotomic classification problem and L hypotheses whose error is lower than 0.5, then the resulting majority voting ensemble has an error lower than that of the single classifier, as long as the errors of the base learners are uncorrelated. In fact, if we have L = 21 classifiers, the error rates of the base learners are all equal to p = 0.3, and the errors are independent, the overall error of the majority voting ensemble is given by the area under the binomial distribution where more than L/2 hypotheses are wrong:

$$P_{error} = \sum_{i=\lceil L/2 \rceil}^{L} \binom{L}{i} p^{i} (1-p)^{L-i} \;\Rightarrow\; P_{error} = 0.026 \ll p = 0.3$$

This result has been studied by mathematicians since the end of the XVIII century in the context of social sciences: in fact the Condorcet Jury Theorem [26] proved that the judgment of a committee is superior to those of individuals, provided the individuals have reasonable competence (that is, a probability of being correct higher than 0.5). As noted in [85], this theorem theoretically justifies recent research on multiple "weak" classifiers [63, 51, 74], representing an interesting research direction diametrically opposite to the development of highly accurate and specific classifiers. This simple example also highlights an important issue in the design of ensembles of learning machines: the effectiveness of ensemble methods relies on the independence of the errors committed by the component base learners.
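The majority voting error in Dietterich's example can be checked numerically. The following is a minimal sketch, assuming an odd number L of base learners with independent errors and a common error rate p, as in the text; the function name `majority_vote_error` is introduced here only for illustration.

```python
from math import ceil, comb


def majority_vote_error(L, p):
    """Probability that at least ceil(L/2) of L independent base learners,
    each wrong with probability p, are wrong at the same time.

    Assumes L is odd, so that ceil(L/2) wrong votes always form a majority.
    """
    first = ceil(L / 2)
    return sum(comb(L, i) * p**i * (1 - p) ** (L - i) for i in range(first, L + 1))


if __name__ == "__main__":
    # Setting of the example in the text: 21 classifiers, error rate 0.3 each.
    print(round(majority_vote_error(21, 0.3), 3))
```

With L = 21 and p = 0.3 the sum evaluates to roughly 0.026, the value reported in the text. The independence assumption is essential: the binomial formula does not apply to correlated errors.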
In this example, if the independence assumption does not hold, we have no assurance that the ensemble will lower the error, and we know that in many cases the errors are correlated. From a general standpoint we know that the effectiveness of ensemble methods depends on the accuracy and the diversity of the base learners, that is, on whether they exhibit low error rates and whether they produce different errors [49, 123, 92]. The related concept of independence between the base learners has been commonly regarded as a requirement for effective classifier combinations, but recent works have shown that independent classifiers do not always outperform dependent ones [84]. In fact there is a trade-off between accuracy and independence: the more accurate the base learners are, the less independent they are.

Learning algorithms try to find a hypothesis in a given space H of hypotheses, and in many cases, if we have sufficient data, they can find the optimal one for a given problem. But in real cases we have only limited data sets, and sometimes only a few examples are available. In these cases the learning algorithm can find different hypotheses that appear equally accurate with respect to the available training data, and although we can sometimes select among them the simplest one or the one with the lowest capacity, we can avoid the problem by averaging or combining them to get a good approximation of the unknown true hypothesis.

Another reason for combining multiple learners arises from the limited representational capability of learning algorithms. In many cases the unknown function to be approximated is not present in H, but a combination of hypotheses drawn from H can expand the space of representable functions, so that it also includes the true one.
Although many learning algorithms have universal approximation properties [55, 100], with finite data sets these asymptotic features do not hold: the effective space of hypotheses explored by the learning algorithm is a function of the available data, and it can be significantly smaller than the virtual H considered in the asymptotic case. From this standpoint ensembles can enlarge the effective coverage of hypotheses, expanding the space of representable functions.

Many learning algorithms apply local optimization techniques that may get stuck in local optima. For instance, inductive decision trees employ a greedy local optimization approach, and neural networks apply gradient descent techniques to minimize an error function over the training data. Moreover, optimal training with finite data is NP-hard both for neural networks and for decision trees [13, 57]. As a consequence, even if the learning algorithm can in principle find the best hypothesis, in practice we may not be able to find it. Building an ensemble using, for instance, different starting points may achieve a better approximation, even if no assurance of this is given.

Another way to look at the need for ensembles is provided by the classical bias-variance analysis of the error [45, 78]: different works have shown that several ensemble methods reduce variance [15, 87] or both bias and variance [15, 39, 77]. Recently the improved generalization capabilities of different ensemble methods have also been interpreted in the framework of the theory of large margin classifiers [89, 113, 3], showing that methods such as boosting and ECOC enlarge the margins of the examples.

3 Ensemble Methods Overview

A large number of combination schemes and ensemble methods have been proposed in the literature. Combination techniques can be grouped and analysed in different ways, depending on the main classification criterion adopted.
If we consider the representation of the input patterns as the main criterion, we can identify two distinct large groups, one that uses the same and one that uses different representations of the inputs [68, 69]. Taking the architecture of the ensemble as the main criterion, we can distinguish between serial, parallel and hierarchical schemes [85], and depending on whether or not the base learners are selected by the ensemble algorithm we can separate selection-oriented and combiner-oriented ensemble methods [61, 81]. In this brief overview we adopt an approach similar to the one cited above, in order to distinguish between non-generative and generative ensemble methods. Non-generative ensemble methods confine themselves to combining a set of given, possibly well-designed base learners: they do not actively generate new base learners but try to combine a set of existing base classifiers in a suitable way. Generative ensemble methods generate sets of base learners by acting on the base learning algorithm or on the structure of the data set, and try to actively improve the diversity and accuracy of the base learners.

3.1 Non-generative Ensembles

This large group of ensemble methods embraces a wide set of different approaches to combining learning machines. They share the very general common property of using a predetermined set of learning machines previously trained with suitable algorithms. The base learners are then put together by a combiner module that may vary depending on its adaptivity to the input patterns and on the requirements on the output of the individual learning machines. The type of combination may depend on the type of output. If only labels are available, or if continuous outputs are hardened, then majority voting, which selects the class most represented among the base classifiers, is used [67, 104, 87].
This approach can be refined by assigning different weights to each classifier to optimize the performance of the combined classifier on the training set [86], or, assuming mutual independence between classifiers, a Bayesian decision rule selects the class with the highest posterior probability computed through the estimated class conditional probabilities and Bayes' formula [130, 122]. A Bayesian approach has also been used in consensus-based classification of multisource remote sensing data [10, 9, 19], outperforming conventional multivariate methods for classification. To overcome the problem of the independence assumption (which is unrealistic in most cases), the Behavior-Knowledge Space (BKS) method [56] considers each possible combination of class labels, filling a look-up table using the available data set, but this technique requires a huge volume of training data. When we interpret the classifier outputs as the support for the classes, fuzzy aggregation methods can be applied, such as simple connectives between fuzzy sets or the fuzzy integral [23, 22, 66, 128]; if the classifier outputs are possibilistic, Dempster-Shafer combination rules can be applied [108]. Statistical methods and similarity measures to estimate classifier correlation have also been used to evaluate expert system combination for a proper design of multi-expert systems [58].

The base learners can also be aggregated using simple operators such as Minimum, Maximum, Average, Product and Ordered Weighted Averaging [111, 18, 80]. In particular, on the basis of a common Bayesian framework, Josef Kittler provided a theoretical underpinning of many existing classifier combination schemes based on the product and the sum rule, showing also that the sum rule is less sensitive to the errors of subsets of base classifiers [69].
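The simple combiners mentioned above (majority voting and the Minimum, Maximum, Average and Product operators) can be illustrated with a short sketch. It assumes that each base classifier returns a list of class supports, for instance estimated posterior probabilities; the function names and the numeric values in `OUTPUTS` are hypothetical and chosen only to show how one badly wrong classifier affects each rule.

```python
from math import prod


def combine(outputs, rule):
    """Combine class supports from several classifiers with a fixed rule.

    outputs[k][c] is the support given by classifier k to class c.
    Returns (index of the winning class, combined score per class).
    """
    rules = {
        "sum": lambda v: sum(v) / len(v),  # average (sum rule)
        "product": prod,
        "min": min,
        "max": max,
    }
    n_classes = len(outputs[0])
    per_class = [[out[c] for out in outputs] for c in range(n_classes)]
    scores = [rules[rule](v) for v in per_class]
    return scores.index(max(scores)), scores


def majority_vote(outputs):
    """Harden each output to a label, then return the most frequent label."""
    labels = [out.index(max(out)) for out in outputs]
    return max(set(labels), key=labels.count)


# Hypothetical supports from three classifiers for a two-class problem.
# The third classifier is confidently at odds with the other two.
OUTPUTS = [[0.9, 0.1], [0.8, 0.2], [0.01, 0.99]]

if __name__ == "__main__":
    for name in ("sum", "product", "min", "max"):
        print(name, combine(OUTPUTS, name)[0])
    print("vote", majority_vote(OUTPUTS))
```

On these made-up values the sum rule and the majority vote follow the two agreeing classifiers, while the product, minimum and maximum rules are dominated by the single extreme output. This is consistent with the observation that the sum rule is less sensitive to the errors of subsets of base classifiers, although a single toy case proves nothing in general.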
Recently Ljudmila Kuncheva has developed a global combination scheme that takes into account the decision profiles of all the ensemble classifiers with respect to all the classes, designing Decision templates that summarize in matrix format the average decision profiles of the training set examples. Different similarity measures can be used to evaluate the matching between the matrix of classifier outputs for an input x, that is, the decision profile of x, and the matrix templates (one for each class) found as the class means of the classifier outputs [81]. This general fuzzy approach produces soft class labels that can be seen as a generalization of the conventional crisp and probabilistic combination schemes.

Another general approach consists in explicitly training combining rules, using second-level learning machines on top of the set of the base learners [34]. This stacked structure makes use of the outputs of the base learners as features in the intermediate space: the outputs are fed into a second-level machine to perform a trained combination of the base learners.

3.2 Generative Ensembles

Generative ensemble methods try to improve the overall accuracy of the ensemble by directly boosting the accuracy and the diversity of the base learners. They can modify the structure and the characteristics of the available input data, as in resampling methods or in feature selection methods; they can manipulate the aggregation of the classes (Output Coding methods); they can select base learners specialized for a specific input region (mixture of experts methods); they can select a proper set of base learners by evaluating the performance and the characteristics of the component base learners (test-and-select methods); or they can randomly modify the base learning algorithm (randomized methods).

Resampling methods

Resampling techniques can be used to generate different hypotheses.
For instance, bootstrapping techniques [35] may be used to generate different training sets, and a learning algorithm can be applied to the obtained subsets of data in order to produce multiple hypotheses. These techniques are especially effective with unstable learning algorithms, that is, algorithms that are very sensitive to small changes in the training data, such as neural networks and decision trees. In bagging [15] the ensemble is formed by making bootstrap replicates of the training sets, and then the multiple generated hypotheses are used to get an aggregated predictor. The aggregation can be performed by averaging the outputs in regression or by majority or weighted voting in classification problems [120, 121].

While in bagging the samples are drawn with replacement using a uniform probability distribution, in boosting methods the learning algorithm is called at each iteration using a different distribution or weighting over the training examples [111, 40, 112, 39, 115, 110, 32, 38, 33, 32, 16, 17, 42, 41]. This technique places the highest weight on the examples most often misclassified by the previous base learners: in this way the base learner focuses its attention on the hardest examples. The boosting algorithm then combines the base rules by taking a weighted majority vote of them. Schapire and Singer showed that the training error drops exponentially with the number of iterations [114], and Schapire et al. [113] proved that boosting enlarges the margins of the training examples, showing also that this fact translates into a superior upper bound on the generalization error. Experimental work showed that bagging is effective with noisy data, while boosting, which concentrates its efforts on noisy data, seems to be very sensitive to noise [107, 29].

Another training set sampling method consists in constructing training sets by leaving out disjoint subsets of the training data, as in cross-validated committees [101, 102], or by sampling without replacement [116].
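The reweighting idea behind boosting can be made concrete with a minimal AdaBoost-style sketch. Several choices here are assumptions made for illustration and are not taken from the cited works: the base learners are one-dimensional threshold rules (decision stumps), labels are coded as +1 and -1, candidate thresholds are placed 0.5 away from the observed integer inputs, and the toy data set `XS`, `YS` is made up.

```python
import math


def stump_predict(x, threshold, polarity):
    """Threshold rule: predict `polarity` below the threshold, else its opposite."""
    return polarity if x < threshold else -polarity


def best_stump(xs, ys, weights):
    """Exhaustively search for the stump with the lowest weighted error."""
    thresholds = [min(xs) - 0.5] + [x + 0.5 for x in sorted(set(xs))]
    best = None
    for t in thresholds:
        for s in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, t, s) != y)
            if best is None or err < best[0]:
                best = (err, t, s)
    return best


def adaboost(xs, ys, rounds):
    """Return a list of (alpha, threshold, polarity) weighted base rules."""
    n = len(xs)
    weights = [1.0 / n] * n  # start from the uniform distribution
    model = []
    for _ in range(rounds):
        err, t, s = best_stump(xs, ys, weights)
        err = min(max(err, 1e-12), 1 - 1e-12)  # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, t, s))
        # Misclassified examples gain weight, correct ones lose weight.
        weights = [w * math.exp(-alpha * y * stump_predict(x, t, s))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model


def predict(model, x):
    """Weighted majority vote of the base rules."""
    score = sum(alpha * stump_predict(x, t, s) for alpha, t, s in model)
    return 1 if score >= 0 else -1


# Made-up data: no single threshold rule separates the two classes.
XS = list(range(10))
YS = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

if __name__ == "__main__":
    model = adaboost(XS, YS, rounds=3)
    print([predict(model, x) for x in XS])
```

On this toy set the best single stump misclassifies three of the ten points, while the weighted vote of three stumps fits all of them. The example only illustrates how the weight distribution shifts towards the hardest examples; it says nothing about generalization or about behaviour on noisy data.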
Another general approach, named Stochastic Discrimination [73, 74, 75, 72], is based on randomly sampling from a space of subsets of the feature space underlying a given problem, then combining these subsets to form a final classifier, using a set-theoretic abstraction to remove all the algorithmic details of classifiers and training procedures. In this approach the classifiers' decision regions are considered only in the form of point sets, and the set of classifiers is just a sample from the power set of the feature space. A rigorous mathematical treatment starting from the "representativeness" of the examples used in machine learning problems leads to the design of ensembles of weak classifiers, whose accuracy is governed by the law of large numbers [20].

Feature selection methods

This approach consists in reducing the number of input features of the base learners, a simple method to fight the effects of the classical curse of dimensionality problem [43]. For instance, in the Random Subspace Method [51, 82], a subset of features is randomly selected and assigned to an arbitrary learning algorithm. In this way, one obtains a random subspace of the original feature space, and constructs classifiers inside this reduced subspace. The aggregation is usually performed using weighted voting on the basis of the base classifiers' accuracy. It has been shown that this method is effective for classifiers having a decreasing learning curve constructed on small and critical training sample sizes [119].

The Input Decimation approach [124, 98] reduces the correlation among the errors of the base classifiers, decoupling the base classifiers by training them with different subsets of the input features. It differs from the Random Subspace Method in that, for each class, the correlation between each feature and the output of the class is explicitly computed, and the base classifier is trained only on the most correlated subset of features.
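The Random Subspace Method described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the base learner is a nearest-centroid classifier (any learning algorithm could be used), the votes are unweighted rather than weighted by accuracy, and all function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def fit_centroids(X, y):
    """Nearest-centroid base learner: one mean vector per class."""
    sums, counts = {}, Counter(y)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def predict_centroid(centroids, row):
    """Return the class whose centroid is closest in squared distance."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(row, centroids[c]))
    return min(centroids, key=dist2)


def fit_random_subspace(X, y, n_learners, subspace_size, seed=0):
    """Train each base learner on a randomly chosen subset of the features."""
    rng = random.Random(seed)
    n_features = len(X[0])
    ensemble = []
    for _ in range(n_learners):
        feats = sorted(rng.sample(range(n_features), subspace_size))
        X_sub = [[row[j] for j in feats] for row in X]
        ensemble.append((feats, fit_centroids(X_sub, y)))
    return ensemble


def predict_ensemble(ensemble, row):
    """Unweighted majority vote over the subspace classifiers."""
    votes = [predict_centroid(cents, [row[j] for j in feats])
             for feats, cents in ensemble]
    return Counter(votes).most_common(1)[0][0]


# Made-up data with four features and two well-separated classes.
X_TOY = [[0, 0, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1],
         [5, 5, 5, 5], [4, 5, 4, 5], [5, 4, 5, 4]]
Y_TOY = [0, 0, 0, 1, 1, 1]

if __name__ == "__main__":
    ens = fit_random_subspace(X_TOY, Y_TOY, n_learners=5, subspace_size=2)
    print(predict_ensemble(ens, [0.5, 0.5, 0.5, 0.5]))
    print(predict_ensemble(ens, [4.5, 4.5, 4.5, 4.5]))
```

Input Decimation would replace the random draw in `fit_random_subspace` with a per-class selection of the features most correlated with the class output; that variant is not shown here.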
Feature subspace methods performed by partitioning the set of features, where each subset is used by one classifier in the team, are proposed in [130, 99, 18]. Other methods for combining different feature sets using genetic algorithms are proposed in [81, 79]. Different approaches consider feature sets obtained by using different operators on the original feature space, such as Principal Component Analysis, Fourier coefficients, Karhunen-Loève coefficients, or others [21, 34]. An experiment with a systematic partition of the feature space, using nine different combination schemes, is performed in [83], showing that there is no "best" combination for all situations and that there is no assurance that in all cases a classifier team will outperform the single best individual.

Mixtures of experts methods

The recombination of the base learners can be governed by a supervisor learning machine that selects the most appropriate element of the ensemble on the basis of the available input data. This idea led to the mixture of experts methods [60, 59], where a gating network performs the division of the input space and small neural networks perform the actual calculation separately in each assigned region. An extension of this approach is the hierarchical mixture of experts method, where the outputs of the different experts are non-linearly combined by different supervisor gating networks that are hierarchically organized [64, 65, 59].

Cohen and Intrator extended the idea of constructing local simple base learners for different regions of the input space, searching for appropriate architectures that should be locally used and for a criterion to select a proper unit for each region of the input space [24, 25]. They proposed a hybrid MLP/RBF network by combining RBF and Perceptron units in the same hidden layer and using forward selection [36] to add units until an error goal is reached.
Although the resulting Hybrid Perceptron/Radial Network is not in a strict sense an ensemble, the way in which the regions of the input space and the computational units are selected and tested could in principle be extended to ensembles of learning machines.

Output Coding decomposition methods

Output Coding (OC) methods decompose a multiclass classification problem into a set of two-class subproblems, and then recompose the original problem by combining them to obtain the class label [94, 90, 28]. An equivalent way of thinking about these methods consists in encoding each class as a bit string (named codeword), and in training a different two-class base learner (dichotomizer) in order to separately learn each codeword bit. When the dichotomizers are applied to classify new points, a suitable measure of similarity between the codeword computed by the ensemble and the codewords of the classes is used to predict the class.

Different decomposition schemes have been proposed in the literature. In the One-Per-Class (OPC) decomposition [5], each dichotomizer f_i has to separate a single class from all others; in the PairWise Coupling (PWC) decomposition [50], the task of each dichotomizer f_i consists in separating a class C_i from a class C_j, ignoring all other classes; the Correcting Classifiers (CC) and the PairWise Coupling Correcting Classifiers (PWC-CC) are variants of the PWC decomposition scheme that reduce the noise originating in the PWC scheme from the processing of non-pertinent information by the PWC dichotomizers [96].

Error Correcting Output Coding [30, 31] is the most studied OC method, and has been successfully applied to several classification problems [1, 11, 46, 6, 126, 131]. This decomposition method tries to improve the error correcting capabilities of the codes generated by the decomposition through the maximization of the minimum distance between each pair of codewords [77, 90]. This goal is achieved by means of the redundancy of the coding scheme [127].
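The decoding step of an output coding scheme, and the role of the minimum distance between codewords, can be illustrated as follows. The sketch assumes Hamming distance as the similarity measure and uses a small hand-written 7-bit code for four classes chosen for illustration; the class names, the code matrix `CODEWORDS` and the function names are not taken from the cited works, and the dichotomizers themselves are not modelled (their outputs are simply given as bit tuples).

```python
# Hypothetical code matrix: one 7-bit codeword per class.
CODEWORDS = {
    "A": (1, 1, 1, 1, 1, 1, 1),
    "B": (0, 0, 0, 0, 1, 1, 1),
    "C": (0, 0, 1, 1, 0, 0, 1),
    "D": (0, 1, 0, 1, 0, 1, 0),
}


def hamming(a, b):
    """Number of positions in which two bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def min_distance(codewords):
    """Minimum Hamming distance over all pairs of codewords."""
    rows = list(codewords.values())
    return min(hamming(rows[i], rows[j])
               for i in range(len(rows)) for j in range(i + 1, len(rows)))


def decode(bits, codewords):
    """Predict the class whose codeword is closest to the computed bits."""
    return min(codewords, key=lambda c: hamming(bits, codewords[c]))


def flip(bits, position):
    """Simulate the error of a single dichotomizer."""
    return tuple(1 - b if k == position else b for k, b in enumerate(bits))


if __name__ == "__main__":
    print(min_distance(CODEWORDS))
    noisy = flip(CODEWORDS["C"], 2)  # the third dichotomizer is wrong
    print(decode(noisy, CODEWORDS))
```

A code with minimum distance d can recover from up to floor((d - 1) / 2) wrong bits under this decoding, so the code above tolerates one dichotomizer error per prediction. A One-Per-Class code (one bit per class) has minimum distance 2 and therefore offers no such guarantee.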
ECOC methods present several open problems. The tradeoff between error recovering capabilities and complexity/learnability of the dichotomies induced by the decomposition scheme has been tackled in several works [3, 125], but an extensive experimental evaluation of the tradeoff has yet to be performed in order to achieve a better understanding of this phenomenon. A related problem is the analysis of the relationship between codeword length and performance: some preliminary results seem to show that long codewords improve performance [46]. Another open problem, not sufficiently investigated in the literature [46, 91, 11], is the selection of optimal dichotomic learning machines for the decomposition unit. Several methods for generating ECOC codes have been proposed: exhaustive codes, randomized hill climbing [31], random codes [62], and Hadamard and BCH codes [14, 105]. The joint maximization of distances between rows and columns in the decomposition matrix is still an open problem. Another open problem consists in designing codes for a given multiclass problem. An interesting greedy approach is proposed in [94], and a method based on soft weight sharing to learn error correcting codes from data is presented in [4]. In [27] it is shown that, given a set of dichotomizers, the problem of finding an optimal decomposition matrix is NP-complete: by introducing continuous codes and casting the design problem of continuous codes as a constrained optimization problem, we can achieve an optimal continuous decomposition using standard optimization methods.

The work in [91] highlights that the effectiveness of ECOC decomposition methods depends mainly on the design of the learning machines implementing the decision units, on the similarity of the ECOC codewords, on the accuracy of the dichotomizers, on the complexity of the multiclass learning problem and on the correlation of the codeword bits.
In particular, Peterson and Weldon [105] showed that if errors on different code bits are dependent, the effectiveness of error correcting codes is reduced. Consequently, if a decomposition matrix contains very similar rows (dichotomies), each error of an assigned dichotomizer will be likely to appear in the most correlated dichotomizers, thus reducing the effectiveness of ECOC. These hypotheses have been experimentally supported by a quantitative evaluation of the dependency among output errors of the decomposition unit of ECOC learning machines using mutual information based measures [92, 93].

Test and select methods

The test and select methodology relies on the idea of selection in ensemble creation [117]. The simplest approach is a greedy one [104], where a new learner is added to the ensemble only if the resulting squared error is reduced, but in principle any optimization technique can be used to select the "best" components of the ensemble, including genetic algorithms [97]. It should be noted that the time complexity of the selection of optimal subsets of classifiers is exponential with respect to the number of base learners used. From this point of view heuristic rules, such as "choose the best" or "choose the best in the class", using classifiers of different types, strongly reduce the computational complexity of the selection phase, as the evaluation of different classifier subsets is not required [103]. Moreover, test and select methods implicitly include a "production stage", in which a set of classifiers must be generated. Different selection methods based on search algorithms borrowed from feature selection methods (forward and backward search) or from the solution of complex optimization tasks (tabu search) are proposed in [109]. Another interesting approach uses clustering methods and a measure of diversity to generate sets of diverse classifiers combined by majority voting, selecting the ensemble with the highest performance [48].
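The greedy test and select rule mentioned above (add a learner only if the squared error of the ensemble decreases) can be sketched for a regression setting. The sketch makes several assumptions that are not stated in the text: candidates are represented only by their predictions on a held-out set, the ensemble output is the plain average of its members, and candidates are examined in order of increasing individual error. The names and the numbers in `TARGET` and `CANDIDATES` are made up.

```python
def mse(pred, target):
    """Mean squared error between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def average(preds):
    """Point-wise average of several prediction vectors."""
    return [sum(column) / len(preds) for column in zip(*preds)]


def greedy_select(candidates, target):
    """Greedily build an ensemble from candidate prediction vectors.

    Returns (indices of the selected candidates, error of their average).
    """
    order = sorted(range(len(candidates)),
                   key=lambda i: mse(candidates[i], target))
    chosen = [order[0]]
    best = mse(candidates[order[0]], target)
    for i in order[1:]:
        trial = mse(average([candidates[j] for j in chosen + [i]]), target)
        if trial < best:  # keep the learner only if the error is reduced
            chosen.append(i)
            best = trial
    return chosen, best


# Made-up held-out targets and predictions of three candidate learners.
TARGET = [1.0, 2.0, 3.0, 4.0]
CANDIDATES = [
    [1.2, 2.2, 3.2, 4.2],  # slightly too high
    [0.8, 1.8, 2.8, 3.8],  # slightly too low
    [2.0, 3.0, 4.0, 5.0],  # far too high
]

if __name__ == "__main__":
    chosen, err = greedy_select(CANDIDATES, TARGET)
    print(sorted(chosen), err)
```

Here the first two candidates have errors of opposite sign that cancel in the average, so both are kept, while the third would only increase the error and is rejected. The greedy pass evaluates a number of subsets linear in the number of candidates, in contrast with the exponential cost of an exhaustive search.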
Finally, Dynamic Classifier Selection methods [54, 129, 47] are based on the definition of a function that selects, for each pattern, the classifier that is probably the most accurate, estimating, for instance, the accuracy of each classifier in a local region of the feature space surrounding an unknown test pattern [47].

Randomized ensemble methods

Injecting randomness into the learning algorithm is another general method to generate ensembles of learning machines. For instance, if we initialize the weights in the backpropagation algorithm with random values, we can obtain different learning machines that can be combined into an ensemble [76, 101]. Several experimental results showed that randomized learning algorithms used to generate the base elements of ensembles improve on the performance of single non-randomized classifiers. For instance, in [29] randomized decision tree ensembles outperform single C4.5 decision trees [106], and adding Gaussian noise to the data inputs, together with bootstrap and weight regularization, can achieve large improvements in classification accuracy [107].

4 Conclusions

Ensemble methods have been shown to be effective in many application domains and can be considered one of the main current directions in machine learning research. We presented an overview of ensemble methods, showing the main areas of research in this discipline and the fundamental reasons why ensemble methods are able to outperform any single classifier within the ensemble. A general taxonomy, distinguishing between generative and non-generative ensemble methods, has been proposed, considering the different ways base learners can be generated or combined together. Several important issues have not been discussed in this paper. In particular, the theoretical problems behind ensemble methods need to be reviewed and discussed in more detail, even if a general theoretical framework for ensemble methods has not been developed.