Ensembles of Learning Machines

Giorgio Valentini 1,2 and Francesco Masulli 1,3
1 INFM, Istituto Nazionale per la Fisica della Materia, 16146 Genova, Italy
2 DISI, Università di Genova, 16146 Genova, Italy, valenti@disi.unige.it
3 Dipartimento di Informatica, Università di Pisa, 56125 Pisa, Italy, masulli@di.unipi.it

Abstract. Ensembles of learning machines constitute one of the main current directions in machine learning research, and have been applied to a wide range of real problems. Despite the absence of a unified theory on ensembles, there are many theoretical reasons for combining multiple learners, and empirical evidence of the effectiveness of this approach. In this paper we present a brief overview of ensemble methods, explaining the main reasons why they are able to outperform any single classifier within the ensemble, and proposing a taxonomy based on the main ways base classifiers can be generated or combined together.

1 Introduction

Ensembles are sets of learning machines whose decisions are combined to improve the performance of the overall system. Over the last decade, one of the main research areas in machine learning has been the development of methods for constructing ensembles of learning machines. Although the literature [86, 129, 130, 69, 61, 23, 33, 12, 7, 37] uses a plethora of terms, such as committee, classifier fusion, combination, aggregation and others, to indicate sets of learning machines that work together to solve a machine learning problem, in this paper we use the term ensemble in its widest meaning, in order to include the whole range of combining methods. This variety of terms and specifications reflects the absence of a unified theory on ensemble methods and the relative youth of this research area. Nevertheless, the considerable research effort, reflected by the literature [118, 70, 71] dedicated to this emerging discipline, has achieved meaningful and encouraging results. Empirical studies have shown that, in both classification and regression problems, ensembles are often much more accurate than the individual base learners that make them up [8, 29, 40], and different theoretical explanations have recently been proposed to justify the effectiveness of some commonly used ensemble methods [69, 112, 75, 3]. Interest in this research area is also motivated by the availability of very fast computers and networks of workstations at a relatively low cost, which allow complex ensemble methods to be implemented and tested on off-the-shelf computer platforms. However, as explained in Sect. 2, there are deeper reasons for using ensembles of learning machines, motivated by the intrinsic characteristics of ensemble methods.

This work presents a brief overview of the main areas of research, without claiming to be exhaustive or to explain the detailed characteristics of each ensemble method. The paper is organized as follows. In the next section the main reasons for combining multiple learners are outlined. Sect. 3 presents an overview of the main ensemble methods reported in the literature, distinguishing between generative and non-generative methods, while Sect. 4 outlines some open problems not covered in this paper.
2 Reasons for Combining Multiple Learners

Both empirical observations and specific machine learning applications confirm that a given learning algorithm may outperform all others for a specific problem or for a specific subset of the input data, but it is unusual to find a single expert achieving the best results on the overall problem domain. As a consequence, multiple learner systems try to exploit the different local behavior of the base learners to enhance the accuracy and the reliability of the overall inductive learning system. There is also hope that, if some base learner fails, the overall system can recover from the error. The use of multiple learners can derive from the application context, as when multiple sensor data are available, inducing a natural decomposition of the problem. In more general cases, different training sets may be available, collected at different times and possibly described by different features, and a different specialized learning machine can be used for each of them.

However, there are deeper reasons why ensembles can improve performance with respect to a single learning machine. As an example, consider the following one given by Tom Dietterich in [28]. If we have a dichotomic classification problem and L hypotheses whose errors are lower than 0.5, then the resulting majority-voting ensemble has an error lower than that of the single classifiers, as long as the errors of the base learners are uncorrelated. In fact, if we have 21 classifiers, the error rates of the base learners are all equal to p = 0.3, and the errors are independent, the overall error of the majority-voting ensemble is given by the area under the binomial distribution where more than L/2 hypotheses are wrong:

P_{error} = \sum_{i=\lceil L/2 \rceil}^{L} \binom{L}{i} p^i (1-p)^{L-i} = 0.026 \ll p = 0.3

This result has been studied by mathematicians since the end of the XVIII century in the context of social sciences: in fact the Condorcet Jury Theorem [26] proved that the judgment of a committee is superior to that of its individual members, provided the individuals have reasonable competence (that is, a probability of being correct higher than 0.5). As noted in [85], this theorem theoretically justifies recent research on multiple "weak" classifiers [63, 51, 74], representing an interesting research direction diametrically opposite to the development of highly accurate and specific classifiers.

This simple example also highlights an important issue in the design of ensembles of learning machines: the effectiveness of ensemble methods relies on the independence of the errors committed by the component base learners. In this example, if the independence assumption does not hold, we have no assurance that the ensemble will lower the error, and we know that in many cases the errors are correlated. From a general standpoint we know that the effectiveness of ensemble methods depends on the accuracy and the diversity of the base learners, that is, on whether they exhibit low error rates and produce different errors [49, 123, 92]. The related concept of independence between the base learners has commonly been regarded as a requirement for effective classifier combination, but recent work has shown that independent classifiers do not always outperform dependent ones [84]. In fact there is a trade-off between accuracy and independence: the more accurate the base learners are, the less independent they tend to be.
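As a quick check of the numbers in this example, the following short Python sketch (the function name is ours, chosen for illustration) computes the majority-voting error of L independent base classifiers with equal error rate p by summing the upper tail of the binomial distribution.

```python
from math import ceil, comb

def majority_vote_error(L: int, p: float) -> float:
    """Probability that at least ceil(L/2) of L independent base classifiers,
    each with error rate p, are simultaneously wrong (the majority-vote
    error for odd L)."""
    return sum(comb(L, i) * p**i * (1 - p)**(L - i)
               for i in range(ceil(L / 2), L + 1))

# Dietterich's example: 21 independent classifiers with error rate 0.3
print(round(majority_vote_error(21, 0.3), 3))  # -> 0.026, far below p = 0.3
```

Increasing L or decreasing p drives this quantity rapidly toward zero, but only under the independence assumption discussed above.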
Learning algorithms try to find a hypothesis in a given space H of hypotheses, and in many cases, if sufficient data are available, they can find the optimal one for a given problem. In real cases, however, only limited data sets are available, and sometimes only a few examples. In these cases the learning algorithm can find different hypotheses that appear equally accurate with respect to the available training data, and although we can sometimes select among them the simplest or the one with the lowest capacity, we can avoid the problem by averaging or combining them to get a good approximation of the unknown true hypothesis.

Another reason for combining multiple learners arises from the limited representational capability of learning algorithms. In many cases the unknown function to be approximated is not present in H, but a combination of hypotheses drawn from H can expand the space of representable functions, possibly embracing the true one. Although many learning algorithms have universal approximation properties [55, 100], with finite data sets these asymptotic features do not hold: the effective space of hypotheses explored by the learning algorithm is a function of the available data and can be significantly smaller than the virtual H considered in the asymptotic case. From this standpoint ensembles can enlarge the effective hypothesis coverage, expanding the space of representable functions.

Many learning algorithms apply local optimization techniques that may get stuck in local optima. For instance, inductive decision trees employ a greedy local optimization approach, and neural networks apply gradient descent techniques to minimize an error function over the training data. Moreover, optimal training with finite data is NP-hard both for neural networks and for decision trees [13, 57]. As a consequence, even if the learning algorithm can in principle find the best hypothesis, we may not actually be able to find it. Building an ensemble using, for instance, different starting points may achieve a better approximation, even if no assurance of this is given.

Another way to look at the need for ensembles is the classical bias-variance analysis of the error [45, 78]: different works have shown that several ensemble methods reduce variance [15, 87] or both bias and variance [15, 39, 77]. Recently the improved generalization capabilities of different ensemble methods have also been interpreted in the framework of the theory of large margin classifiers [89, 113, 3], showing that methods such as boosting and ECOC enlarge the margins of the examples.

3 Ensemble Methods Overview

A large number of combination schemes and ensemble methods have been proposed in the literature. Combination techniques can be grouped and analysed in different ways, depending on the main classification criterion adopted. If we consider the representation of the input patterns as the main criterion, we can identify two distinct large groups, one that uses the same representation of the inputs and one that uses different representations [68, 69]. Taking the architecture of the ensemble as the main criterion, we can distinguish between serial, parallel and hierarchical schemes [85], and depending on whether or not the base learners are selected by the ensemble algorithm, we can separate selection-oriented and combiner-oriented ensemble methods [61, 81]. In this brief overview we adopt a similar approach, distinguishing between non-generative and generative ensemble methods.
Non-generative ensemble methods confine themselves to combining a set of given, possibly well-designed, base learners: they do not actively generate new base learners, but try to combine a set of existing base classifiers in a suitable way. Generative ensemble methods generate sets of base learners by acting on the base learning algorithm or on the structure of the data set, trying to actively improve the diversity and accuracy of the base learners.

3.1 Non-generative Ensembles

This large group of ensemble methods embraces a wide set of different approaches to combining learning machines. They share the very general common property of using a predetermined set of learning machines previously trained with suitable algorithms. The base learners are then put together by a combiner module that may vary depending on its adaptivity to the input patterns and on the type of output produced by the individual learning machines. The type of combination may depend on the type of output. If only labels are available, or if continuous outputs are hardened, then majority voting, that is, choosing the class most represented among the base classifiers, is used [67, 104, 87]. This approach can be refined by assigning different weights to each classifier to optimize the performance of the combined classifier on the training set [86], or, assuming mutual independence between classifiers, by a Bayesian decision rule that selects the class with the highest posterior probability, computed through the estimated class-conditional probabilities and Bayes' formula [130, 122]. A Bayesian approach has also been used in consensus-based classification of multisource remote sensing data [10, 9, 19], outperforming conventional multivariate methods for classification. To overcome the problem of the independence assumption (which is unrealistic in most cases), the Behavior-Knowledge Space (BKS) method [56] considers each possible combination of class labels, filling a look-up table using the available data set, but this technique requires a huge volume of training data. When the classifier outputs are interpreted as the support for the classes, fuzzy aggregation methods can be applied, such as simple connectives between fuzzy sets or the fuzzy integral [23, 22, 66, 128]; if the classifier outputs are possibilistic, Dempster-Shafer combination rules can be applied [108]. Statistical methods and similarity measures to estimate classifier correlation have also been used to evaluate expert combination for a proper design of multi-expert systems [58]. The base learners can also be aggregated using simple operators such as Minimum, Maximum, Average, Product and Ordered Weighted Averaging [111, 18, 80]. In particular, on the basis of a common Bayesian framework, Josef Kittler provided a theoretical underpinning of many existing classifier combination schemes based on the product and the sum rule, showing also that the sum rule is less sensitive to the errors of subsets of base classifiers [69]. Recently Ljudmila Kuncheva has developed a global combination scheme that takes into account the decision profiles of all the ensemble classifiers with respect to all the classes, designing Decision Templates that summarize in matrix format the average decision profiles of the training set examples. Different similarity measures can be used to evaluate the matching between the matrix of classifier outputs for an input x, that is, the decision profile of x, and the template matrices (one for each class) computed as the class means of the classifier outputs [81].
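To make the simple fixed combiners discussed above concrete, here is a minimal Python sketch of majority voting on crisp labels and of sum and product rules on continuous class supports, in the spirit of the schemes analyzed in [69]; the function names and toy numbers are ours, chosen for illustration.

```python
import numpy as np

def majority_vote(labels):
    """labels: array of predicted class indices, one per base classifier."""
    return np.bincount(labels).argmax()

def sum_rule(supports):
    """supports: (n_classifiers, n_classes) class supports, e.g. estimated
    posteriors; average over classifiers and pick the largest."""
    return supports.mean(axis=0).argmax()

def product_rule(supports):
    """Multiply the supports across classifiers and pick the largest."""
    return supports.prod(axis=0).argmax()

# Toy example: three classifiers, three classes
supports = np.array([[0.7, 0.2, 0.1],
                     [0.2, 0.5, 0.3],
                     [0.5, 0.3, 0.2]])
crisp = supports.argmax(axis=1)   # hardened outputs of the base classifiers
print(majority_vote(crisp), sum_rule(supports), product_rule(supports))
```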
The decision templates approach produces soft class labels that can be seen as a generalization of the conventional crisp and probabilistic combination schemes. Another general approach consists in explicitly training the combining rules, using second-level learning machines on top of the set of base learners [34]. This stacked structure uses the outputs of the base learners as features in an intermediate space: the outputs are fed into a second-level machine that performs a trained combination of the base learners.

3.2 Generative Ensembles

Generative ensemble methods try to improve the overall accuracy of the ensemble by directly boosting the accuracy and the diversity of the base learners. They can modify the structure and the characteristics of the available input data, as in resampling or feature selection methods; they can manipulate the aggregation of the classes (Output Coding methods), select base learners specialized for a specific input region (mixture of experts methods), select a proper set of base learners by evaluating the performance and the characteristics of the component base learners (test-and-select methods), or randomly modify the base learning algorithm (randomized methods).

Resampling methods. Resampling techniques can be used to generate different hypotheses. For instance, bootstrapping techniques [35] may be used to generate different training sets, and a learning algorithm can be applied to the obtained subsets of data in order to produce multiple hypotheses. These techniques are especially effective with unstable learning algorithms, that is, algorithms very sensitive to small changes in the training data, such as neural networks and decision trees. In bagging [15] the ensemble is formed by making bootstrap replicates of the training set, and the multiple generated hypotheses are then used to build an aggregated predictor. The aggregation can be performed by averaging the outputs in regression, or by majority or weighted voting in classification problems [120, 121]. While in bagging the samples are drawn with replacement using a uniform probability distribution, in boosting methods the learning algorithm is called at each iteration using a different distribution or weighting over the training examples [111, 40, 112, 39, 115, 110, 32, 38, 33, 16, 17, 42, 41]. This technique places the highest weight on the examples most often misclassified by the previous base learner: in this way the base learner focuses its attention on the hardest examples. The boosting algorithm then combines the base rules by taking a weighted majority vote of the base rules. Schapire and Singer showed that the training error drops exponentially with the number of iterations [114], and Schapire et al. [113] proved that boosting enlarges the margins of the training examples, showing also that this fact translates into an improved upper bound on the generalization error. Experimental work showed that bagging is effective with noisy data, while boosting, which concentrates its efforts on the hardest examples, seems to be very sensitive to noise [107, 29]. Another training set sampling method consists in constructing training sets by leaving out disjoint subsets of the training data, as in cross-validated committees [101, 102], or by sampling without replacement [116].
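The bootstrap-and-vote scheme of bagging described above can be sketched as follows; the `train` callable is a placeholder for any base learning algorithm returning a model with a `predict` method, so only the resampling and aggregation logic is shown, and the names are ours.

```python
import numpy as np

def bagging_fit(train, X, y, n_estimators=25, seed=0):
    """Train n_estimators base learners on bootstrap replicates of (X, y).
    `train(X, y)` is any base learning algorithm returning a model
    that provides a .predict(X) method."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)   # bootstrap: draw n items with replacement
        models.append(train(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Aggregate base predictions (integer class labels) by majority voting;
    in regression one would simply average the outputs instead."""
    votes = np.stack([m.predict(X) for m in models])   # shape (n_models, n_samples)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```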
Another general approach, named Stochastic Discrimination [73, 74, 75, 72], is based on randomly sampling from a space of subsets of the feature space underlying a given problem, and then combining these subsets to form a final classifier, using a set-theoretic abstraction that removes all the algorithmic details of classifiers and training procedures. In this approach the classifiers' decision regions are considered only in the form of point sets, and the set of classifiers is just a sample from the power set of the feature space. A rigorous mathematical treatment starting from the "representativeness" of the examples used in machine learning problems leads to the design of ensembles of weak classifiers, whose accuracy is governed by the law of large numbers [20].

Feature selection methods. This approach consists in reducing the number of input features of the base learners, a simple way to fight the effects of the classical curse of dimensionality [43]. For instance, in the Random Subspace Method [51, 82], a subset of features is randomly selected and assigned to an arbitrary learning algorithm. In this way one obtains a random subspace of the original feature space, and constructs classifiers inside this reduced subspace. The aggregation is usually performed using weighted voting on the basis of the base classifiers' accuracy. It has been shown that this method is effective for classifiers having a decreasing learning curve constructed on small and critical training sample sizes [119]. The Input Decimation approach [124, 98] reduces the correlation among the errors of the base classifiers by decoupling them, training each base classifier on a different subset of the input features. It differs from the Random Subspace Method in that, for each class, the correlation between each feature and the output of the class is explicitly computed, and the base classifier is trained only on the most correlated subset of features. Feature subspace methods based on partitioning the set of features, where each subset is used by one classifier in the team, are proposed in [130, 99, 18]. Other methods for combining different feature sets using genetic algorithms are proposed in [81, 79]. Different approaches consider feature sets obtained by applying different operators to the original feature space, such as Principal Component Analysis, Fourier coefficients, Karhunen-Loève coefficients, or others [21, 34]. An experiment with a systematic partition of the feature space, using nine different combination schemes, is performed in [83], showing that there is no "best" combination for all situations, and that there is no assurance that a classifier team will always outperform the single best individual.

Mixture of experts methods. The recombination of the base learners can be governed by a supervising learning machine that selects the most appropriate element of the ensemble on the basis of the available input data. This idea led to the mixture of experts methods [60, 59], where a gating network performs the division of the input space and small neural networks perform the effective calculation in each assigned region separately. An extension of this approach is the hierarchical mixture of experts method, where the outputs of the different experts are non-linearly combined by hierarchically organized supervising gating networks [64, 65, 59].
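A minimal sketch of the mixture-of-experts combination just described: a gating function produces input-dependent weights that mix the experts' outputs. The linear-softmax gate and the names are ours and purely illustrative, and the joint training of gate and experts (e.g. by gradient descent or EM, as in [60, 65]) is omitted.

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # numerical stability
    e = np.exp(z)
    return e / e.sum()

def mixture_of_experts(x, experts, gate_W):
    """Combine expert outputs with input-dependent gating coefficients.

    x:        (n_features,) input vector
    experts:  list of callables, each mapping x to a (n_classes,) output
    gate_W:   (n_experts, n_features) weights of a linear gating network
    """
    g = softmax(gate_W @ x)                        # gating coefficients, sum to 1
    outputs = np.stack([f(x) for f in experts])    # (n_experts, n_classes)
    return g @ outputs                             # soft, input-dependent mixture
```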
Cohen and Intrator extended the idea of constructing simple local base learners for different regions of the input space, searching for appropriate architectures to be used locally and for a criterion to select a proper unit for each region of the input space [24, 25]. They proposed a hybrid MLP/RBF network, combining RBF and Perceptron units in the same hidden layer and using forward selection [36] to add units until an error goal is reached. Although the resulting Hybrid Perceptron/Radial Network is not an ensemble in a strict sense, the way in which the regions of the input space and the computational units are selected and tested could in principle be extended to ensembles of learning machines.

Output Coding decomposition methods. Output Coding (OC) methods decompose a multiclass classification problem into a set of two-class subproblems, and then recompose the original problem by combining them to achieve the class label [94, 90, 28]. An equivalent way of thinking about these methods consists in encoding each class as a bit string (named codeword), and in training a different two-class base learner (dichotomizer) to separately learn each codeword bit. When the dichotomizers are applied to classify new points, a suitable measure of similarity between the codeword computed by the ensemble and the class codewords is used to predict the class. Different decomposition schemes have been proposed in the literature: in the One-Per-Class (OPC) decomposition [5], each dichotomizer fi has to separate a single class from all the others; in the PairWise Coupling (PWC) decomposition [50], the task of each dichotomizer fi consists in separating a class Ci from a class Cj, ignoring all the other classes; the Correcting Classifiers (CC) and the PairWise Coupling Correcting Classifiers (PWC-CC) are variants of the PWC decomposition scheme that reduce the noise arising in the PWC scheme from the processing of non-pertinent information by the PWC dichotomizers [96]. Error Correcting Output Coding (ECOC) [30, 31] is the most studied OC method, and has been successfully applied to several classification problems [1, 11, 46, 6, 126, 131]. This decomposition method tries to improve the error-correcting capabilities of the codes generated by the decomposition through the maximization of the minimum distance between each pair of codewords [77, 90]. This goal is achieved by means of the redundancy of the coding scheme [127].

ECOC methods present several open problems. The trade-off between error-recovering capabilities and complexity/learnability of the dichotomies induced by the decomposition scheme has been tackled in several works [3, 125], but an extensive experimental evaluation of this trade-off still has to be performed in order to achieve a better understanding of the phenomenon. A related problem is the analysis of the relationship between codeword length and performance: some preliminary results seem to show that long codewords improve performance [46]. Another open problem, not sufficiently investigated in the literature [46, 91, 11], is the selection of optimal dichotomic learning machines for the decomposition unit. Several methods for generating ECOC codes have been proposed: exhaustive codes, randomized hill climbing [31], random codes [62], and Hadamard and BCH codes [14, 105]. The joint maximization of the distances between rows and columns of the decomposition matrix is still an open problem, as is the design of codes for a given multiclass problem.
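Before turning to the problem of designing the codes, the basic encode/train/decode mechanism described above can be sketched as follows; the small decomposition matrix and the `train_dichotomizer` callable are illustrative placeholders of our own, and decoding uses the Hamming distance between the vector of dichotomizer outputs and the class codewords.

```python
import numpy as np

# Illustrative decomposition matrix for a 4-class problem: one row (codeword)
# per class, one column per dichotomizer; entries are the binary targets
# assigned to each class by each dichotomy.
CODEWORDS = np.array([[0, 0, 1, 1, 1],
                      [0, 1, 0, 1, 0],
                      [1, 0, 0, 0, 1],
                      [1, 1, 1, 0, 0]])

def ecoc_fit(train_dichotomizer, X, y):
    """Train one two-class learner per codeword bit (matrix column);
    y is the array of multiclass labels used to index the codewords."""
    return [train_dichotomizer(X, CODEWORDS[y, bit])
            for bit in range(CODEWORDS.shape[1])]

def ecoc_predict(dichotomizers, x):
    """Predict the class whose codeword is closest, in Hamming distance,
    to the bit string computed by the trained dichotomizers."""
    bits = np.array([d.predict(x) for d in dichotomizers])
    return (CODEWORDS != bits).sum(axis=1).argmin()
```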
For the problem of designing codes, an interesting greedy approach is proposed in [94], and a method based on soft weight sharing to learn error-correcting codes from data is presented in [4]. In [27] it is shown that, given a set of dichotomizers, the problem of finding an optimal decomposition matrix is NP-complete: by introducing continuous codes and casting the design of continuous codes as a constrained optimization problem, an optimal continuous decomposition can be achieved using standard optimization methods. The work in [91] highlights that the effectiveness of ECOC decomposition methods depends mainly on the design of the learning machines implementing the decision units, on the similarity of the ECOC codewords, on the accuracy of the dichotomizers, on the complexity of the multiclass learning problem and on the correlation of the codeword bits. In particular, Peterson and Weldon [105] showed that if the errors on different code bits are dependent, the effectiveness of the error-correcting code is reduced. Consequently, if a decomposition matrix contains very similar rows (dichotomies), each error of a given dichotomizer will be likely to appear in the most correlated dichotomizers, thus reducing the effectiveness of ECOC. These hypotheses have been experimentally supported by a quantitative evaluation of the dependency among the output errors of the decomposition unit of ECOC learning machines, using mutual-information-based measures [92, 93].

Test and select methods. The test-and-select methodology relies on the idea of selection in ensemble creation [117]. The simplest approach is a greedy one [104], where a new learner is added to the ensemble only if the resulting squared error is reduced, but in principle any optimization technique can be used to select the "best" components of the ensemble, including genetic algorithms [97]. It should be noted that the time complexity of selecting optimal subsets of classifiers is exponential in the number of base learners used. From this point of view, heuristic rules such as "choose the best" or "choose the best in each class", applied to classifiers of different types, strongly reduce the computational complexity of the selection phase, as the evaluation of all the different classifier subsets is not required [103]. Moreover, test-and-select methods implicitly include a "production stage", in which a set of classifiers must be generated. Different selection methods, based on search algorithms borrowed from feature selection (forward and backward search) or from the solution of complex optimization tasks (tabu search), are proposed in [109]. Another interesting approach uses clustering methods and a measure of diversity to generate sets of diverse classifiers combined by majority voting, selecting the ensemble with the highest performance [48]. Finally, Dynamic Classifier Selection methods [54, 129, 47] are based on the definition of a function that selects, for each pattern, the classifier that is likely to be the most accurate, estimating, for instance, the accuracy of each classifier in a local region of the feature space surrounding an unknown test pattern [47].

Randomized ensemble methods. Injecting randomness into the learning algorithm is another general method to generate ensembles of learning machines. For instance, if the initial weights of the backpropagation algorithm are set to different random values, we obtain different learning machines that can be combined into an ensemble [76, 101].
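A sketch of this randomization strategy, under the stated assumptions: `train_network(X, y, seed)` stands in for any backpropagation-style trainer whose random weight initialization is controlled by a seed, so only the ensemble construction and the output averaging are shown.

```python
import numpy as np

def random_init_ensemble(train_network, X, y, n_networks=10, base_seed=0):
    """Train the same architecture from different random initial weights.
    `train_network(X, y, seed)` is any trainer (e.g. an MLP trained by
    backpropagation) returning a model with a .predict(X) method."""
    return [train_network(X, y, seed=base_seed + k) for k in range(n_networks)]

def average_outputs(models, X):
    """Combine the differently initialized networks by averaging their
    continuous outputs (e.g. estimated class posteriors)."""
    return np.mean([m.predict(X) for m in models], axis=0)
```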
Several experimental results have shown that randomized learning algorithms used to generate the base elements of ensembles improve the performance of single non-randomized classifiers. For instance, in [29] randomized decision tree ensembles outperform single C4.5 decision trees [106], and adding Gaussian noise to the data inputs, together with bootstrapping and weight regularization, can achieve large improvements in classification accuracy [107].

4 Conclusions

Ensemble methods have proven to be effective in many application domains and can be considered one of the main current directions in machine learning research. We presented an overview of ensemble methods, showing the main areas of research in this discipline and the fundamental reasons why ensemble methods are able to outperform any single classifier within the ensemble. A general taxonomy, distinguishing between generative and non-generative ensemble methods, has been proposed, considering the different ways base learners can be generated or combined together. Several important issues have not been discussed in this paper. In particular, the theoretical problems behind ensemble methods need to be reviewed and discussed in more detail, even though a general theoretical framework for ensemble methods has not yet been developed. Other open problems not covered in this work are the relationships between ensemble methods and data complexity [52, 53, 88], a systematic search for hidden commonalities among the combination approaches despite their superficial differences, and a general analysis of the relationships between ensemble methods and the characteristics of the base learners used in the ensemble itself.

Acknowledgments. This work has been partially funded by INFM.

References

[1] D. Aha and R. Bankert. Cloud classification using error-correcting output codes. In Artificial Intelligence Applications: Natural Science, Agriculture and Environmental Science, volume 11, pages 13–28. 1997. [2] K.M. Ali and M.J. Pazzani. Error reduction through learning multiple descriptions. Machine Learning, 24(3):173–202, 1996. [3] E.L. Allwein, R.E. Schapire, and Y. Singer. Reducing multiclass to binary: a unifying approach for margin classifiers. Journal of Machine Learning Research, 1:113–141, 2000. [4] E. Alpaydin and E. Mayoraz. Learning error-correcting output codes from data. In ICANN'99, pages 743–748, Edinburgh, UK, 1999. [5] R. Anand, G. Mehrotra, C.K. Mohan, and S. Ranka. Efficient classification for multiclass problems using modular neural networks. IEEE Transactions on Neural Networks, 6:117–124, 1995. [6] G. Bakiri and T.G. Dietterich. Achieving high accuracy text-to-speech with machine learning. In Data mining in speech synthesis. 1999. [7] R. Battiti and A.M. Colla. Democracy in neural nets: Voting schemes for classification. Neural Networks, 7:691–707, 1994. [8] E. Bauer and R. Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting and variants. Machine Learning, 36(1/2):525–536, 1999. [9] J. Benediktsson, J. Sveinsson, O. Ersoy, and P. Swain. Parallel consensual neural networks. IEEE Transactions on Neural Networks, 8:54–65, 1997. [10] J. Benediktsson and P. Swain. Consensus theoretic classification methods. IEEE Transactions on Systems, Man and Cybernetics, 22:688–704, 1992. [11] A. Berger. Error correcting output coding for text classification. In IJCAI'99: Workshop on machine learning for information filtering, 1999. [12] C. M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.
[13] A. Blum and R.L. Rivest. Training a 3-node neural network is NP-complete. In Proc. of the 1988 Workshop ob Computational Learning Learning Theory, pages 9–18, San Francisco, CA, 1988. Morgan Kaufmann. [14] R.C. Bose and D.K. Ray-Chauduri. On a class of error correcting binary group codes. Information and Control, (3):68–79, 1960. [15] L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996. [16] L. Breiman. Arcing classifiers. The Annals of Statistics, 26(3):801–849, 1998. [17] L. Breiman. Prediction games and arcing classifiers. Neural Computation, 11(7):1493–1517, 1999. [18] M. Breukelen van, R.P.W. Duin, D. Tax, and J.E. Hartog den. Combining classifiers fir the recognition of handwritten digits. In Ist IAPR TC1 Workshop on Statistical Techniques in Pattern Recognition, pages 13–18, Prague, Czech republic, 1997. [19] G.J. Briem, J.A. Benediktsson, and J.R. Sveinsson. Boosting. Bagging and Consensus Based Classification of Multisource Remote Sensing Data. In J. Kittler and F. Roli, editors, Multiple Classifier Systems. Second International Workshop, MCS 2001, Cambridge, UK, volume 2096 of Lecture Notes in Computer Science, pages 279–288. Springer-Verlag, 2001. [20] D.. Chen. Statistical estimates for Kleinberg’s method of Stochastic Discrimination. PhD thesis, The State University of New York, Buffalo, USA, 1998. [21] K.J. Cherkauker. Human expert-level performance on a scientific image analysis task by a system using combined artificial neural networks. In Chan P., editor, Working notes of the AAAI Workshop on Integrating Multiple Learned Models, pages 15–21. 1996. [22] S. Cho and J. Kim. Combining multiple neural networks by fuzzy integral and robust classification. IEEE Transactions on Systems, Man and Cybernetics, 25:380–384, 1995. [23] S. Cho and J. Kim. Multiple network fusion using fuzzy logic. IEEE Transactions on Neural Networks, 6:497–501, 1995. [24] S. Cohen and N. Intrator. A Hybrid Projection Based and Radial Basis Function Architecture. In J. Kittler and F. Roli, editors, Multiple Classifier Systems. First International Workshop, MCS 2000, Cagliari, Italy, volume 1857 of Lecture Notes in Computer Science, pages 147–156. Springer-Verlag, 2000. [25] S. Cohen and N. Intrator. Automatic Model Selection in a Hybrid Perceptron/Radial Network. In Multiple Classifier Systems. Second International Workshop, MCS 2001, Cambridge, UK, volume 2096 of Lecture Notes in Computer Science, pages 349–358. Springer-Verlag, 2001. [26] N.C. de Condorcet. Essai sur l’ application de l’ analyse à la probabilité des decisions rendues à la pluralité des voix. Imprimerie Royale, Paris, 1785. [27] K. Crammer and Y. Singer. On the learnability and design of output codes for multiclass problems. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, pages 35–46, 2000. [28] T.G. Dietterich. Ensemble methods in machine learning. In J. Kittler and F. Roli, editors, Multiple Classifier Systems. First International Workshop, MCS 2000, Cagliari, Italy, volume 1857 of Lecture Notes in Computer Science, pages 1–15. Springer-Verlag, 2000. [29] T.G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision tress: Bagging, boosting and randomization. Machine Learning, 40(2):139–158, 2000. [30] T.G. Dietterich and G. Bakiri. Error - correcting output codes: A general method for improving multiclass inductive learning programs. In Proceedings of AAAI91, pages 572–577. AAAI Press / MIT Press, 1991. [31] T.G. Dietterich and G. 
Bakiri. Solving multiclass learning problems via errorcorrecting output codes. Journal of Artificial Intelligence Research, (2):263–286, 1995. [32] H. Drucker and C. Cortes. Boosting decision trees. In Advances in Neural Information Processing Systems, volume 8. 1996. [33] H. Drucker, C. Cortes, L. Jackel, Y. LeCun, and V. Vapnik. Boosting and other ensemble methods. Neural Computation, 6(6):1289–1301, 1994. [34] R.P.W. Duin and D.M.J. Tax. Experiments with Classifier Combination Rules. In J. Kittler and F. Roli, editors, Multiple Classifier Systems. First International Workshop, MCS 2000, Cagliari, Italy, volume 1857 of Lecture Notes in Computer Science, pages 16–29. Springer-Verlag, 2000. [35] B. Efron and R. Tibshirani. An introduction to the Bootstrap. Chapman and Hall, New York, 1993. [36] S.E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. In D.S. Touretzky, editor, Advances in Neural Information Processing Systems, volume 2, pages 524–532. Morgan Kauffman, San Mateo, CA, 1990. [37] E. Filippi, M. Costa, and E. Pasero. Multi-layer perceptron ensembles for increased performance and fault-tolerance in pattern recognition tasks. In IEEE International Conference on Neural Networks, pages 2901–2906, Orlando, Florida, 1994. [38] Y. Freund. Boosting a weak learning algorithm by majority. Information and Computation, 121(2):256–285, 1995. [39] Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and Systems Sciences, 55(1):119–139, 1997. [40] Y. Freund and R.E. Schapire. Experiments with a new boosting algorithm. In Proceedings of the 13th International Conference on Machine Learning, pages 148–156. Morgan Kauffman, 1996. [41] J. Friedman. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 39(5), 2001. [42] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: A statistical view of boosting. The Annals of Statistics, 38(2):337–374, 2000. [43] J.H. Friedman. On bias, variance, 0/1 loss and the curse of dimensionality. Data Mining and Knowledge Discovery, 1:55–77, 1997. [44] C. Furlanello and S. Merler. Boosting of Tree-based Classifiers for Predictive Risk Modeling in GIS. In J. Kittler and F. Roli, editors, Multiple Classifier Systems. First International Workshop, MCS 2000, Cagliari, Italy, volume 1857 of Lecture Notes in Computer Science, pages 220–229. Springer-Verlag, 2000. [45] S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the biasvariance dilemma. Neural Computation, 4(1):1–58, 1992. [46] R. Ghani. Using error correcting output codes for text classification. In ICML 2000: Proceedings of the 17th International Conference on Machine Learning, pages 303–310, San Francisco, US, 2000. Morgan Kaufmann Publishers. [47] G. Giacinto and F. Roli. Dynamic Classifier Fusion. In J. Kittler and F. Roli, editors, Multiple Classifier Systems. First International Workshop, MCS 2000, Cagliari, Italy, volume 1857 of Lecture Notes in Computer Science, pages 177– 189. Springer-Verlag, 2000. [48] G. Giacinto and F. Roli. An approach to automatic design of multiple classifier systems. Pattern Recognition Letters, 22:25–33, 2001. [49] T. Hastie and R. Tibshirani. Generalized Additive Models. Chapman and Hall, London, 1990. [50] T. Hastie and R. Tibshirani. Classification by pairwise coupling. The Annals of Statistics, 26(1):451–471, 1998. [51] T.K. Ho. The random subspace method for constructing decision forests. 
IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8):832–844, 1998. [52] T.K. Ho. Complexity of Classification Problems ans Comparative Advantages of Combined Classifiers. In J. Kittler and F. Roli, editors, Multiple Classifier Systems. First International Workshop, MCS 2000, Cagliari, Italy, volume 1857 of Lecture Notes in Computer Science, pages 97–106. Springer-Verlag, 2000. [53] T.K. Ho. Data Complexity Analysis for Classifiers Combination. In J. Kittler and F. Roli, editors, Multiple Classifier Systems. Second International Workshop, MCS 2001, Cambridge, UK, volume 2096 of Lecture Notes in Computer Science, pages 53–67, Berlin, 2001. Springer-Verlag. [54] T.K. Ho, J.J. Hull, and S.N. Srihari. Decision combination in multiple classifiers. IEEE Trans. on Pattern Analysis and Machine Intelligence, 19(4):405–410, 1997. [55] K. Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4:251–257, 1991. [56] Y.S. Huang and Suen. C.Y. Combination of multiple experts for the recognition of unconstrained handwritten numerals. IEEE Trans. on Pattern Analysis and Machine Intelligence, 17:90–94, 1995. [57] L. Hyafil and R.L. Rivest. Constructing optimal binary decision tree is npcomplete. Information Processing Letters, 5(1):15–17, 1976. [58] S. Impedovo and A. Salzo. A New Evaluation Method for Expert Combination in Multi-expert System Designing. In J. Kittler and F. Roli, editors, Multiple Classifier Systems. First International Workshop, MCS 2000, Cagliari, Italy, volume 1857 of Lecture Notes in Computer Science, pages 230–239. SpringerVerlag, 2000. [59] R.A. Jacobs. Methods for combining experts probability assessment. Neural Computation, 7:867–888, 1995. [60] R.A. Jacobs, M.I. Jordan, S.J. Nowlan, and G.E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3(1):125–130, 1991. [61] A. Jain, R. Duin, and J. Mao. Statistical pattern recognition: a review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:4–37, 2000. [62] G. James. Majority vote classifiers: theory and applications. PhD thesis, Department of Statistics - Stanford University, Stanford, CA, 1998. [63] C. Ji and S. Ma. Combinination of weak classifiers. IEEE Trans. Neural Networks, 8(1):32–42, 1997. [64] M. Jordan and R. Jacobs. Hierarchies of adaptive experts. In Advances in Neural Information Processing Systems, volume 4, pages 985–992. Morgan Kauffman, San Mateo, CA, 1992. [65] M.I. Jordan and R.A. Jacobs. Hierarchical mixture of experts and the em algorithm. Neural Computation, 6:181–214, 1994. [66] J.M. Keller, P. Gader, H. Tahani, J. Chiang, and M. Mohamed. Advances in fuzzy integratiopn for pattern recognition. Fuzzy Sets and Systems, 65:273–283, 1994. [67] F. Kimura and M. Shridar. Handwritten Numerical Recognition Based on Multiple Algorithms. Pattern Recognition, 24(10):969–983, 1991. [68] J. Kittler. Combining classifiers: a theoretical framework. Pattern Analysis and Applications, (1):18–27, 1998. [69] J. Kittler, M. Hatef, R.P.W. Duin, and Matas J. On combining classifiers. IEEE Trans. on Pattern Analysis and Machine Intelligence, 20(3):226–239, 1998. [70] J. Kittler and F. (editors) Roli. Multiple Classifier Systems, Proc. of 1st International Workshop, MCS 2000, Cagliari, Italy, volume 1857 of Lecture Notes in Computer Science. Springer-Verlag, Berlin, 2000. [71] J. Kittler and F. (editors) Roli. Multiple Classifier Systems, Proc. of 2nd International Workshop, MCS2001, Cambridge, UK. Springer-Verlag, Berlin, 2001. [72] E.M. Kleinberg. 
On the Algorithmic Implementation of Stochastic Discrimination. IEEE Transactions on Pattern Analysis and Machine Intelligence. [73] E.M. Kleinberg. Stochastic Discrimination. Annals of Mathematics and Artificial Intelligence, pages 207–239, 1990. [74] E.M. Kleinberg. An overtraining-resistant stochastic modeling method for pattern recognition. Annals of Statistics, 4(6):2319–2349, 1996. [75] E.M. Kleinberg. A Mathematically Rigorous Foundation for Supervised Learning. In J. Kittler and F. Roli, editors, Multiple Classifier Systems. First International Workshop, MCS 2000, Cagliari, Italy, volume 1857 of Lecture Notes in Computer Science, pages 67–76. Springer-Verlag, 2000. [76] J. Kolen and J. Pollack. Back propagation is sensitive to initial conditions. In Advances in Neural Information Processing Systems, volume 3, pages 860–867. Morgan Kauffman, San Francisco, CA, 1991. [77] E. Kong and T.G. Dietterich. Error-correcting output coding corrects bias and variance. In The XII International Conference on Machine Learning, pages 313–321, San Francisco, CA, 1995. Morgan Kauffman. [78] A. Krogh and J. Vedelsby. Neural network ensembles, cross validation and active learning. In D.S. Touretzky, G. Tesauro, and T.K. Leen, editors, Advances in Neural Information Processing Systems, volume 7, pages 107–115. MIT Press, Cambridge, MA, 1995. [79] L.I. Kuncheva. Genetic algorithm for feature selection for parallel classifiers. Information Processing Letters, 46:163–168, 1993. [80] L.I. Kuncheva. An application of OWA operators to the aggregation of multiple classification decisions. In The Ordered Weighted Averaging operators. Theory and Applications, pages 330–343. Kluwer Academic Publisher, USA, 1997. [81] L.I. Kuncheva, J.C. Bezdek, and R.P.W. Duin. Decision templates for multiple classifier fusion: an experimental comparison. Pattern Recognition, 34(2):299–314, 2001. [82] L.I. Kuncheva, F. Roli, G.L. Marcialis, and C.A. Shipp. Complexity of Data Subsets Generated by the Random Subspace Method: An Experimental Investigation. In J. Kittler and F. Roli, editors, Multiple Classifier Systems. Second International Workshop, MCS 2001, Cambridge, UK, volume 2096 of Lecture Notes in Computer Science, pages 349–358. Springer-Verlag, 2001. [83] L.I. Kuncheva and C.J. Whitaker. Feature Subsets for Classifier Combination: An Enumerative Experiment. In J. Kittler and F. Roli, editors, Multiple Classifier Systems. Second International Workshop, MCS 2001, Cambridge, UK, volume 2096 of Lecture Notes in Computer Science, pages 228–237. Springer-Verlag, 2001. [84] L.I. Kuncheva et al. Is independence good for combining classifiers? In Proc. of 15th Int. Conf. on Pattern Recognition, Barcelona, Spain, 2000. [85] L. Lam. Classifier combinations: Implementations and theoretical issues. In Multiple Classifier Systems. First International Workshop, MCS 2000, Cagliari, Italy, volume 1857 of Lecture Notes in Computer Science, pages 77–86. Springer-Verlag, 2000. [86] L. Lam and C. Sue. Optimal combination of pattern classifiers. Pattern Recognition Letters, 16:945–954, 1995. [87] L. Lam and C. Sue. Application of majority voting to pattern recognition: an analysis of its behavior and performance. IEEE Transactions on Systems, Man and Cybernetics, 27(5):553–568, 1997. [88] M. Li and P. Vitanyi. An Introduction to Kolmogorov Complexity and Its Applications. Springer-Verlag, Berlin, 1993. [89] L. Mason, P. Bartlett, and J. Baxter. Improved generalization through explicit optimization of margins. Machine Learning, 2000. [90] F. Masulli and G. Valentini. Comparing decomposition methods for classification. In R.J. Howlett and L.C. Jain, editors, KES'2000, Fourth International Conference on Knowledge-Based Intelligent Engineering Systems & Allied Technologies, pages 788–791, Piscataway, NJ, 2000. IEEE. [91] F. Masulli and G. Valentini. Effectiveness of error correcting output codes in multiclass learning problems. In Lecture Notes in Computer Science, volume 1857, pages 107–116. Springer-Verlag, Berlin, Heidelberg, 2000. [92] F. Masulli and G. Valentini. Dependence among Codeword Bits Errors in ECOC Learning Machines: an Experimental Analysis. In Lecture Notes in Computer Science, volume 2096, pages 158–167. Springer-Verlag, Berlin, 2001. [93] F. Masulli and G. Valentini. Quantitative Evaluation of Dependence among Outputs in ECOC Classifiers Using Mutual Information Based Measures. In K. Marko and P. Webos, editors, Proceedings of the International Joint Conference on Neural Networks IJCNN'01, volume 2, pages 784–789, Piscataway, NJ, USA, 2001. IEEE. [94] E. Mayoraz and M. Moreira. On the decomposition of polychotomies into dichotomies. In The XIV International Conference on Machine Learning, pages 219–226, Nashville, TN, July 1997. [95] S. Merler, C. Furlanello, B. Larcher, and A. Sboner. Tuning Cost-Sensitive Boosting and its Application to Melanoma Diagnosis. In J. Kittler and F. Roli, editors, Multiple Classifier Systems. Second International Workshop, MCS 2001, Cambridge, UK, volume 2096 of Lecture Notes in Computer Science, pages 32–42. Springer-Verlag, 2001. [96] M. Moreira and E. Mayoraz. Improved pairwise coupling classifiers with correcting classifiers. In C. Nedellec and C. Rouveirol, editors, Lecture Notes in Artificial Intelligence, Vol. 1398, pages 160–171, Berlin, Heidelberg, New York, 1998. [97] D.W. Opitz and J.W. Shavlik. Actively searching for an effective neural network ensemble. Connection Science, 8(3/4):337–353, 1996. [98] N.C. Oza and K. Tumer. Input Decimation Ensembles: Decorrelation through Dimensionality Reduction. In J. Kittler and F. Roli, editors, Multiple Classifier Systems. Second International Workshop, MCS 2001, Cambridge, UK, volume 2096 of Lecture Notes in Computer Science, pages 238–247. Springer-Verlag, 2001. [99] H.S. Park and S.W. Lee. Off-line recognition of large sets handwritten characters with multiple Hidden-Markov models. Pattern Recognition, 29(2):231–244, 1996. [100] J. Park and I.W. Sandberg. Approximation and radial basis function networks. Neural Computation, 5(2):305–316, 1993. [101] B. Parmanto, P. Munro, and H. Doyle. Improving committee diagnosis with resampling techniques. In D.S. Touretzky, M. Mozer, and M. Hesselmo, editors, Advances in Neural Information Processing Systems, volume 8, pages 882–888. MIT Press, Cambridge, MA, 1996. [102] B. Parmanto, P. Munro, and H. Doyle. Reducing variance of committee prediction with resampling techniques. Connection Science, 8(3/4):405–416, 1996. [103] D. Partridge and W.B. Yates. Engineering multiversion neural-net systems. Neural Computation, 8:869–893, 1996. [104] M.P. Perrone and L.N. Cooper. When networks disagree: ensemble methods for hybrid neural networks. In R.J. Mammone, editor, Artificial Neural Networks for Speech and Vision, pages 126–142. Chapman & Hall, London, 1993. [105] W.W. Peterson and E.J.Jr. Weldon. Error correcting codes. MIT Press, Cambridge, MA, 1972. [106] J.R. Quinlan. C4.5 Programs for Machine Learning. Morgan Kauffman, 1993. [107] Y.
Raviv and N. Intrator. Bootstrapping with noise: An effective regularization technique. Connection Science, 8(3/4):355–372, 1996. [108] G. Rogova. Combining the results of several neural network classifiers. Neural Networks, 7:777–781, 1994. [109] F. Roli, G. Giacinto, and G. Vernazza. Methods for Designing Multiple Classifier Systems. In J. Kittler and F. Roli, editors, Multiple Classifier Systems. Second International Workshop, MCS 2001, Cambridge, UK, volume 2096 of Lecture Notes in Computer Science, pages 78–87. Springer-Verlag, 2001. [110] R. Schapire and Y. Singer. Boostexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135–168, 2000. [111] R.E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197–227, 1990. [112] R.E. Schapire. A brief introduction to boosting. In Thomas Dean, editor, 16th International Joint Conference on Artificial Intelligence, pages 1401–1406. Morgan Kauffman, 1999. [113] R.E. Schapire, Y. Freund, P. Bartlett, and W. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651–1686, 1998. [114] R.E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297–336, 1999. [115] H. Schwenk and Y. Bengio. Training methods for adaptive boosting of neural networks. In Advances in Neural Information Processing Systems, volume 10, pages 647–653. 1998. [116] A. Sharkey, N. Sharkey, and G. Chandroth. Diverse neural net solutions to a fault diagnosis problem. Neural Computing and Applications, 4:218–227, 1996. [117] A. Sharkey, N. Sharkey, U. Gerecke, and G. Chandroth. The test and select approach to ensemble combination. In J. Kittler and F. Roli, editors, Multiple Classifier Systems. First International Workshop, MCS 2000, Cagliari, Italy, volume 1857 of Lecture Notes in Computer Science, pages 30–44. Springer-Verlag, 2000. [118] A. Sharkey (editor). Combining Artificial Neural Nets: Ensemble and Modular Multi-Net Systems. Springer-Verlag, London, 1999. [119] M. Skurichina and R.P.W. Duin. Bagging, boosting and the random subspace method for linear classifiers. Pattern Analysis and Applications. (in press). [120] M. Skurichina and R.P.W. Duin. Bagging for linear classifiers. Pattern Recognition, 31(7):909–930, 1998. [121] M. Skurichina and R.P.W. Duin. Bagging and the Random Subspace Method for Redundant Feature Spaces. In Multiple Classifier Systems. Second International Workshop, MCS 2001, Cambridge, UK, volume 2096 of Lecture Notes in Computer Science, pages 1–10. Springer-Verlag, 2001. [122] C. Suen and L. Lam. Multiple classifier combination methodologies for different output levels. In Multiple Classifier Systems. First International Workshop, MCS 2000, Cagliari, Italy, volume 1857 of Lecture Notes in Computer Science, pages 52–66. Springer-Verlag, 2000. [123] K. Tumer and J. Ghosh. Error correlation and error reduction in ensemble classifiers. Connection Science, 8(3/4):385–404, 1996. [124] K. Tumer and N.C. Oza. Decimated input ensembles for improved generalization. In IJCNN-99, The IEEE-INNS-ENNS International Joint Conference on Neural Networks, 1999. [125] G. Valentini. Upper bounds on the training error of ECOC-SVM ensembles. Technical Report TR-00-17, DISI - Dipartimento di Informatica e Scienze dell'Informazione - Università di Genova, 2000. ftp://ftp.disi.unige.it/person/ValentiniG/papers/TR-00-17.ps.gz. [126] G. Valentini. Gene expression data analysis of human lymphoma using Support Vector Machines and Output Coding ensembles. Artificial Intelligence in Medicine (to appear). [127] J. Van Lint. Coding theory. Springer-Verlag, Berlin, 1971. [128] D. Wang, J.M. Keller, C.A. Carson, K.K. McAdoo-Edwards, and C.W. Bailey. Use of fuzzy logic inspired features to improve bacterial recognition through classifier fusion. IEEE Transactions on Systems, Man and Cybernetics, 28B(4):583–591, 1998. [129] K. Woods, W.P. Kegelmeyer, and K. Bowyer. Combination of multiple classifiers using local accuracy estimates. IEEE Trans. on Pattern Analysis and Machine Intelligence, 19(4):405–410, 1997. [130] L. Xu, C. Krzyzak, and C. Suen. Methods of combining multiple classifiers and their applications to handwriting recognition. IEEE Transactions on Systems, Man and Cybernetics, 22(3):418–435, 1992. [131] C. Yeang et al. Molecular classification of multiple tumor types. In ISMB 2001, Proceedings of the 9th International Conference on Intelligent Systems for Molecular Biology, pages 316–322, Copenhagen, Denmark, 2001. Oxford University Press.