Dipartimento di Informatica e
Scienze dell’Informazione
Giorgio Valentini
Ensemble methods based on bias–variance analysis
Theses Series
DISI-TH-2003-June 24
DISI, Università di Genova
v. Dodecaneso 35, 16146 Genova, Italy
http://www.disi.unige.it/
Università degli Studi di Genova
Dipartimento di Informatica e
Scienze dell’Informazione
Dottorato di Ricerca in Informatica
Ph.D. Thesis in Computer Science
Giorgio Valentini
Ensemble methods based on bias–variance
analysis
April, 2003
Ph.D. Thesis in Computer Science
Submitted by Giorgio Valentini
DISI, Univ. di Genova
valenti@disi.unige.it
Date of submission: April 2003
Title: Ensemble methods based on bias–variance analysis
Advisor: Prof. Francesco Masulli
Dipartimento di Informatica, Università di Pisa, Italy
masulli@di.unipi.it
Ext. Reviewers:
Prof. Ludmila Kuncheva
School of Informatics,
University of Wales, Bangor, UK
l.i.kuncheva@bangor.ac.uk
Prof. Fabio Roli
Dipartimento di Ingegneria
Elettrica ed Elettronica
Università degli Studi di Cagliari, Italy
roli@diee.unica.it
Abstract
Ensembles of classifiers represent one of the main research directions in machine learning.
Two main theories are invoked to explain the success of ensemble methods. The first considers ensembles within the framework of large margin classifiers, showing that ensembles enlarge the margins and thereby enhance the generalization capabilities of learning algorithms. The second is based on the classical bias–variance decomposition of the error, and shows that ensembles can reduce variance and/or bias.
In accordance with this second approach, this thesis pursues a twofold purpose: on the
one hand it explores the possibility of using bias–variance decomposition of the error as an
analytical tool to study the properties of learning algorithms; on the other hand it explores
the possibility of designing ensemble methods based on bias–variance analysis of the error.
First, bias–variance decomposition of the error is considered as a tool to analyze learning
algorithms. This work shows how to apply the theories of Domingos and James on bias–variance
decomposition of the error to the analysis of learning algorithms. Extended experiments
with Support Vector Machines (SVMs) are presented, and the analysis of the relationships
between bias, variance, kernel type and its parameters provides a characterization of the
error decomposition, offering insights into the way SVMs learn.
In a similar way, bias–variance analysis is applied as a tool to explain the properties of ensembles of learners. A bias–variance analysis of ensembles based on resampling techniques
is conducted, showing that, as expected, bagging is a variance-reduction ensemble method,
while the theoretical property of cancelled variance holds only for Breiman's random aggregated predictors.
In addition to analyzing learning algorithms, bias–variance analysis can offer guidance to
the design of ensemble methods. This work shows that it provides a theoretical and practical
tool to develop new ensemble methods well-tuned to the characteristics of a specific base
learner.
On the basis of the analysis and experiments performed on SVMs and bagged ensembles of
SVMs, new ensemble methods based on bias–variance analysis are proposed. In particular,
Lobag (Low bias bagging) selects low-bias base learners and then combines them through
bootstrap aggregating techniques. This approach acts on both bias, through the selection
of low-bias base learners, and variance, through bootstrap aggregation of the selected low-bias
base learners. Moreover, a new potential class of ensemble methods (heterogeneous
ensembles of SVMs), which aggregate different SVM models on the basis of their bias–variance
characteristics, is introduced.
From an application standpoint, it is also shown that the proposed ensemble methods can
be successfully applied to the analysis of DNA microarray data.
To my dear mouse
Acknowledgements
I would like to thank Tom Dietterich for his friendly indirect support of my thesis. Indeed,
the main idea behind my thesis, that is, applying bias–variance analysis as a tool to study
learning algorithms and to develop new ensemble methods, comes from Tom. Moreover,
I would like to thank him for the long discussions (especially by e-mail) about different
topics related to my thesis, and for his suggestions and constructive criticism.
I would also like to thank Franco Masulli: without his support and encouragement I would
probably not have finished my Ph.D. in Computer Science. I also thank him for the large
degree of freedom I enjoyed during my research activity.
Thanks to DISI and to the University of Genova for allowing me to pursue my Ph.D.
studies, and to the DISI people who helped me during that time.
Thanks to INFM, Istituto Nazionale di Fisica della Materia, especially for the financial
support of my research activity and for the cluster of workstations that I used for my
experiments.
Finally, I would like to thank my wife Cristina for bearing the too many Saturdays and
Sundays spent by her husband on his papers, books and computers. This thesis is dedicated
to her, even if I suspect that she wisely prefers her ancient potteries and archeological sites.
Table of Contents
List of Figures . . . 6
List of Tables . . . 11
Chapter 1 Introduction . . . 12
Chapter 2 Ensemble methods . . . 18
2.1 Reasons for Combining Multiple Learners . . . 19
2.2 Ensemble Methods Overview . . . 21
2.2.1 Non-generative Ensembles . . . 21
2.2.2 Generative Ensembles . . . 23
2.2.2.1 Resampling methods . . . 23
2.2.2.2 Feature selection methods . . . 24
2.2.2.3 Mixtures of experts methods . . . 25
2.2.2.4 Output Coding decomposition methods . . . 25
2.2.2.5 Test and select methods . . . 26
2.2.2.6 Randomized ensemble methods . . . 27
2.3 New directions in ensemble methods research . . . 27
Chapter 3 Bias–variance decomposition of the error . . . 30
3.1 Bias–Variance Decomposition for the 0/1 loss function . . . 31
3.1.1 Expected loss depends on the randomness of the training set and the target . . . 32
3.1.2 Optimal and main prediction . . . 33
3.1.3 Bias, unbiased and biased variance . . . 33
3.1.4 Domingos bias–variance decomposition . . . 36
3.1.5 Bias, variance and their effects on the error . . . 37
3.2 Measuring bias and variance . . . 38
3.2.1 Measuring with artificial or large benchmark data sets . . . 39
3.2.2 Measuring with small data sets . . . 41
Chapter 4 Bias–Variance Analysis in single SVMs . . . 43
4.1 Experimental setup . . . 43
4.1.1 Data sets . . . 43
4.1.1.1 P2 . . . 44
4.1.1.2 Waveform . . . 44
4.1.1.3 Grey-Landsat . . . 44
4.1.1.4 Letter-Two . . . 45
4.1.1.5 Spam . . . 45
4.1.1.6 Musk . . . 46
4.1.2 Experimental tasks . . . 46
4.1.2.1 Set up of the data . . . 46
4.1.2.2 Tasks . . . 48
4.1.3 Software used in the experiments . . . 50
4.2 Results . . . 50
4.2.1 Gaussian kernels . . . 51
4.2.1.1 The discriminant function computed by the SVM-RBF classifier . . . 51
4.2.1.2 Behavior of SVMs with large σ values . . . 58
4.2.1.3 Relationships between generalization error, training error, number of support vectors and capacity . . . 58
4.2.2 Polynomial and dot-product kernels . . . 65
4.2.3 Comparing kernels . . . 69
4.3 Characterization of Bias–Variance Decomposition of the Error . . . 73
4.3.1 Gaussian kernels . . . 74
4.3.2 Polynomial and dot-product kernels . . . 76
Chapter 5 Bias–variance analysis in random aggregated and bagged ensembles of SVMs . . . 79
5.1 Random aggregating and bagging . . . 80
5.1.1 Random aggregating in regression . . . 81
5.1.2 Random aggregating in classification . . . 82
5.1.3 Bagging . . . 84
5.2 Bias–variance analysis in bagged SVM ensembles . . . 86
5.2.1 Experimental setup . . . 86
5.2.2 Bagged RBF-SVM ensembles . . . 89
5.2.2.1 Bias–variance decomposition of the error . . . 89
5.2.2.2 Decomposition with respect to the number of base learners . . . 89
5.2.2.3 Comparison of bias–variance decomposition in single and bagged RBF-SVMs . . . 89
5.2.3 Bagged polynomial SVM ensembles . . . 91
5.2.3.1 Bias–variance decomposition of the error . . . 91
5.2.3.2 Decomposition with respect to the number of base learners . . . 93
5.2.3.3 Comparison of bias–variance decomposition in single and bagged polynomial SVMs . . . 94
5.2.4 Bagged dot-product SVM ensembles . . . 94
5.2.4.1 Bias–variance decomposition of the error . . . 94
5.2.4.2 Decomposition with respect to the number of base learners . . . 95
5.2.4.3 Comparison of bias–variance decomposition in single and bagged dot-product SVMs . . . 96
5.2.5 Bias–variance characteristics of bagged SVM ensembles . . . 97
5.3 Bias–variance analysis in random aggregated ensembles of SVMs . . . 98
5.3.1 Experimental setup . . . 100
5.3.2 Random aggregated RBF-SVM ensembles . . . 102
5.3.2.1 Bias–variance decomposition of the error . . . 102
5.3.2.2 Decomposition with respect to the number of base learners . . . 103
5.3.2.3 Comparison of bias–variance decomposition in single and random aggregated RBF-SVMs . . . 104
5.3.3 Random aggregated polynomial SVM ensembles . . . 106
5.3.3.1 Bias–variance decomposition of the error . . . 106
5.3.3.2 Decomposition with respect to the number of base learners . . . 106
5.3.3.3 Comparison of bias–variance decomposition in single and random aggregated polynomial SVMs . . . 108
5.3.4 Random aggregated dot-product SVM ensembles . . . 110
5.3.4.1 Bias–variance decomposition of the error . . . 110
5.3.4.2 Decomposition with respect to the number of base learners . . . 112
5.3.4.3 Comparison of bias–variance decomposition in single and random aggregated dot-product SVMs . . . 112
5.3.5 Bias–variance characteristics of random aggregated SVM ensembles . . . 112
5.4 Undersampled bagging . . . 115
5.5 Summary of bias–variance analysis results in random aggregated and bagged ensembles of SVMs . . . 116
Chapter 6 SVM ensemble methods based on bias–variance analysis . . . 121
6.1 Heterogeneous Ensembles of SVMs . . . 123
6.2 Bagged Ensemble of Selected Low-Bias SVMs . . . 125
6.2.1 Parameters controlling bias in SVMs . . . 125
6.2.2 Aggregating low bias base learners by bootstrap replicates . . . 125
6.2.3 Measuring Bias and Variance . . . 127
6.2.4 Selecting low-bias base learners . . . 127
6.2.5 Previous related work . . . 128
6.3 The lobag algorithm . . . 128
6.3.1 The Bias–variance decomposition procedure . . . 129
6.3.2 The Model selection procedure . . . 131
6.3.3 The overall Lobag algorithm . . . 131
6.3.4 Multiple hold-out Lobag algorithm . . . 132
6.3.5 Cross-validated Lobag algorithm . . . 134
6.3.6 A heterogeneous Lobag approach . . . 136
6.4 Experiments with lobag . . . 136
6.4.1 Experimental setup . . . 136
6.4.2 Results . . . 138
6.5 Application of lobag to DNA microarray data analysis . . . 141
6.5.1 Data set and experimental set-up . . . 142
6.5.2 Gene selection . . . 142
6.5.3 Results . . . 143
Conclusions . . . 147
Bibliography . . . 150
List of Figures
3.1 Case analysis of error. . . . 35
3.2 Effects of biased and unbiased variance on the error. The unbiased variance increases the error, while the biased variance decreases it. . . . 36
4.1 P2 data set, a bidimensional two-class synthetic data set. . . . 45
4.2 Procedure to generate samples to be used for bias–variance analysis with single SVMs. . . . 47
4.3 Procedure to perform bias–variance analysis on single SVMs. . . . 48
4.4 Grey-Landsat data set. Error (a) and its decomposition in bias (b), net variance (c), unbiased variance (d), and biased variance (e) in SVM RBF, varying both C and σ. . . . 52
4.5 Bias–variance decomposition of the error in bias, net variance, unbiased and biased variance in SVM RBF, varying σ and for fixed C values: (a) Waveform, (b) Grey-Landsat, (c) Letter-Two with C = 0.1, (d) Letter-Two with C = 1, (e) Letter-Two with added noise and (f) Spam. . . . 53
4.6 Letter-Two data set. Bias–variance decomposition of the error in bias, net variance, unbiased and biased variance in SVM RBF, while varying C and for some fixed values of σ: (a) σ = 0.01, (b) σ = 0.1, (c) σ = 1, (d) σ = 5, (e) σ = 20, (f) σ = 100. . . . 54
4.7 The discriminant function computed by the SVM on the P2 data set with σ = 0.01, C = 1. . . . 55
4.8 The discriminant function computed by the SVM on the P2 data set with σ = 1, C = 1. . . . 56
4.9 The discriminant function computed by the SVM on the P2 data set. (a) σ = 20, C = 1, (b) σ = 20, C = 1000. . . . 57
4.10 Grey-Landsat data set. Bias–variance decomposition of the error in bias, net variance, unbiased and biased variance in SVM RBF, while varying σ and for some fixed values of C: (a) C = 0.1, (b) C = 1, (c) C = 10, (d) C = 100. . . . 59
4.11 Bias–variance decomposition of the error in bias, net variance, unbiased and biased variance in SVM RBF, while varying σ and for some fixed values of C: (a) P2, with C = 1, (b) P2, with C = 1000, (c) Musk, with C = 1, (d) Musk, with C = 1000. . . . 60
4.12 Letter-Two data set. Error, bias, training error, halved fraction of support vectors, and estimated VC dimension while varying the σ parameter and for some fixed values of C: (a) C = 1, (b) C = 10, (c) C = 100, and (d) C = 1000. . . . 62
4.13 Grey-Landsat data set. Error, bias, training error, halved fraction of support vectors, and estimated VC dimension while varying the σ parameter and for some fixed values of C: (a) C = 1, (b) C = 10, (c) C = 100, and (d) C = 1000. . . . 63
4.14 Bias–variance decomposition of the error in bias, net variance, unbiased and biased variance in polynomial SVM, while varying the degree and for some fixed values of C: (a) Waveform, C = 0.1, (b) Waveform, C = 50, (c) Letter-Two, C = 0.1, (d) Letter-Two, C = 50. . . . 65
4.15 P2 data set. Error (a) and its decomposition in bias (b) and net variance (c), varying both C and the polynomial degree. . . . 66
4.16 Letter-Two data set. Bias–variance decomposition of the error in bias, net variance, unbiased and biased variance in polynomial SVM, while varying C and for some polynomial degrees: (a) degree = 2, (b) degree = 3, (c) degree = 5, (d) degree = 10. . . . 67
4.17 Bias in polynomial SVMs with (a) Waveform and (b) Spam data sets, varying both C and the polynomial degree. . . . 68
4.18 Bias–variance decomposition of the error in bias, net variance, unbiased and biased variance in polynomial SVM, varying C: (a) P2 data set with degree = 6, (b) Spam data set with degree = 3. . . . 68
4.19 Bias–variance decomposition of the error in bias, net variance, unbiased and biased variance in dot-product SVM, varying C: (a) P2, (b) Grey-Landsat, (c) Letter-Two, (d) Letter-Two with added noise, (e) Spam, (f) Musk. . . . 70
4.20 Bias–variance decomposition of the error in bias, net variance, unbiased and biased variance with respect to C, considering different kernels. (a) P2, gaussian; (b) Musk, gaussian; (c) P2, polynomial; (d) Musk, polynomial; (e) P2, dot-product; (f) Musk, dot-product. . . . 71
4.21 Bias–variance decomposition of the error in bias, net variance, unbiased and biased variance with respect to C, considering different kernels. (a) Waveform, gaussian; (b) Letter-Two, gaussian; (c) Waveform, polynomial; (d) Letter-Two, polynomial; (e) Waveform, dot-product; (f) Letter-Two, dot-product. . . . 72
4.22 The 3 regions of error in RBF-SVM with respect to σ. . . . 75
4.23 Behaviour of polynomial SVM with respect to the bias–variance decomposition of the error. . . . 76
4.24 Behaviour of the dot-product SVM with respect to the bias–variance decomposition of the error. . . . 77
5.1 Bagging for classification problems. . . . 85
5.2 Procedure to generate samples to be used for bias–variance analysis in bagging. . . . 87
5.3 Procedure to perform bias–variance analysis on bagged SVM ensembles. . . . 88
5.4 Bias–variance decomposition of the error in bias, net variance, unbiased and biased variance in bagged RBF-SVMs, while varying σ and for some fixed values of C. P2 data set: (a) C = 1, (b) C = 100. Letter-Two data set: (c) C = 1, (d) C = 100. . . . 90
5.5 Bias–variance decomposition of the error in bias, net variance, unbiased and biased variance in bagged SVM RBF, with respect to the number of iterations. (a) Grey-Landsat data set, (b) Spam data set. . . . 91
5.6 Comparison of bias–variance decomposition between single RBF-SVMs (lines labeled with crosses) and bagged RBF-SVM ensembles (lines labeled with triangles), while varying σ and for some fixed values of C. Letter-Two data set: (a) C = 1, (b) C = 100. Waveform data set: (c) C = 1, (d) C = 100. . . . 92
5.7 Bias–variance decomposition of the error in bias, net variance, unbiased and biased variance in bagged polynomial SVM, while varying the degree and for some fixed values of C. P2 data set: (a) C = 0.1, (b) C = 100. Letter-Two data set: (c) C = 0.1, (d) C = 100. . . . 93
5.8 Bias–variance decomposition of the error in bias, net variance, unbiased and biased variance in bagged polynomial SVMs, with respect to the number of iterations. (a) P2 data set, (b) Letter-Two data set. . . . 94
5.9 Comparison of bias–variance decomposition between single polynomial SVMs (lines labeled with crosses) and bagged polynomial SVM ensembles (lines labeled with triangles), while varying the degree and for some fixed values of C. P2 data set: (a) C = 1, (b) C = 100. Grey-Landsat data set: (c) C = 1, (d) C = 100. . . . 95
5.10 Bias–variance decomposition of the error in bias, net variance, unbiased and biased variance in bagged dot-product SVM, while varying C. (a) Waveform data set, (b) Grey-Landsat, (c) Letter-Two with noise, (d) Spam. . . . 96
5.11 Bias–variance decomposition of the error in bias, net variance, unbiased and biased variance in bagged dot-product SVMs, with respect to the number of iterations. (a) Grey-Landsat data set, (b) Letter-Two data set. . . . 97
5.12 Comparison of bias–variance decomposition between single dot-product SVMs (lines labeled with crosses) and bagged dot-product SVM ensembles (lines labeled with triangles), while varying the values of C. (a) Waveform, (b) Grey-Landsat, (c) Spam, (d) Musk. . . . 98
5.13 Procedure to generate samples to be used for bias–variance analysis in random aggregation. . . . 101
5.14 Procedure to perform bias–variance analysis on random aggregated SVM ensembles. . . . 102
5.15 Bias–variance decomposition of the error in bias, net variance, unbiased and biased variance in random aggregated gaussian SVMs, while varying σ and for some fixed values of C. P2 data set: (a) C = 1, (b) C = 100. Letter-Two data set: (c) C = 1, (d) C = 100. . . . 104
5.16 Bias–variance decomposition of the error in bias, net variance, unbiased and biased variance in random aggregated SVM RBF, with respect to the number of iterations. P2 data set: (a) C = 1, σ = 0.2, (b) C = 100, σ = 0.5. Letter-Two data set: (c) C = 100, σ = 1, (d) C = 100, σ = 2. . . . 105
5.17 Comparison of bias–variance decomposition between single RBF-SVMs (lines labeled with crosses) and random aggregated ensembles of RBF-SVMs (lines labeled with triangles), while varying σ and for some fixed values of C. Letter-Two data set: (a) C = 1, (b) C = 100. Waveform data set: (c) C = 1, (d) C = 100. . . . 107
5.18 Bias–variance decomposition of the error in bias, net variance, unbiased and biased variance in random aggregated polynomial SVM, while varying the degree and for some fixed values of C. P2 data set: (a) C = 1, (b) C = 100. Letter-Two data set: (c) C = 1, (d) C = 100. . . . 108
5.19 Bias–variance decomposition of the error in bias, net variance, unbiased and biased variance in random aggregated polynomial SVMs, with respect to the number of iterations. P2 data set: (a) C = 1, degree = 6, (b) C = 100, degree = 9. Letter-Two data set: (c) C = 1, degree = 3, (d) C = 100, degree = 9. . . . 109
5.20 Comparison of bias–variance decomposition between single polynomial SVMs (lines labeled with crosses) and random aggregated polynomial SVM ensembles (lines labeled with triangles), while varying the degree and for some fixed values of C. P2 data set: (a) C = 1, (b) C = 100. Grey-Landsat data set: (c) C = 1, (d) C = 100. . . . 110
5.21 Bias–variance decomposition of the error in bias, net variance, unbiased and biased variance in random aggregated dot-product SVM, while varying C. (a) Grey-Landsat data set, (b) Letter-Two, (c) Letter-Two with noise, (d) Spam. . . . 111
5.22 Bias–variance decomposition of the error in bias, net variance, unbiased and biased variance in random aggregated dot-product SVM, with respect to the number of iterations. (a) Waveform, (b) Letter-Two, (c) Spam, (d) Musk. . . . 113
5.23 Comparison of bias–variance decomposition between single dot-product SVMs (lines labeled with crosses) and random aggregated dot-product SVM ensembles (lines labeled with triangles), while varying the values of C. (a) Waveform, (b) Grey-Landsat, (c) Spam, (d) Musk. . . . 114
5.24 Comparison of the error between single SVMs, bagged and random aggregated ensembles of SVMs. Results refer to 7 different data sets. (a) Gaussian kernels, (b) Polynomial kernels, (c) Dot-product kernels. . . . 117
5.25 Comparison of the relative error, bias and unbiased variance reduction between bagged and single SVMs (lines labeled with triangles), and between random aggregated and single SVMs (lines labeled with squares). B/S stands for Bagged versus Single SVMs, and R/S for Random aggregated versus Single SVMs. Results refer to 7 different data sets. (a) Gaussian kernels, (b) Polynomial kernels, (c) Dot-product kernels. . . . 118
6.1 Graphical comparison of Lobag, bagging, and single SVM. . . . 139
6.2 GCM data set: bias–variance decomposition of the error in bias, net variance, unbiased and biased variance, while varying the regularization parameter with linear SVMs (a), the degree with polynomial kernels (b), and the kernel parameter σ with gaussian SVMs (c). . . . 145
List of Tables
4.1 Data sets used in the experiments. . . . 44
4.2 Compared best results with different kernels and data sets. RBF-SVM stands for SVM with gaussian kernel, Poly-SVM for SVM with polynomial kernel, and D-prod SVM for SVM with dot-product kernel. Var. unb. and Var. bias. stand for unbiased and biased variance. . . . 73
5.1 Comparison of the results between single and bagged SVMs. . . . 99
5.2 Comparison of the results between single and random aggregated SVMs. . . . 115
6.1 Results of the experiments using pairs of train D and test T sets. Elobag, Ebag and ESVM stand respectively for the estimated error of Lobag, bagged and single SVMs on the test set T. The last three columns show the confidence level according to the McNemar test. L/B, L/S and B/S stand respectively for the comparisons Lobag/Bagging, Lobag/Single SVM and Bagging/Single SVM. If the confidence level is equal to 1, no significant difference is registered. . . . 137
6.2 Comparison of the results between Lobag, bagging and single SVMs. Elobag, Ebag and ESVM stand respectively for the average error of Lobag, bagging and single SVMs. r.e.r. stands for the relative error reduction between Lobag and single SVMs and between bagging and single SVMs. . . . 140
6.3 GCM data set: results with single SVMs. . . . 143
6.4 GCM data set: compared results of single and bagged SVMs. . . . 144
6.5 GCM data set: compared results of single, bagged and Lobag SVMs on gene expression data. An asterisk in the last three columns indicates that a statistically significant difference is registered (p = 0.05) according to the McNemar test. . . . 146
Chapter 1
Introduction
Ensembles of classifiers represent one of the main research directions in machine learning [43, 183].
The success of this emerging discipline is the result of the exchange and interaction between the different cultural backgrounds and perspectives of researchers from diverse
disciplines, ranging from neural networks to statistics, pattern recognition and soft computing, as documented by the recent international workshops on Multiple Classifier Systems
organized by Josef Kittler and Fabio Roli [106, 107, 157].
Indeed, empirical studies have shown that, in both classification and regression problems, ensembles are often much more accurate than the individual base learners that make them
up [8, 44, 63], and different theoretical explanations have been proposed to justify the
effectiveness of some commonly used ensemble methods [105, 161, 111, 2].
Nonetheless, the variety of terms used to indicate sets of learning machines that work together to solve a machine learning problem [123, 190, 191, 105, 92,
30, 51, 12, 7, 60] reflects the absence of a unified theory of ensemble methods and the
youth of this research area.
A large number of combination schemes and ensemble methods have been proposed in the literature. Combination by majority voting [103, 147], where the class most represented
among the base classifiers is chosen, is probably the first ensemble method proposed in the
literature, long before the first computers appeared [38]. A refinement of this approach is
combination through Bayesian decision rules, where the class with the highest posterior
probability, computed through the estimated class-conditional probabilities and Bayes'
formula, is selected [172]. The base learners can also be aggregated using simple operators
such as Minimum, Maximum, Average, Product and Ordered Weighted Averaging
[160, 20, 117], and if we can interpret the classifier outputs as the support for the
classes, fuzzy aggregation methods can be applied [30, 188, 118]. Other methods consist of
training the combining rules, using second-level learning machines on top of the set of
base learners [55] or meta-learning techniques [25, 150]. Other classes of ensemble methods
try to improve the overall accuracy of the ensemble by directly boosting the accuracy and
the diversity of the base learners. For instance, they can modify the structure and the
characteristics of the available input data, as in resampling [15, 63, 161] or
feature selection [81] methods. They can also manipulate the aggregation of the classes,
as in Output Coding methods [45, 46], or they can select base learners specialized for a
specific input region, as in mixture-of-experts methods [98, 90]. Other approaches inject
randomness at different levels into the base learning algorithm [44, 19], or select
a proper set of base learners by evaluating the performance of the component base learners,
as in test-and-select methods [166, 156].
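For illustration, the two simplest fixed combination rules mentioned above, majority voting and the Average operator, can be sketched as follows. This is a minimal NumPy sketch, not code from the thesis; the function names are our own.

```python
import numpy as np

def majority_vote(predictions):
    """Combine the class predictions of several base classifiers by majority voting.

    predictions: array of shape (n_classifiers, n_samples) with integer class labels.
    Returns the most voted label for each sample (ties broken by the lowest label).
    """
    predictions = np.asarray(predictions)
    n_classes = predictions.max() + 1
    # For each sample (column), count how many classifiers voted for each class.
    votes = np.apply_along_axis(
        lambda col: np.bincount(col, minlength=n_classes), 0, predictions)
    return votes.argmax(axis=0)

def average_combiner(probabilities):
    """Combine continuous classifier outputs (e.g. posterior estimates) with the
    simple Average operator: mean over classifiers, then argmax over classes.

    probabilities: array of shape (n_classifiers, n_samples, n_classes).
    """
    return np.asarray(probabilities).mean(axis=0).argmax(axis=1)

# Three classifiers, four samples:
preds = [[0, 1, 1, 2],
         [0, 1, 2, 2],
         [1, 1, 2, 0]]
print(majority_vote(preds))  # -> [0 1 2 2]
```

The other operators (Minimum, Maximum, Product) differ from `average_combiner` only in the reduction applied over the classifier axis.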
Despite the variety of and differences between the diverse classes of ensemble methods
proposed in the literature, they share a common characteristic: they emphasize the
combination scheme, or more generally the way the base learners are aggregated. Of
course this is a fundamental aspect of ensemble methods, as it represents the main structural element of any ensemble of learning machines. Indeed, ensemble methods have been
conceived quite independently of the characteristics of specific base learners, emphasizing
the combination scheme rather than the properties of the underlying learning algorithm.
However, several researchers have shown that the effectiveness of ensemble methods depends on
the specific characteristics of the base learners: in particular, on their individual accuracy,
on the relationship between diversity and accuracy of the base learners [77, 119, 72, 121],
on their stability [16], and on their general geometrical properties [32]. In other words,
the analysis of the features and properties of the base learners used in ensemble methods
is another important issue in the design of ensemble methods [43]. We could then try to
develop ensemble methods well-tuned to the characteristics of specific base learners.
According to this standpoint, this research starts from the "bottom edge" of ensemble methods: it tries to exploit the features of base learning algorithms in order to build around them ensemble methods well-tuned to the learning characteristics of the base learners. This requires analyzing their learning properties, discovering and using appropriate tools to perform this task. In principle, we could use measures of accuracy, diversity and complexity to study and characterize the behaviour of learning algorithms. However, as shown by L. Kuncheva [121], diversity may be related in a complex way to accuracy, and it may be very difficult to characterize the behaviour of a learning algorithm in terms of the capacity/complexity of the resulting learning machine [187].
Decomposition of the error into bias and variance is a classical topic in statistical learning [52, 68]. Recently, Domingos proposed a unified theory of bias–variance analysis of the error, independent of the particular loss function [47], and James extended the Domingos approach, introducing the notions of bias and variance effect [94].
Using the theoretical tools provided by Domingos, we tried to evaluate whether the analysis of bias and variance can provide insights into the way learning algorithms work. In principle, we could use the knowledge obtained from this analysis to design new learning algorithms, as suggested by Domingos himself [49], but in this work we did not follow this research line.
We used bias–variance analysis as a tool to study the behavior of learning methods, focusing in particular on Support Vector Machines [35]. A related issue of this work was to evaluate whether it is possible to characterize learning in terms of the bias–variance decomposition of the error, studying the relationships between the learning parameters and the bias and variance components of the error. We also considered whether and how this analysis could be extended from specific learning algorithms to ensemble methods, trying to characterize the behaviour of ensemble methods based on resampling techniques in terms of the bias–variance decomposition of the error.
Besides studying whether bias–variance theory offers a rationale to analyze the behaviour of learning algorithms and to explain the properties of ensembles of classifiers, the second main research topic of this thesis consists in investigating whether the decomposition of the error into bias and variance can also guide the design of ensemble methods, by relating measurable properties of learning algorithms to the expected performance of ensembles [182].
On the basis of the knowledge gained from the bias–variance analysis of specific learning algorithms, we tried to understand whether it is possible to design new ensemble methods well-tuned to the bias–variance characteristics of specific base learners. Moreover, we studied whether we could design ensemble methods with embedded bias–variance analysis procedures, in order to take into account the bias–variance characteristics of both the base learner and the ensemble. For instance, we investigated whether ensemble methods based on resampling techniques (e.g. bagging) could be enhanced through the bias–variance analysis approach, or whether we could build variants of bagging exploiting the bias–variance characteristics of the base learners. This research line was also motivated from an applicative standpoint, namely small-sized and high-dimensional classification problems in bioinformatics.
Outline of the thesis
Chapter 2 (Ensemble methods) places the main subject of this thesis within the general framework of ensemble methods. It presents an overview of ensembles of learning machines, explaining the main reasons why they are able to outperform any single classifier within the ensemble, and proposing a taxonomy based on the main ways base classifiers can be generated or combined. New directions in ensemble methods research are outlined, introducing ensemble methods well-tuned to the learning characteristics of the base learners.
The main goal of this work is twofold: on one hand it consists in evaluating if bias-variance
analysis can be used as a tool to study and characterize learning algorithms and ensemble
methods; on the other hand it consists in evaluating if the decomposition of the error
into bias and variance can guide the design of ensemble methods by relating measurable
properties of algorithms to the expected performances of ensembles. In both cases bias–variance theory plays a central role, and chapter 3 (Bias–variance decomposition of the error) summarizes the main topics of Domingos' bias–variance theory. After a brief outline of the literature on bias–variance decomposition, the chapter focuses on the bias–variance decomposition for the 0/1 loss, as we are mainly interested in classification problems. We underline that in this context the decomposition is not additive, and we present an analysis of the combined effect of bias, variance and noise on the overall error, also comparing the theoretical approaches of Domingos and James on this topic. Moreover, we consider the procedures to measure bias and variance, distinguishing the cases in which "large" or "small" data sets are available. In the latter case we propose to use out-of-bag procedures, as they are unbiased and computationally less expensive than multiple hold-out and cross-validation techniques.
Chapter 4 (Bias–variance analysis in single SVMs) presents an experimental bias–variance analysis in SVMs. The main goal of this chapter is to study the learning properties of SVMs with respect to their bias–variance characteristics and to characterize their learning behavior. In particular, this analysis provides insights into the way SVMs learn, unraveling the specific effects of bias, unbiased and biased variance on the overall error. Firstly, the experimental set-up, involving the training and testing of more than half a million different SVMs with gaussian, polynomial and dot-product kernels, is presented. Then we analyze the relationships of the bias–variance decomposition with different kernels, regularization and kernel parameters, using both synthetic and "real world" data. In particular, with gaussian kernels we study the reasons why SVMs usually do not learn with small values of the spread parameter, their behavior with large values of the spread, and the relationships between generalization error, training error, number of support vectors and capacity. We also provide a characterization of the bias–variance decomposition of the error with gaussian kernels, distinguishing three main regions characterized by specific trends of bias and variance with respect to the values of the spread parameter σ. Similar characterizations are provided for polynomial and dot-product kernels.
Bias–variance analysis can also be a useful tool to study the characteristics of ensemble methods. To this purpose, chapter 5 (Bias–variance analysis in random aggregated and bagged ensembles of SVMs) provides an extended experimental analysis of the bias–variance
decomposition of the error for ensembles based on resampling techniques. We consider theoretical issues about the relationships between random aggregating and bagging. Indeed, bagging can be seen as an approximation of random aggregating, that is, a process by which base learners, trained on samples drawn according to an unknown probability distribution from the entire universe population, are aggregated through majority voting (classification) or averaging (regression). Breiman showed that random aggregating and bagging are effective with unstable learning algorithms, that is, when small changes in the training set can result in large changes in the predictions of the base learners; we prove that there is a strict relationship between instability and the variance of the
base predictors. Theoretical analysis shows that random aggregating should significantly reduce variance without increasing bias. Bagging too, as an approximation of random aggregating, should reduce variance. We performed an extended experimental analysis, involving the training and testing of about 10 million SVMs, to test these theoretical outcomes.
Moreover, we analyzed bias–variance in bagged and random aggregated SVM ensembles, to
understand the effect of bagging and random aggregating on bias and variance components
of the error in SVMs. In both cases we evaluated, for each kernel, the expected error and its decomposition into bias, net-variance, unbiased and biased variance with respect to the learning parameters of the base learners. Then we analyzed the bias–variance decomposition as a function of the number of base learners employed. Finally, we compared bias
and variance with respect to the learning parameters in random aggregated and bagged
SVM ensembles and in the corresponding single SVMs, in order to study the effect of
bagging and random aggregating on the bias and variance components of the error. With
random aggregated ensembles we registered a very large reduction of the net-variance with
respect to single SVMs. It was always reduced close to 0, independently of the type of
kernel used. This behaviour is due primarily to the unbiased variance reduction, while
the bias remains unchanged with respect to the single SVMs. With bagging we have also
a reduction of the error, but not as large as with random aggregated ensembles. Indeed,
unlike random aggregating, net and unbiased variance, although reduced, are not actually
reduced to 0, while bias remains unchanged or slightly increases. An interesting byproduct
of this analysis is that undersampled bagging can be viewed as another approximation of random aggregating (using a bootstrap approximation of the unknown probability distribution), if we consider the universe U as a data set from which undersampled data, that is, data sets whose cardinality is much smaller than the cardinality of U, are randomly drawn with replacement. This approach should provide a very significant reduction of the variance, and could in practice be applied to data mining problems in which learning algorithms cannot comfortably manage very large data sets.
In addition to providing insights into the behavior of learning algorithms, the analysis of
the bias–variance decomposition of the error can identify the situations in which ensemble
methods might improve base learner performances. Indeed the decomposition of the error
into bias and variance can guide the design of ensemble methods by relating measurable
properties of algorithms to the expected performance of ensembles. Chapter 6 (SVM ensemble methods based on bias–variance analysis) presents two possible ways of applying bias–variance analysis to develop SVM-based ensemble methods. The first approach tries to apply bias–variance analysis to enhance both the accuracy and the diversity of the base learners. The second research direction consists in bootstrap aggregating low-bias base learners in order to lower both bias and variance. Regarding the first approach, only some general research lines are outlined. For the second direction, a specific new method that we named Lobag (Low bias bagged SVMs) is introduced, together with several variants. Lobag applies bias–variance analysis in order to direct the tuning of Support Vector Machines toward the optimization of the performance of bagged ensembles. Specifically, since bagging is primarily a variance-reduction method, and since the overall error is (to a first approximation) the sum of bias and variance, SVMs should be tuned to minimize bias before being combined by bagging. The key issue of this method consists in efficiently evaluating the bias–variance decomposition of the error. We embed this procedure inside the Lobag ensemble method by implementing a relatively inexpensive out-of-bag estimate of bias and variance. The pseudocode of Lobag is provided,
as well as a C++ implementation (available on-line). Numerical experiments show that
Lobag compares favorably with bagging, and some preliminary results show that it can be
successfully applied to DNA microarray data analysis.
The conclusions summarize the main results achieved, and several open questions delineate possible future work and developments.
Chapter 2
Ensemble methods
Ensembles are sets of learning machines whose decisions are combined to improve the performance of the overall system. In the last decade, one of the main research areas in machine learning has been the construction of ensembles of learning machines. Although in the literature [123, 190, 191, 105, 92, 30, 51, 12, 7, 60] a plethora of terms, such as committee, classifier fusion, combination, aggregation and others, is used to indicate sets of learning machines that work together to solve a machine learning problem, in this thesis we shall use the term ensemble in its widest meaning, in order to include the whole range of combining methods. This variety of terms and specifications reflects the absence of a unified theory of ensemble methods and the youth of this research area. However, the great effort of researchers, reflected by the amount of literature dedicated to this emerging discipline [167, 106, 107, 157], has achieved meaningful and encouraging results.
Empirical studies showed that, both in classification and regression problems, ensembles are often much more accurate than the individual base learners that make them up [8, 44, 63], and recently different theoretical explanations have been proposed to justify the effectiveness of some commonly used ensemble methods [105, 161, 111, 2].
The interest in this research area is also motivated by the availability of very fast computers and networks of workstations at a relatively low cost, which allow the implementation and experimentation of complex ensemble methods using off-the-shelf computer platforms. However, as explained in Sect. 2.1, there are deeper reasons to use ensembles of learning machines, motivated by the intrinsic characteristics of ensemble methods.
This chapter presents a brief overview of the main areas of research, without pretending
to be exhaustive or to explain the detailed characteristics of each ensemble method.
2.1 Reasons for Combining Multiple Learners
Both empirical observations and specific machine learning applications confirm that a given learning algorithm may outperform all others for a specific problem or for a specific subset of the input data, but it is unusual to find a single expert achieving the best results on the overall problem domain. As a consequence, multiple learner systems try to exploit the locally different behavior of the base learners to enhance the accuracy and the reliability of the overall inductive learning system. There is also hope that if some learner fails, the overall system can recover from the error. Employing multiple learners can also derive from the application context, such as when multiple sensor data are available, inducing a natural decomposition of the problem. In more general cases, we may have at our disposal different training sets, collected at different times and possibly having different features, and we can use a different specialized learning machine for each of them.
However, there are deeper reasons why ensembles can improve performance with respect to a single learning machine. As an example, consider the following one, given by Tom Dietterich in [43]. If we have a dichotomic classification problem and L hypotheses whose errors are lower than 0.5, then the resulting majority voting ensemble has an error lower than the single classifiers, as long as the errors of the base learners are uncorrelated. In fact, if we have 21 classifiers, the error rates of each base learner are all equal to p = 0.3, and the errors are independent, the overall error of the majority voting ensemble will be given by the area under the binomial distribution where more than L/2 hypotheses are wrong:
\[
P_{error} = \sum_{i=\lceil L/2 \rceil}^{L} \binom{L}{i}\, p^{i} (1-p)^{L-i}
\quad \Rightarrow \quad P_{error} = 0.026 \ll p = 0.3
\]
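The figure 0.026 can be checked directly; the following sketch (using the same L, p, and independence assumption as above) evaluates the binomial tail:

```python
from math import ceil, comb

def majority_vote_error(L: int, p: float) -> float:
    """Error of a majority vote over L independent base classifiers,
    each with individual error rate p: the probability that at least
    ceil(L/2) of them are wrong (upper tail of a binomial distribution)."""
    return sum(comb(L, i) * p**i * (1 - p)**(L - i)
               for i in range(ceil(L / 2), L + 1))

print(round(majority_vote_error(21, 0.3), 3))  # 0.026, far below p = 0.3
```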
This result has been studied by mathematicians since the end of the 18th century in the context of the social sciences: in fact, the Condorcet Jury Theorem [38] proved that the judgment of a committee is superior to those of its individuals, provided the individuals have reasonable competence (that is, a probability of being correct higher than 0.5). As noted in [122], this theorem theoretically justifies recent research on multiple "weak" classifiers [95, 81, 110], an interesting research direction diametrically opposed to the development of highly accurate and specialized classifiers.
This simple example also highlights an important issue in the design of ensembles of learning machines: the effectiveness of ensemble methods relies on the independence of the errors committed by the component base learners. In this example, if the independence assumption does not hold, we have no assurance that the ensemble will lower the error, and we know that in many cases the errors are correlated. From a general standpoint, we know that the effectiveness of ensemble methods depends on the accuracy and the diversity of the base learners, that is, on whether they exhibit low error rates and produce different errors [78, 174, 132]. The related concept of independence between the base learners has been commonly regarded as a requirement for effective classifier combination, but Kuncheva and Whitaker have recently shown that independent classifiers do not always outperform dependent ones [121]. In fact, there is a trade-off between accuracy and independence: the more accurate the base learners are, the less independent they are.
Learning algorithms try to find a hypothesis in a given space H of hypotheses, and in many cases, if we have sufficient data, they can find the optimal one for a given problem. But in real cases we have only limited data sets, and sometimes only a few examples are available. In these cases the learning algorithm can find different hypotheses that appear equally accurate with respect to the available training data, and although we can sometimes select among them the simplest or the one with the lowest capacity, we can avoid the problem by averaging or combining them to get a good approximation of the unknown true hypothesis.
Another reason for combining multiple learners arises from the limited representational
capability of learning algorithms. In many cases the unknown function to be approximated
is not present in H, but a combination of hypotheses drawn from H can expand the space of
representable functions, embracing also the true one. Although many learning algorithms
present universal approximation properties [86, 142], with finite data sets these asymptotic
features do not hold: the effective space of hypotheses explored by the learning algorithm
is a function of the available data and it can be significantly smaller than the virtual H
considered in the asymptotic case. From this standpoint ensembles can enlarge the effective
hypotheses coverage, expanding the space of representable functions.
Many learning algorithms apply local optimization techniques that may get stuck in local optima. For instance, inductive decision trees employ a greedy local optimization approach, and neural networks apply gradient descent techniques to minimize an error function over the training data. Moreover, optimal training with finite data, both for neural networks and for decision trees, is NP-hard [13, 88]. As a consequence, even if the learning algorithm can in principle find the best hypothesis, we may not actually be able to find it. Building an ensemble using, for instance, different starting points may achieve a better approximation, even if no assurance of this is given.
Another way to look at the need for ensembles is represented by the classical bias–variance
analysis of the error [68, 115]: different works have shown that several ensemble methods
reduce variance [15, 124] or both bias and variance [15, 62, 114]. Recently the improved
generalization capabilities of different ensemble methods have also been interpreted in the
framework of the theory of large margin classifiers [129, 162, 2], showing that methods such
as boosting and ECOC enlarge the margins of the examples.
2.2 Ensemble Methods Overview
A large number of combination schemes and ensemble methods have been proposed in the literature. Combination techniques can be grouped and analyzed in different ways, depending on the main classification criterion adopted. If we consider the representation of the input patterns as the main criterion, we can identify two distinct large groups: one that uses the same representation of the inputs and one that uses different representations [104, 105].
Taking the architecture of the ensemble as the main criterion, we can distinguish between serial, parallel and hierarchical schemes [122], and depending on whether or not the base learners are selected by the ensemble algorithm, we can separate selection-oriented and combiner-oriented ensemble methods [92, 118]. In this brief overview we adopt an approach similar to the ones cited above, distinguishing between non-generative and generative ensemble methods.
Non-generative ensemble methods confine themselves to combining a set of given, possibly well-designed, base learners: they do not actively generate new base learners, but try to combine a set of existing base classifiers in a suitable way. Generative ensemble methods generate sets of base learners by acting on the base learning algorithm or on the structure of the data set, and try to actively improve the diversity and accuracy of the base learners.
Note that in some cases it is difficult to assign a specific ensemble method to either of the proposed superclasses: the purpose of this taxonomy is simply to provide a general framework for the main ensemble methods proposed in the literature.
2.2.1 Non-generative Ensembles
This large group of ensemble methods embraces a wide set of different approaches to combining learning machines. They share the very general common property of using a predetermined set of learning machines, previously trained with suitable algorithms. The base learners are then put together by a combiner module, which may vary depending on its adaptivity to the input patterns and on the requirements on the output of the individual learning machines.
The type of combination may depend on the type of output. If only labels are available, or if continuous outputs are hardened, then majority voting, that is, selecting the class most represented among the base classifiers, is used [103, 147, 124].
This approach can be refined by assigning different weights to each classifier to optimize the performance of the combined classifier on the training set [123], or, assuming mutual independence between classifiers, by applying a Bayesian decision rule that selects the class with the highest posterior probability, computed through the estimated class-conditional probabilities and Bayes' formula [191, 172]. A Bayesian approach has also been used in consensus-based classification of multisource remote sensing data [10, 9, 21], outperforming conventional multivariate methods for classification. To overcome the problem of the independence assumption (which is unrealistic in most cases), the Behavior-Knowledge Space (BKS) method [87] considers each possible combination of class labels, filling a look-up table using the available data set; however, this technique requires a large volume of training data.
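As an illustrative sketch of the BKS idea (the training records and the two hypothetical base classifiers below are assumptions made for the demonstration, not taken from [87]): each combination of base-classifier labels indexes a cell of the look-up table, and each cell stores the class most frequently observed with that combination in the training data.

```python
from collections import Counter, defaultdict

# Hypothetical training records: (labels of the two base classifiers) -> true class.
train = [((0, 0), 0), ((0, 0), 0), ((0, 1), 0), ((0, 1), 1),
         ((1, 0), 1), ((1, 1), 1), ((1, 1), 1), ((0, 1), 1)]

# Fill the BKS look-up table: one cell per observed combination of base labels.
cells = defaultdict(Counter)
for combo, true_class in train:
    cells[combo][true_class] += 1

def bks_classify(combo, default=0):
    """Return the most frequent true class seen with this label combination;
    fall back to a default class for combinations never seen in training."""
    cell = cells.get(combo)
    return cell.most_common(1)[0][0] if cell else default

print(bks_classify((0, 1)))  # 1: class 1 was observed twice vs. once for class 0
```

The need for a fallback on unseen combinations hints at why the method requires a large volume of training data: the table has one cell per possible label combination.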
When we interpret the classifier outputs as the support for the classes, fuzzy aggregation methods can be applied, such as simple connectives between fuzzy sets or the fuzzy integral [30, 29, 100, 188]; if the classifier outputs are possibilistic, Dempster-Shafer combination rules can be applied [154]. Statistical methods and similarity measures to estimate classifier correlation have also been used to evaluate expert system combination for a proper design of multi-expert systems [89].
The base learners can also be aggregated using simple operators such as Minimum, Maximum, Average, Product, Ordered Weighted Averaging and other statistics [160, 20, 117, 155]. In particular, on the basis of a common Bayesian framework, Josef Kittler provided a theoretical underpinning of many existing classifier combination schemes based on the product and the sum rules, showing also that the sum rule is less sensitive to the errors of subsets of base classifiers [105].
Recently, Ludmila Kuncheva has developed a global combination scheme that takes into account the decision profiles of all the ensemble classifiers with respect to all the classes, designing decision templates that summarize in matrix format the average decision profiles of the training set examples. Different similarity measures can be used to evaluate the matching between the matrix of classifier outputs for an input x, that is, the decision profile referred to x, and the template matrices (one for each class) computed as the class means of the classifier outputs [118]. This general fuzzy approach produces soft class labels and can be seen as a generalization of the conventional crisp and probabilistic combination schemes.
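A minimal numeric sketch of the decision-template scheme, assuming hypothetical soft outputs of L = 3 classifiers on a two-class problem and squared Euclidean distance as the similarity measure (the scheme in [118] admits many other similarity measures):

```python
import numpy as np

# Decision profiles: one L x c matrix (L classifiers, c classes) per example.
# Hypothetical soft outputs of 3 classifiers on a two-class problem.
profiles = np.array([
    [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]],   # training examples of class 0
    [[0.8, 0.2], [0.9, 0.1], [0.6, 0.4]],
    [[0.2, 0.8], [0.3, 0.7], [0.4, 0.6]],   # training examples of class 1
    [[0.1, 0.9], [0.2, 0.8], [0.3, 0.7]],
])
labels = np.array([0, 0, 1, 1])

# One template per class: the mean decision profile of its training examples.
templates = np.stack([profiles[labels == c].mean(axis=0) for c in (0, 1)])

def dt_classify(profile):
    """Assign the class whose template is closest (squared Euclidean distance)."""
    dists = ((templates - profile) ** 2).sum(axis=(1, 2))
    return int(dists.argmin())

new_profile = np.array([[0.85, 0.15], [0.7, 0.3], [0.65, 0.35]])
print(dt_classify(new_profile))  # 0: the profile matches the class-0 template
```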
Another general approach consists in explicitly training the combining rules, using second-level learning machines on top of the set of base learners [55, 189]. This stacked structure uses the outputs of the base learners as features in an intermediate space: the outputs are fed into a second-level machine to perform a trained combination of the base learners.
Meta-learning techniques can be interpreted as an extension of the previous approach [25, 26]. Indeed, they can be defined as learning from learned knowledge, and are characterized by meta-level training sets generated by the first-level base learners trained on the "true" data set, and by a meta-learner trained on the meta-level training set [150]. In other words, in meta-learning the integration rule is learned by the meta-learner on the basis of the behavior of the trained base learners.
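The stacked structure described above can be sketched as follows. This is an illustrative toy example, not an implementation from [55] or [150]: the Gaussian-blob data, the nearest-centroid base learners and the logistic combiner are all assumptions made for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-class data: two Gaussian blobs in 2-D (illustrative only).
def make_data(n):
    X = np.vstack([rng.normal(-1.0, 1.0, (n, 2)), rng.normal(1.0, 1.0, (n, 2))])
    y = np.array([0] * n + [1] * n)
    return X, y

X0, y0 = make_data(200)   # level-0 set: trains the base learners
X1, y1 = make_data(200)   # level-1 set: trains the combiner
Xt, yt = make_data(200)   # held-out test set

# Base learners: nearest-centroid classifiers, each seeing one feature only.
def fit_centroids(x, y):
    return np.array([x[y == 0].mean(), x[y == 1].mean()])

def score(cents, x):
    """Soft score for class 1 in [0, 1], from distances to the two centroids."""
    d0, d1 = np.abs(x - cents[0]), np.abs(x - cents[1])
    return d0 / (d0 + d1)

bases = [fit_centroids(X0[:, j], y0) for j in range(2)]

def meta_features(X):
    # Base-learner outputs become the features of the intermediate space.
    return np.column_stack([score(bases[j], X[:, j]) for j in range(2)])

# Level-1 combiner: logistic regression trained by plain gradient descent.
Z1 = meta_features(X1)
w, b = np.zeros(2), 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(Z1 @ w + b)))
    g = p - y1                       # gradient of the logistic loss
    w -= 0.1 * Z1.T @ g / len(y1)
    b -= 0.1 * g.mean()

pred = (1.0 / (1.0 + np.exp(-(meta_features(Xt) @ w + b))) > 0.5).astype(int)
print("stacked accuracy:", (pred == yt).mean())
```

Note that the combiner is trained on data the base learners never saw, so it learns how reliable each base learner actually is rather than echoing their training-set fit.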
2.2.2 Generative Ensembles
Generative ensemble methods try to improve the overall accuracy of the ensemble by directly boosting the accuracy and the diversity of the base learners. They can modify the structure and the characteristics of the available input data, as in resampling and feature selection methods; they can manipulate the aggregation of the classes (Output Coding methods); they can select base learners specialized for a specific input region (mixture of experts methods); they can select a proper set of base learners by evaluating the performance and the characteristics of the component base learners (test-and-select methods); or they can randomly modify the base learning algorithm (randomized methods).
2.2.2.1 Resampling methods
Resampling techniques can be used to generate different hypotheses. For instance, bootstrapping techniques [56] may be used to generate different training sets, and a learning algorithm can be applied to the obtained subsets of data in order to produce multiple hypotheses. These techniques are especially effective with unstable learning algorithms, i.e. algorithms very sensitive to small changes in the training data, such as neural networks and decision trees.
In bagging [15] the ensemble is formed by making bootstrap replicates of the training set, and the multiple generated hypotheses are then used to obtain an aggregated predictor. The aggregation can be performed by averaging the outputs in regression, or by majority or weighted voting in classification problems [169, 170].
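A minimal sketch of bagging with an unstable weak learner. The toy data, the decision-stump base learner and the ensemble size are all assumptions made for illustration, not taken from [15]:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-class data (illustrative): two Gaussian blobs in 5-D.
n, d = 150, 5
X = np.vstack([rng.normal(-0.5, 1.0, (n, d)), rng.normal(0.5, 1.0, (n, d))])
y = np.hstack([np.zeros(n, int), np.ones(n, int)])
Xt = np.vstack([rng.normal(-0.5, 1.0, (n, d)), rng.normal(0.5, 1.0, (n, d))])
yt = np.hstack([np.zeros(n, int), np.ones(n, int)])

def fit_stump(X, y):
    """Best single-feature threshold classifier (an unstable weak learner)."""
    best = None
    for j in range(X.shape[1]):
        for t in np.quantile(X[:, j], np.linspace(0.1, 0.9, 9)):
            err = ((X[:, j] > t).astype(int) != y).mean()
            if best is None or err < best[0]:
                best = (err, j, t)
    return best[1], best[2]

# Bagging: train one stump per bootstrap replicate of the training set...
stumps = []
for _ in range(25):
    idx = rng.integers(0, len(y), len(y))      # sample with replacement
    stumps.append(fit_stump(X[idx], y[idx]))

# ...then aggregate the stump predictions by majority voting.
votes = np.stack([(Xt[:, j] > t).astype(int) for j, t in stumps])
bagged = (votes.mean(axis=0) > 0.5).astype(int)
print("bagged accuracy:", (bagged == yt).mean())
```

The bootstrap replicates perturb which feature and threshold each stump picks, which is exactly the instability that bagging turns into variance reduction.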
While in bagging the samples are drawn with replacement using a uniform probability distribution, in boosting methods the learning algorithm is called at each iteration using a different distribution or weighting over the training examples [160, 63, 161, 62, 164, 159, 50, 61, 51, 17, 18, 65, 64]. This technique places the highest weight on the examples most often misclassified by the previous base learners: in this way the base learner focuses its attention on the hardest examples. The boosting algorithm then combines the base rules by taking a weighted majority vote of them. Schapire and Singer showed that the training error drops exponentially with the number of iterations [163], and Schapire et al. [162] proved that boosting enlarges the margins of the training examples, showing also that this fact translates into a superior upper bound on the generalization error. Experimental work showed that bagging is effective with noisy data, while boosting, concentrating its efforts on noisy data, seems to be very sensitive to noise [153, 44]. Recently, variants of boosting specifically designed for noisy data have been proposed by several authors [152, 39].
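The reweighting mechanism can be sketched in the style of discrete AdaBoost (an illustrative toy example: the one-dimensional data and the stump learner are assumptions, and the many boosting variants cited above differ in their weighting schemes):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy one-dimensional two-class problem (illustrative only).
X = np.sort(rng.uniform(0, 1, 80))
y = (X > 0.35).astype(int)          # a threshold concept...
y[::10] ^= 1                        # ...with a few flipped labels to make it harder

def fit_stump(X, y, w):
    """Threshold classifier minimizing the *weighted* error, either direction."""
    best = None
    for t in X:
        for sign in (1, -1):
            pred = (X > t).astype(int) if sign == 1 else (X <= t).astype(int)
            err = w[pred != y].sum()
            if best is None or err < best[0]:
                best = (err, t, sign)
    return best

# Boosting loop: reweight examples, concentrating on those misclassified so far.
w = np.full(len(y), 1.0 / len(y))
ensemble = []
for _ in range(10):
    err, t, sign = fit_stump(X, y, w)
    err = max(err, 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)     # vote weight of this base rule
    pred = (X > t).astype(int) if sign == 1 else (X <= t).astype(int)
    w *= np.exp(alpha * np.where(pred != y, 1.0, -1.0))  # mistakes gain weight
    w /= w.sum()
    ensemble.append((alpha, t, sign))

# Final rule: weighted majority vote of the base rules.
def predict(x):
    s = sum(alpha * (1 if ((x > t) if sign == 1 else (x <= t)) else -1)
            for alpha, t, sign in ensemble)
    return int(s > 0)

acc = np.mean([predict(x) == yi for x, yi in zip(X, y)])
print("training accuracy:", acc)
```

The flipped labels keep absorbing weight round after round, which illustrates concretely why boosting is sensitive to noise: it cannot distinguish a hard example from a mislabeled one.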
Another resampling method consists in constructing training sets by leaving out disjoint subsets of the training data, as in cross-validated committees [143, 144], or by sampling without replacement [165].
Another general approach, named Stochastic Discrimination [109, 110, 111, 108], is based on randomly sampling from a space of subsets of the feature space underlying a given problem, and then combining these subsets to form a final classifier, using a set-theoretic abstraction to remove all the algorithmic details of classifiers and training procedures. In this approach the classifiers' decision regions are considered only in the form of point sets, and the set of classifiers is just a sample from the power set of the feature space. A rigorous mathematical treatment, starting from the "representativeness" of the examples used in machine learning problems, leads to the design of ensembles of weak classifiers, whose accuracy is governed by the law of large numbers [27].
2.2.2.2 Feature selection methods
This approach consists in reducing the number of input features of the base learners, a simple method to fight the effects of the classical curse of dimensionality [66]. For instance, in the Random Subspace Method [81, 119], a subset of features is randomly selected and assigned to an arbitrary learning algorithm. In this way, one obtains a random subspace of the original feature space, and constructs classifiers inside this reduced subspace. The aggregation is usually performed using weighted voting on the basis of the base classifiers' accuracy. It has been shown that this method is effective for classifiers having a decreasing learning curve constructed on small and critical training sample sizes [168].
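A minimal sketch of the random subspace idea. The toy data and the nearest-centroid base learner are assumptions for illustration, and plain majority voting is used for simplicity, where the method described above weights the votes by base-classifier accuracy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 10 features, only the first 4 carry class signal (illustrative).
n, d = 200, 10
X = rng.normal(0, 1, (2 * n, d))
y = np.hstack([np.zeros(n, int), np.ones(n, int)])
X[y == 1, :4] += 1.0
Xt = rng.normal(0, 1, (2 * n, d))
yt = np.hstack([np.zeros(n, int), np.ones(n, int)])
Xt[yt == 1, :4] += 1.0

# Each ensemble member: nearest-centroid classifier on a random feature subset.
members = []
for _ in range(15):
    feats = rng.choice(d, size=5, replace=False)   # a random subspace
    c0 = X[y == 0][:, feats].mean(axis=0)
    c1 = X[y == 1][:, feats].mean(axis=0)
    members.append((feats, c0, c1))

def predict(Z):
    votes = np.zeros(len(Z))
    for feats, c0, c1 in members:
        d0 = ((Z[:, feats] - c0) ** 2).sum(axis=1)
        d1 = ((Z[:, feats] - c1) ** 2).sum(axis=1)
        votes += (d1 < d0)                          # vote for class 1
    return (votes > len(members) / 2).astype(int)

print("subspace-ensemble accuracy:", (predict(Xt) == yt).mean())
```

Each member sees a different mix of informative and noise features, so the members make different errors, which is precisely the diversity the vote exploits.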
The Input Decimation approach [175, 139] reduces the correlation among the errors of the base classifiers by decoupling them, training each base classifier on a different subset of the input features. It differs from the Random Subspace Method in that, for each class, the correlation between each feature and the output of the class is explicitly computed, and the base classifier is trained only on the most correlated subset of features.
Feature subspace methods obtained by partitioning the set of features, where each subset is used by one classifier in the team, are proposed in [191, 141, 20]. Other methods for combining different feature sets using genetic algorithms are proposed in [118, 116]. Different approaches consider feature sets obtained by applying different operators to the original feature space, such as Principal Component Analysis, Fourier coefficients, Karhunen-Loève coefficients, or others [28, 55]. An experiment with a systematic partition of the feature space, using nine different combination schemes, is performed in [120], showing that there is no "best" combination for all situations and that there is no assurance that a classifier team will always outperform the single best individual.
2.2.2.3 Mixtures of experts methods
The recombination of the base learners can be governed by a supervisor learning machine that selects the most appropriate element of the ensemble on the basis of the available input data. This idea led to the mixture of experts methods [91, 90], where a gating network performs the division of the input space and small neural networks perform the effective computation in each assigned region separately. An extension of this approach is the hierarchical mixture of experts method, where the outputs of the different experts are non-linearly combined by different supervisor gating networks hierarchically organized [97, 98, 90].
Cohen and Intrator extended the idea of constructing local simple base learners for different
regions of input space, searching for appropriate architectures that should be locally used
and for a criterion to select a proper unit for each region of input space [31, 32]. They
proposed a hybrid MLP/RBF network by combining RBF and Perceptron units in the same
hidden layer and using a forward selection approach [58] to add units until a desired error is
reached. Although the resulting Hybrid Perceptron/Radial Network is not in a strict sense
an ensemble, the way by which the regions of the input space and the computational units are selected and tested could in principle be extended to ensembles of learning machines.
2.2.2.4 Output Coding decomposition methods
Output Coding (OC) methods decompose a multiclass classification problem into a set of two-class subproblems, and then recompose the original problem by combining them to achieve the class label [134, 130, 43]. An equivalent way of thinking about these methods consists in encoding each class as a bit string (named codeword), and in training a different two-class base learner (dichotomizer) to separately learn each codeword bit. When the dichotomizers are applied to classify new points, a suitable measure of similarity between the codeword computed by the ensemble and the class codewords is used to predict the class.
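The encode/decode machinery can be sketched as follows. The 7-bit codewords below are made up for illustration (they are not a published code); their minimum pairwise Hamming distance is 4, so any single dichotomizer error can be corrected.

```python
codewords = {           # 4 classes, 7 dichotomizers (one per bit)
    "c1": (0, 0, 0, 0, 0, 0, 0),
    "c2": (0, 1, 1, 1, 1, 0, 0),
    "c3": (1, 0, 1, 1, 0, 1, 0),
    "c4": (1, 1, 0, 1, 0, 0, 1),
}

def hamming(a, b):
    # Number of bit positions where the two codewords disagree.
    return sum(x != y for x, y in zip(a, b))

def decode(bits):
    # Predict the class whose codeword is nearest to the dichotomizers' output.
    return min(codewords, key=lambda c: hamming(codewords[c], bits))

# Even if one dichotomizer is wrong, the prediction is recovered:
# the bit string below is c3's codeword with its last bit flipped.
pred = decode((1, 0, 1, 1, 0, 1, 1))
```

Here `pred` is still `"c3"`: the flipped bit leaves the string at Hamming distance 1 from c3's codeword but at distance 3 or more from every other codeword.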
Different decomposition schemes have been proposed in the literature. In the One-Per-Class (OPC) decomposition [4], each dichotomizer fi has to separate a single class from all the others; in the PairWise Coupling (PWC) decomposition [79], the task of each dichotomizer fi consists in separating a class Ci from a class Cj, ignoring all the other classes; the Correcting Classifiers (CC) and the PairWise Coupling Correcting Classifiers (PWC-CC) are variants of the PWC decomposition scheme that reduce the noise originating in the PWC scheme from the non-pertinent information processed by the PWC dichotomizers [137].
Error Correcting Output Coding (ECOC) [45, 46] is the most studied OC method, and has been successfully applied to several classification problems [1, 11, 69, 6, 177, 192]. This decomposition method tries to improve the error correcting capabilities of the codes generated by the decomposition through the maximization of the minimum distance between each pair of codewords [114, 130]. This goal is achieved by means of the redundancy of the coding scheme [186].
ECOC methods present several open problems. The tradeoff between error recovering
capabilities and complexity/learnability of the dichotomies induced by the decomposition
scheme has been tackled in several works [2, 176], but an extensive experimental evaluation
of the tradeoff has to be performed in order to achieve a better understanding of this
phenomenon. A related problem is the analysis of the relationship between codeword length and performance: some preliminary results seem to show that long codewords improve performance [69]. Another open problem, not sufficiently investigated in the literature [69, 131, 11], is the selection of optimal dichotomic learning machines for the decomposition
unit. Several methods for generating ECOC codes have been proposed: exhaustive codes,
randomized hill climbing [46], random codes [93], and Hadamard and BCH codes [14, 148].
A further open problem is the joint maximization of distances between rows and columns of the decomposition matrix. Another consists in designing codes for a given
multiclass problem. An interesting greedy approach is proposed in [134], and a method
based on soft weight sharing to learn error correcting codes from data is presented in [3].
In [36] it is shown that given a set of dichotomizers the problem of finding an optimal
decomposition matrix is NP-complete: by introducing continuous codes and casting the
design problem of continuous codes as a constrained optimization problem, we can achieve
an optimal continuous decomposition using standard optimization methods.
The work in [131] highlights that the effectiveness of ECOC decomposition methods depends mainly on the design of the learning machines implementing the decision units, on
the similarity of the ECOC codewords, on the accuracy of the dichotomizers, on the complexity of the multiclass learning problem and on the correlation of the codeword bits.
In particular, Peterson and Weldon [148] showed that if errors on different code bits are dependent, the effectiveness of the error correcting code is reduced. Consequently, if a decomposition matrix contains very similar rows (dichotomies), each error of a given dichotomizer will likely also appear in the most correlated dichotomizers, thus reducing the effectiveness of ECOC. These hypotheses have been experimentally supported by a
quantitative evaluation of the dependency among output errors of the decomposition unit
of ECOC learning machines using mutual information based measures [132, 133].
2.2.2.5 Test and select methods
The test and select methodology relies on the idea of selection in ensemble creation [166].
The simplest approach is a greedy one [147], where a new learner is added to the ensemble only if the resulting squared error is reduced, but in principle any optimization technique can be used to select the "best" components of the ensemble, including genetic algorithms [138].
It should be noted that the time complexity of the selection of optimal subsets of classifiers is exponential with respect to the number of base learners used. From this point of view, heuristic rules such as "choose the best" or "choose the best in the class", using classifiers of different types, strongly reduce the computational complexity of the selection phase, as the evaluation of the different classifier subsets is not required [145]. Moreover, test and select methods implicitly include a "production stage", by which a set of classifiers must be generated.
Different selection methods, based on search algorithms borrowed from feature selection (forward and backward search) or from the solution of complex optimization tasks (tabu search), are proposed in [156]. Another interesting approach uses clustering methods and a measure of diversity to generate sets of diverse classifiers combined by majority voting, selecting the ensemble with the highest performance [72]. Finally, Dynamic Classifier Selection methods [85, 190, 71] are based on the definition of a function that selects for each pattern the classifier which is probably the most accurate, estimating, for instance, the accuracy of each classifier in a local region of the feature space surrounding an unknown test pattern [71, 74, 73].
2.2.2.6 Randomized ensemble methods
Injecting randomness into the learning algorithm is another general method to generate
ensembles of learning machines. For instance, if we initialize with random values the initial
weights in the backpropagation algorithm, we can obtain different learning machines that
can be combined into an ensemble [113, 143].
Several experimental results show that randomized learning algorithms used to generate the base elements of ensembles improve the performance of single non-randomized classifiers. For instance, in [44] randomized decision tree ensembles outperform single C4.5 decision trees [151], and adding Gaussian noise to the data inputs, together with bootstrap and weight regularization, can achieve large improvements in classification accuracy [153].
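A toy computation, not a reproduction of the cited experiments, shows why combining such randomized learners pays off: if the base classifiers err independently (an idealization, in the spirit of the law-of-large-numbers argument of [27]), the error of the majority vote drops quickly as the ensemble grows.

```python
import math

def majority_vote_accuracy(p, n):
    # P(majority of n independent voters is correct), each correct w.p. p; n odd.
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

single = 0.7                                      # accuracy of one randomized learner
ensemble_11 = majority_vote_accuracy(single, 11)  # about 0.92 for 11 voters
```

In practice the base learners' errors are correlated, so real gains are smaller, but the direction of the effect is the same.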
2.3 New directions in ensemble methods research
Ensemble methods are one of the fastest growing research topics in machine learning. Without pretending to be exhaustive, some new directions in ensemble method research are summarized here, emphasizing those topics most related to my research interests.
Ensemble methods have been developed in classification and regression settings, but there
are very few approaches proposed for unsupervised clustering problems. For instance, a
multiple k-means method combines multiple approximate k-means solutions to obtain a final set of cluster centers [59], and cluster ensembles based on partial sets of features
or multiple views of data have been applied to data mining problems and to structure
rules in a knowledge base [99, 136]. Recently an interesting new research direction has been proposed for unsupervised clustering problems [70, 171]. According to this approach, multiple partitionings of a set of objects, obtained from different clustering algorithms or from different instances of the same clustering algorithm, are combined without accessing the original input features, using only the cluster labels provided by the applied clusterers. Then the "optimal" partition labeling is selected as the one that maximizes the mutual information with respect to all the provided labelings. This approach should also make it possible to integrate both different clustering algorithms and views of the data, exploiting heterogeneous resources and data available in distributed environments.
Another research direction could be represented by ensemble methods specific for feature selection. Indeed, if only small sized samples are available, ensemble methods could provide robust estimates of the sets of features correlated with the output of a learning machine: several applications, for instance in bioinformatics, could take advantage of this approach.
Following the spirit of Breiman's random forests [19], we could use randomness at different levels to improve the performance of ensemble methods. For instance, we know that random selection of input samples combined with random selection of features improves the performance of random forests. This approach could in principle be extended to other base learners. Moreover, we could also extend this approach to other types of randomness, as the strong law of large numbers assures convergence without overfitting as the number of base learners increases [19]. For instance, we could design "forests", or, more appropriately in this context, nets of neural networks, exploring suitable ways to inject randomness in building ensembles, extending Breiman's original approach, which is "limited only" to random inputs and features.
Two main theories are invoked to explain the success of ensemble methods. The first one considers ensembles in the framework of large margin classifiers [129], showing that ensembles enlarge the margins, enhancing the generalization capabilities of learning algorithms [162, 2]. The second is based on the classical bias–variance decomposition of the error [68], and shows that ensembles can reduce variance [16] and also bias [114]. Recently Domingos proved that Schapire's notion of margins [162] can be expressed in terms of bias and variance and vice versa [49]; hence Schapire's bounds on an ensemble's generalization error can be equivalently expressed in terms of the distribution of the margins or in terms of the bias–variance decomposition of the error, showing the equivalence of margin-based and bias–variance-based approaches.
Despite these important results, most of the theoretical problems behind ensemble methods remain open, and more research work is needed to understand the characteristics and generalization capabilities of ensemble methods.
For instance, a substantially unexplored research field is represented by the analysis of the
relationships between ensemble methods and data complexity [126]. The papers of Tin
Kam Ho [82, 83, 84] represent a fundamental starting point to explore the relationships
between ensemble methods (and more generally learning algorithms) and data complexity
in order to characterize ensemble methods with respect to the specific properties of the
data. Extending this approach we could also try to design ensemble methods well-tuned
to the data characteristics, embedding analysis of data complexity and/or the evaluation
of the geometrical or topological data characteristics into the ensemble method itself. An
interesting step in this direction is represented by the research of Cohen and Intrator [31, 33]. Even if they use a single learning machine composed of heterogeneous radial and sigmoidal units to properly fit geometrical data characteristics, their approach can in principle be extended to heterogeneous ensembles of learning machines.
From a different standpoint we could also try to develop ensemble methods well-tuned to the characteristics of specific base learners. Usually ensemble methods have been
conceived quite independently of the characteristics of specific base learners, emphasizing
the combination scheme instead of the properties of the applied basic learning algorithm.
Hence, a promising research line could consist in characterizing the properties of a specific base learner and building around it an ensemble method well-tuned to the learning characteristics of the base learner itself. In this research direction, bias–variance analysis [47]
could in principle be used to characterize the properties of learning algorithms in order to
design ensemble methods well-tuned to the bias–variance characteristics of a specific base
learner [182].
Chapter 3
Bias–variance decomposition of the error
Our purpose is to evaluate whether bias–variance analysis can be used to characterize the behavior of learning algorithms and to tune the individual base classifiers so as to optimize the overall performance of the ensemble. To pursue these goals, we considered the different approaches and theories proposed in the literature; in particular, we propose a very general approach applicable to any loss function, and in particular to the 0/1 loss [47], as explained below in this chapter.
Historically, the bias–variance insight was borrowed from the field of regression, using squared loss as the loss function [68]. For classification problems, where the 0/1 loss is the
main criterion, several authors proposed bias–variance decompositions related to 0/1 loss.
Kong and Dietterich [114] proposed a bias–variance decomposition in the context of ECOC
ensembles [46], but their analysis is extensible to arbitrary classifiers, even if they defined
variance simply as a difference between loss and bias.
In Breiman's decomposition [16] bias and variance are always non-negative (while Dietterich's definition allows a negative variance), but at any input the reducible error (i.e. the total error rate less noise) is assigned entirely to variance if the classification is unbiased, and entirely to bias if it is biased. Moreover, he forced the decomposition to be purely additive, while for the 0/1 loss this is not the case. Kohavi and Wolpert's approach [112] produces a biased estimate of bias and variance, assigning a non-zero bias to the Bayes classifier, while Tibshirani [173] did not directly use the notion of variance, decomposing the 0/1 loss into bias and an unrelated quantity he called the "aggregation effect", which is similar to James' notion of variance effect [94].
Friedman [66] showed that in classification problems bias and variance are not purely additive: in some cases increasing the variance increases the error, but in other cases it can also reduce the error, especially when the prediction is biased.
Heskes [80] proposed a bias–variance decomposition using the Kullback–Leibler divergence as loss function. In this approach the error between the target and the predicted classifier densities is measured; however, when he tried to extend it to the zero-one loss, interpreted as the limit case of a log-likelihood type error, the resulting decomposition produced a definition of bias that loses its natural interpretation as the systematic error committed by the classifier.
As briefly outlined, these decompositions suffer from significant shortcomings: in particular, they lose the relationship to the original squared-loss decomposition, in most cases forcing bias and variance to be purely additive.
We consider classification problems and the 0/1 loss function in the Domingos’ unified
framework of bias–variance decomposition of the error [49, 48]. In this approach bias and
variance are defined for an arbitrary loss function, showing that the resulting decomposition
specializes to the standard one for squared loss, but it holds also for the 0/1 loss [49].
A similar approach has been proposed by James [94]: he extended the notions of variance and bias to general loss functions, also distinguishing between bias and variance, interpreted respectively as the systematic error and the variability of an estimator, and the actual effects of bias and variance on the error.
In the rest of this chapter we consider Domingos' and James' bias–variance theory, focusing on bias–variance for the 0/1 loss. Moreover, we show how to measure the bias and variance of the error in classification problems, suggesting different approaches for "large" and "small" data sets respectively.
3.1 Bias–Variance Decomposition for the 0/1 loss function
The analysis of the bias–variance decomposition of the error was originally developed in the standard regression setting, where the squared error is usually used as loss function. Considering a prediction $y = f(x)$ of an unknown target $t$, provided by a learner $f$ on input $x$, with $x \in \mathbb{R}^d$ and $y \in \mathbb{R}$, the classical decomposition of the error into bias and variance for the squared error loss is [68]:
$$E_{y,t}[(y - t)^2] = E_t[(t - E[t])^2] + E_y[(y - E[y])^2] + (E[y] - E[t])^2 = Noise(t) + Var(y) + Bias^2(y)$$
In words, the expected loss of using y to predict t is the sum of the variances of t (noise) and
y plus the squared bias. Ey [·] indicates the expected value with respect to the distribution
of the random variable y.
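The decomposition is easy to verify numerically. In the sketch below the "learner" is simply the sample mean of $m$ noisy observations of the target at a fixed $x$ (an assumption made to keep the example self-contained); the Monte Carlo estimate of the expected squared loss should match noise plus variance plus squared bias.

```python
import numpy as np

rng = np.random.default_rng(1)
true_value, noise_sd, m, trials = 2.0, 1.0, 5, 200_000

# Each "training set" is m noisy observations of the target at a fixed x;
# the learner predicts their mean.
samples = rng.normal(true_value, noise_sd, (trials, m))
y = samples.mean(axis=1)                       # predictions over training sets
t = rng.normal(true_value, noise_sd, trials)   # independent test targets

expected_loss = np.mean((y - t) ** 2)          # E[(y - t)^2], estimated
noise = noise_sd ** 2                          # Noise(t)
variance = y.var()                             # Var(y)
bias2 = (y.mean() - true_value) ** 2           # Bias^2(y)
decomposition = noise + variance + bias2
```

Here the true values are $Noise = 1$, $Var(y) = 1/m = 0.2$ and $Bias^2 \approx 0$, so both sides come out near 1.2.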
This decomposition cannot be automatically extended to the standard classification setting,
as in this context the 0/1 loss function is usually applied, and bias and variance are not
purely additive. As we are mainly interested in analyzing bias–variance for classification
problems, we introduce the bias–variance decomposition for the 0/1 loss function, according to Domingos' unified bias–variance decomposition of the error [48].
3.1.1 Expected loss depends on the randomness of the training set and the target
Consider a (potentially infinite) population U of labeled training data points, where each
point is a pair (xj , tj ), tj ∈ C, xj ∈ Rd , d ∈ N, where C is the set of the class labels. Let
P (x, t) be the joint distribution of the data points in U . Let D be a set of m points drawn
identically and independently from U according to P . We think of D as being the training
sample that we are given for training a classifier. We can view D as a random variable,
and we will let ED [·] indicate the expected value with respect to the distribution of D.
Let L be a learning algorithm, and define fD = L(D) as the classifier produced by L
applied to a training set D. The model produces a prediction fD (x) = y. Let L(t, y) be
the 0/1 loss function, that is L(t, y) = 0 if y = t, and L(t, y) = 1 otherwise.
Suppose we consider a fixed point x ∈ Rd . This point may appear in many labeled training points in the population. We can view the corresponding labels as being distributed
according to the conditional distribution P (t|x). Recall that it is always possible to factor
the joint distribution as P (x, t) = P (x)P (t|x). Let Et [·] indicate the expectation with
respect to t drawn according to P (t|x).
Suppose we consider a fixed predicted class y for a given x. This prediction will have an
expected loss of Et [L(t, y)]. In general, however, the prediction y is not fixed. Instead, it
is computed from a model fD which is in turn computed from a training sample D.
Hence, the expected loss EL of learning algorithm L at point x can be written by considering both the randomness due to the choice of the training set D and the randomness in
t due to the choice of a particular test point (x, t):
$$EL(L, x) = E_D[E_t[L(t, f_D(x))]]$$
where fD = L(D) is the classifier learned by L on training data D. The purpose of the
bias-variance analysis is to decompose this expected loss into terms that separate the bias
and the variance.
3.1.2 Optimal and main prediction.
To derive this decomposition, we must define two quantities: the optimal prediction and the main prediction; according to Domingos, bias and variance can be defined in terms of them.
The optimal prediction $y_*$ for point $x$ minimizes $E_t[L(t, y)]$:
$$y_*(x) = \arg\min_{y} E_t[L(t, y)] \quad (3.1)$$
It is equal to the label $t$ that is observed most often in the universe $U$ of data points. The optimal model $\hat{f}(x) = y_*, \forall x$, makes the optimal prediction at each point $x$. The noise $N(x)$ is defined in terms of the optimal prediction, and represents the remaining loss that cannot be eliminated, even by the optimal prediction:
$$N(x) = E_t[L(t, y_*)]$$
Note that in the deterministic case $y_*(x) = t$ and $N(x) = 0$.
The main prediction $y_m$ at point $x$ is defined as
$$y_m = \arg\min_{y'} E_D[L(f_D(x), y')] \quad (3.2)$$
This is the value that would give the lowest expected loss if it were the "true label" of $x$. It expresses the "central tendency" of a learner, that is, its systematic prediction; in other words, it is the label for $x$ that the learning algorithm "wishes" were correct. For the 0/1 loss, the main prediction is the class predicted most often by the learning algorithm $L$ when applied to training sets $D$.
3.1.3 Bias, unbiased and biased variance.
Given these definitions, the bias $B(x)$ (of learning algorithm $L$ on training sets of size $m$) is the loss of the main prediction relative to the optimal prediction:
$$B(x) = L(y_*, y_m)$$
For 0/1 loss, the bias is always 0 or 1. We will say that $L$ is biased at point $x$ if $B(x) = 1$.
The variance $V(x)$ is the average loss of the predictions relative to the main prediction:
$$V(x) = E_D[L(y_m, f_D(x))] \quad (3.3)$$
It captures the extent to which the various predictions $f_D(x)$ vary depending on $D$.
In the case of the 0/1 loss we can also distinguish two opposite effects of variance (and
noise) on the error: in the unbiased case variance and noise increase the error, while in the
biased case variance and noise decrease the error.
There are three components that determine whether $t = y$:
1. Noise: is $t = y_*$?
2. Bias: is $y_* = y_m$?
3. Variance: is $y_m = y$?
Note that bias is either 0 or 1 because neither $y_*$ nor $y_m$ is a random variable. From this standpoint we can consider two different cases: the unbiased and the biased case.
In the unbiased case, $B(x) = 0$ and hence $y_* = y_m$. In this case we suffer a loss if the prediction $y$ differs from the main prediction $y_m$ (variance) and the optimal prediction $y_*$ is equal to the target $t$, or if $y$ is equal to $y_m$ but $y_*$ differs from $t$ (noise).
In the biased case, $B(x) = 1$ and hence $y_* \neq y_m$. In this case we suffer a loss if the prediction $y$ is equal to the main prediction $y_m$ and the optimal prediction $y_*$ is equal to the target $t$, or if both $y$ differs from $y_m$ (variance) and $y_*$ differs from $t$ (noise).
Fig. 3.1 summarizes the different conditions under which an error can arise, considering
the combined effect of bias, variance and noise on the learner prediction.
Considering the above case analysis of the error, if we let $P(t \neq y_*) = N(x) = \tau$ and $P(y_m \neq y) = V(x) = \sigma$, in the unbiased case we have:
$$L(t, y) = \tau(1 - \sigma) + \sigma(1 - \tau) = \tau + \sigma - 2\tau\sigma = N(x) + V(x) - 2N(x)V(x) \quad (3.4)$$
while, in the biased case:
$$L(t, y) = \tau\sigma + (1 - \tau)(1 - \sigma) = 1 - (\tau + \sigma - 2\tau\sigma) = B(x) - (N(x) + V(x) - 2N(x)V(x)) \quad (3.5)$$
Note that in the unbiased case (eq. 3.4) the variance is an additive term of the loss function, while in the biased case (eq. 3.5) the variance is a subtractive term of the loss function. Moreover, the interaction term $2\tau\sigma$ will usually be small: for instance, if both the noise and the variance are lower than 0.1, the interaction term $2N(x)V(x)$ is lower than 0.02.
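A quick spot check of eqs. 3.4 and 3.5 with the illustrative values $\tau = \sigma = 0.1$ just mentioned:

```python
tau, sigma = 0.1, 0.1   # noise N(x) and variance V(x)

# Eq. 3.4 (unbiased case): loss written via its two probability terms,
# which must agree with the simplified additive form.
unbiased_loss = tau * (1 - sigma) + sigma * (1 - tau)
assert abs(unbiased_loss - (tau + sigma - 2 * tau * sigma)) < 1e-12

# Eq. 3.5 (biased case, B(x) = 1): noise and variance now reduce the loss.
biased_loss = tau * sigma + (1 - tau) * (1 - sigma)
interaction = 2 * tau * sigma   # the small interaction term, here 0.02
```

Here the unbiased loss is 0.18 and the biased loss is 0.82, confirming the additive and subtractive roles of variance in the two cases.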
Figure 3.1: Case analysis of error. [Figure: a decision tree over the three questions $y_* = y_m$? (bias), $y_m = y$? (variance) and $t = y_*$? (noise), indicating for each combination of answers whether the prediction is correct or erroneous and which effect is responsible: bias, variance, noise, or one effect cancelling another (noise cancels variance, noise or variance cancels bias, noise cancels variance cancels bias).]
In order to distinguish between these two different effects of the variance on the loss function, Domingos defines the unbiased variance, $V_u(x)$, to be the variance when $B(x) = 0$, and the biased variance, $V_b(x)$, to be the variance when $B(x) = 1$. We can also define the net variance $V_n(x)$ to take into account the combined effect of the unbiased and biased variance:
$$V_n(x) = V_u(x) - V_b(x)$$
Fig. 3.2 summarizes in graphic form the opposite effects of biased and unbiased variance
on the error.
If we can disregard the noise, the unbiased variance captures the extent to which the learner deviates from the correct prediction $y_m$ (in the unbiased case $y_m = y_*$), while the biased variance captures the extent to which the learner deviates from the incorrect prediction $y_m$ (in the biased case $y_m \neq y_*$).
More precisely, for the two-class classification problem, with $N(x) = 0$, in the two cases we have:
1. If $B(x) = 0$, then $p_{corr}(x) > 0.5 \Rightarrow V_u(x) = 1 - p_{corr}(x)$.
2. If $B(x) = 1$, then $p_{corr}(x) \leq 0.5 \Rightarrow V_b(x) = p_{corr}(x)$,
where $p_{corr}$ is the probability that a prediction is correct: $p_{corr}(x) = P(y = t|x)$.
Figure 3.2: Effects of biased and unbiased variance on the error. The unbiased variance increments, while the biased variance decrements the error. [Figure: error as a function of the variance; on unbiased points the error grows linearly with the variance, on biased points it decreases linearly, the two lines crossing at variance 0.5.]
In fact, in the unbiased case:
$$p_{corr}(x) > 0.5 \Rightarrow y_m = t \Rightarrow P(y = y_m|x) = p_{corr} \Rightarrow P(y \neq y_m) = 1 - p_{corr} \Rightarrow E_D[L(y_m, y)] = V(x) = 1 - p_{corr}$$
Hence the variance $V(x) = V_u(x) = 1 - p_{corr}$ is given by the probability of an incorrect prediction, or equivalently expresses the deviation from the correct prediction.
In the biased case:
$$p_{corr}(x) \leq 0.5 \Rightarrow y_m \neq t \Rightarrow P(y = y_m|x) = 1 - p_{corr} \Rightarrow P(y \neq y_m) = p_{corr} \Rightarrow E_D[L(y_m, y)] = V(x) = p_{corr}$$
Hence the variance $V(x) = V_b(x) = p_{corr}$ is given by the probability of a correct prediction, or equivalently expresses the deviation from the incorrect prediction.
3.1.4 Domingos' bias–variance decomposition.
For quite general loss functions $L$, Domingos [47] showed that the expected loss is:
$$EL(L, x) = c_1 N(x) + B(x) + c_2 V(x) \quad (3.6)$$
For the 0/1 loss, $c_1$ is $2 P_D(f_D(x) = y_*) - 1$ and $c_2$ is $+1$ if $B(x) = 0$ and $-1$ if $B(x) = 1$. Note that $c_2 V(x) = V_u(x) - V_b(x) = V_n(x)$ (eq. 3.3), and if we disregard the noise, eq. 3.6 can be simplified to:
$$EL(L, x) = B(x) + V_n(x) \quad (3.7)$$
Summarizing, one of the most interesting aspects of Domingos' decomposition is that variance hurts on unbiased points $x$, but helps on biased points. Nonetheless, to obtain a low overall expected loss, we want the bias to be small, and hence we seek to reduce both the bias and the unbiased variance. A good classifier will have low bias, in which case the expected loss will approximately equal the variance.
This decomposition for a single point x can be generalized to the entire population by
defining Ex [·] to be the expectation with respect to P (x). Then we can define the average
bias Ex [B(x)], the average unbiased variance Ex [Vu (x)], and the average biased variance
Ex [Vb (x)]. In the noise-free case, the expected loss over the entire population is
$$E_x[EL(L, x)] = E_x[B(x)] + E_x[V_u(x)] - E_x[V_b(x)]$$
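In the noise-free two-class case this identity can be checked point by point (the population version then follows by averaging over $x$). A minimal sketch, with illustrative values of $p_{corr}$:

```python
def decompose(pcorr):
    # Noise-free two-class point x with pcorr = P(y = t | x).
    B = 1.0 if pcorr <= 0.5 else 0.0          # bias: is the main prediction wrong?
    V = pcorr if B == 1.0 else 1.0 - pcorr    # variance: deviation from y_m
    Vu, Vb = (V, 0.0) if B == 0.0 else (0.0, V)
    return B, Vu, Vb

# Expected 0/1 loss at x is P(y != t) = 1 - pcorr; it must equal B + Vu - Vb.
checks = [(1 - p, decompose(p)) for p in (0.9, 0.6, 0.5, 0.3)]
```

For instance, at $p_{corr} = 0.3$ the point is biased, $V_b = 0.3$, and the loss $0.7 = 1 + 0 - 0.3$: the biased variance subtracts from the error.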
3.1.5 Bias, variance and their effects on the error
James [94] provides definitions of bias and variance that are identical to those provided by Domingos. Indeed, his definitions of bias and variance are based on quantities that he named the systematic part $sy$ of $y$ and the systematic part $st$ of $t$; these correspond respectively to Domingos' main prediction (eq. 3.2) and optimal prediction (eq. 3.1).
Moreover, James distinguishes between bias and variance on the one hand, and systematic and variance effects on the other. Bias and variance capture respectively the difference between the systematic parts of $y$ and $t$, and the variability of the estimate $y$. The systematic effect $SE$ represents the change in error when predicting $t$ using $sy$ instead of $st$, and the variance effect $VE$ the change in prediction error when using $y$ instead of $sy$ to predict $t$.
Using Domingos' notation ($y_m$ for $sy$, and $y_*$ for $st$) the variance effect is:
$$VE(y, t) = E_{y,t}[L(y, t)] - E_t[L(t, y_m)]$$
while the systematic effect corresponds to:
$$SE(y, t) = E_t[L(t, y_m)] - E_t[L(t, y_*)]$$
In other words the systematic effect represents the change in prediction error caused by
bias, while the variance effect the change in prediction error caused by variance.
While for the squared loss the two sets of bias–variance definitions match, for general loss functions the identity does not hold. In particular, for the 0/1 loss James proposes the following definitions of noise, variance and bias:
$$N(x) = P(t \neq y_*) \qquad V(x) = P(y \neq y_m) \qquad B(x) = I(y_* \neq y_m) \quad (3.8)$$
where $I(z)$ is 1 if $z$ is true and 0 otherwise.
The variance effect for the 0/1 loss can be expressed in the following way:
$$VE(y, t) = E_{y,t}[L(y, t) - L(t, y_m)] = P_{y,t}(y \neq t) - P_t(t \neq y_m) = 1 - P_{y,t}(y = t) - (1 - P_t(t = y_m)) = P_t(t = y_m) - P_{y,t}(y = t) \quad (3.9)$$
while the systematic effect is:
$$SE(y, t) = E_t[L(t, y_m)] - E_t[L(t, y_*)] = P_t(t \neq y_m) - P_t(t \neq y_*) = 1 - P_t(t = y_m) - (1 - P_t(t = y_*)) = P_t(t = y_*) - P_t(t = y_m) \quad (3.10)$$
If we let $N(x) = 0$, considering eq. 3.7, eq. 3.8 and eq. 3.9, the variance effect becomes:
$$VE(y, t) = P_t(t = y_m) - P_{y,t}(y = t) = P(y_* = y_m) - P_y(y = y_*) = 1 - P(y_* \neq y_m) - (1 - P_y(y \neq y_*)) = 1 - B(x) - (1 - EL(L, x)) = EL(L, x) - B(x) = V_n(x) \quad (3.11)$$
while from eq. 3.8 and eq. 3.10 the systematic effect becomes:
$$SE(y, t) = P_t(t = y_*) - P_t(t = y_m) = 1 - P_t(t \neq y_*) - (1 - P_t(t \neq y_m)) = P(y_* \neq y_m) = I(y_* \neq y_m) = B(x) \quad (3.12)$$
Hence if $N(x) = 0$, the variance effect is equal to the net variance (eq. 3.11), and the systematic effect is equal to the bias (eq. 3.12).
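The systematic and variance effects can be computed directly from the two distributions at a single point $x$; the numbers below are made up for illustration and include noise. Note that $VE$ comes out negative here: this is a biased point, where variance reduces the error, in agreement with the earlier case analysis.

```python
p_t = {0: 0.8, 1: 0.2}   # P(t|x): class 0 is optimal, so N(x) = 0.2
p_y = {0: 0.3, 1: 0.7}   # P(y|x): the learner predicts class 1 most often

y_star = max(p_t, key=p_t.get)   # optimal prediction
y_m = max(p_y, key=p_y.get)      # main prediction (mode of y)

p_y_eq_t = sum(p_y[c] * p_t[c] for c in p_t)   # P(y = t), y and t independent
VE = p_t[y_m] - p_y_eq_t          # variance effect, eq. 3.9
SE = p_t[y_star] - p_t[y_m]       # systematic effect, eq. 3.10
noise = 1 - p_t[y_star]           # N(x)
expected_loss = 1 - p_y_eq_t      # P(y != t)
```

The three terms telescope: `expected_loss = noise + SE + VE` (here 0.62 = 0.2 + 0.6 − 0.18).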
3.2 Measuring bias and variance
The procedures to measure bias and variance depend on the characteristics and on the
cardinality of the data sets used.
For synthetic data sets we can generate different sets of training data for each learner
to be trained. Then a large synthetic test set can be generated in order to estimate the
bias–variance decomposition of the error for a specific learner model.
Similarly, if a large data set is available, we can split it into a large learning set and a large testing set. Then we can randomly draw subsets of data from the large training set
in order to train the learners; bias–variance decomposition of the error is measured on the
large independent test set.
However, in practice, for real data we usually have at our disposal only one, often small, data set. In this case we can use cross-validation techniques to estimate the bias–variance decomposition, but we propose to use out-of-bag [19] estimation procedures, as they are computationally less expensive.
3.2.1 Measuring with artificial or large benchmark data sets
Consider a set D = {D_i}_{i=1}^n of learning sets D_i = {x_j, t_j}_{j=1}^m. Here we consider only the two-class case, i.e. t_j ∈ C = {−1, 1}, x_j ∈ X, for instance X = R^d, d ∈ N, but the extension to the multiclass case is straightforward.
We define fDi = L(Di ) as the model fDi produced by a learner L using a training set Di .
The model produces a prediction fDi (x) = y.
In presence of noise and with the 0/1 loss, the optimal prediction y∗ is equal to the label
t that is observed more often in the universe U of data points:
y*(x) = arg max_{t∈C} P(t|x)
The noise N (x) for the 0/1 loss can be estimated if we can evaluate the probability of the
targets for a given example x:
N(x) = Σ_{t∈C} L(t, y*) P(t|x) = Σ_{t∈C} ||t ≠ y*|| P(t|x)

where ||z|| = 1 if z is true, and 0 otherwise.
In practice, for "real world" data sets it is difficult to estimate the noise, so to simplify the computation we consider the noise-free case. In this situation we have y* = t.
The main prediction is a function of the predictions y = f_{D_i}(x). Considering a 0/1 loss, we have

y_m = arg max(p_1, p_{−1})

where p_1 = P_D(y = 1|x) and p_{−1} = P_D(y = −1|x), i.e. the main prediction is the mode.
To calculate p_1, having a test set T = {x_j, t_j}_{j=1}^r, it is sufficient to count the number of learners that predict class 1 on a given input x:

p_1(x_j) = (1/n) Σ_{i=1}^n ||f_{D_i}(x_j) = 1||

where, as above, ||z|| = 1 if z is true and 0 otherwise.
The bias can be easily calculated after the evaluation of the main prediction:
B(x) = |(y_m − t)/2| = { 1 if y_m ≠ t
                       { 0 if y_m = t        (3.13)
or equivalently:
B(x) = { 1 if p_corr(x) ≤ 0.5
       { 0 otherwise
where pcorr is the probability that a prediction is correct, i.e. pcorr (x) = P (y = t|x) =
PD (fD (x) = t).
In order to measure the variance V (x), if we define yDi = fDi (x), we have:
V(x) = (1/n) Σ_{i=1}^n L(y_m, y_{D_i}) = (1/n) Σ_{i=1}^n ||y_m ≠ y_{D_i}||
The unbiased variance V_u(x) and the biased variance V_b(x) can be calculated by evaluating whether the prediction of each learner differs from the main prediction in, respectively, the unbiased and the biased case:
V_u(x) = (1/n) Σ_{i=1}^n ||(y_m = t) and (y_m ≠ y_{D_i})||

V_b(x) = (1/n) Σ_{i=1}^n ||(y_m ≠ t) and (y_m ≠ y_{D_i})||
In the noise-free case, the average loss E_D(x) on the example x is a simple algebraic sum of the bias, unbiased and biased variance:

E_D(x) = B(x) + V_u(x) − V_b(x) = B(x) + (1 − 2B(x))V(x)
In order to evaluate the bias–variance decomposition on the entire set of examples, consider a test set T = {x_j, t_j}_{j=1}^r. We can easily calculate the average bias, variance, unbiased, biased and net variance, averaging over the entire set of examples:
Average bias:

E_x[B(x)] = (1/r) Σ_{j=1}^r B(x_j) = (1/r) Σ_{j=1}^r |(y_m(x_j) − t_j)/2|
Average variance:

E_x[V(x)] = (1/r) Σ_{j=1}^r V(x_j)
          = (1/(nr)) Σ_{j=1}^r Σ_{i=1}^n L(y_m(x_j), f_{D_i}(x_j))
          = (1/(nr)) Σ_{j=1}^r Σ_{i=1}^n ||y_m(x_j) ≠ f_{D_i}(x_j)||
Average unbiased variance:

E_x[V_u(x)] = (1/r) Σ_{j=1}^r V_u(x_j) = (1/(nr)) Σ_{j=1}^r Σ_{i=1}^n ||(y_m(x_j) = t_j) and (y_m(x_j) ≠ f_{D_i}(x_j))||
Average biased variance:

E_x[V_b(x)] = (1/r) Σ_{j=1}^r V_b(x_j) = (1/(nr)) Σ_{j=1}^r Σ_{i=1}^n ||(y_m(x_j) ≠ t_j) and (y_m(x_j) ≠ f_{D_i}(x_j))||
Average net variance:

E_x[V_n(x)] = (1/r) Σ_{j=1}^r V_n(x_j) = (1/r) Σ_{j=1}^r (V_u(x_j) − V_b(x_j))
Finally, the average loss on all the examples (with no noise) is the algebraic sum of the
average bias, unbiased and biased variance:
Ex [L(t, y)] = Ex [B(x)] + Ex [Vu (x)] − Ex [Vb (x)]
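The whole measurement procedure above can be condensed into a short function. The following is an illustrative sketch (not the software used in the thesis), assuming an n × r matrix of ±1 predictions, one row per trained learner:

```python
import numpy as np

def bias_variance_01(preds, targets):
    """Bias-variance decomposition of the 0/1 loss (noise-free case).

    preds:   (n, r) array, preds[i, j] in {-1, +1} is the prediction of
             the i-th learner on the j-th test example.
    targets: (r,) array of true labels in {-1, +1}.
    Returns average error, bias, net, unbiased and biased variance.
    """
    # main prediction: the per-example mode (majority vote)
    p1 = (preds == 1).mean(axis=0)
    ym = np.where(p1 >= 0.5, 1, -1)
    # per-example bias: 1 where the main prediction is wrong
    B = (ym != targets).astype(float)
    # per-example variance: fraction of learners disagreeing with ym
    V = (preds != ym).mean(axis=0)
    Vu = np.where(B == 0, V, 0.0)      # unbiased variance
    Vb = np.where(B == 1, V, 0.0)      # biased variance
    Vn = Vu - Vb                       # net variance
    err = (preds != targets).mean()    # average 0/1 loss
    return err, B.mean(), Vn.mean(), Vu.mean(), Vb.mean()
```

In the two-class, noise-free case the returned quantities satisfy the identity above: the average error equals average bias plus average unbiased variance minus average biased variance.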
3.2.2 Measuring with small data sets
In practice (unlike in theory), we have only one, and often small, data set S. We can simulate multiple training sets by bootstrap replicates S′ = {x | x is drawn at random with replacement from S}. In order to measure bias and variance we can use the out-of-bag points, thereby providing an unbiased estimate of the error.
At first we need to construct B bootstrap replicates of S (e.g., B = 200): S_1, ..., S_B. Then we apply a learning algorithm L to each replicate S_b to obtain the hypotheses f_b = L(S_b). Let T_b = S\S_b be the data points that do not appear in S_b (out-of-bag points). We can use these data sets T_b to evaluate the bias–variance decomposition of the error; that is, we compute the predicted values f_b(x), ∀x s.t. x ∈ T_b.
For each data point x, we will now have the observed corresponding value t and several predictions y_1, ..., y_K, where K = |{T_b | x ∈ T_b, 1 ≤ b ≤ B}| depends on x, K ≤ B, and on average K ≃ B/3, because about 1/3 of the predictors are not trained on a specific input x.
In order to compute the main prediction, for a two-class classification problem, we can define:

p_1(x) = (1/K) Σ_{b=1}^B ||(x ∈ T_b) and (f_b(x) = 1)||

p_{−1}(x) = (1/K) Σ_{b=1}^B ||(x ∈ T_b) and (f_b(x) = −1)||
The main prediction ym (x) corresponds to the mode:
ym = arg max(p1 , p−1 )
The bias can be calculated as in eq. 3.13, and the variance V(x) is:

V(x) = (1/K) Σ_{b=1}^B ||(x ∈ T_b) and (y_m ≠ f_b(x))||
The unbiased, biased and net variance can be computed similarly:
V_u(x) = (1/K) Σ_{b=1}^B ||(x ∈ T_b) and (B(x) = 0) and (y_m ≠ f_b(x))||

V_b(x) = (1/K) Σ_{b=1}^B ||(x ∈ T_b) and (B(x) = 1) and (y_m ≠ f_b(x))||

V_n(x) = V_u(x) − V_b(x)
Average bias, variance, unbiased, biased and net variance can be easily calculated by averaging over all the examples.
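A sketch of this out-of-bag procedure follows; it is illustrative only, and train_fn is a hypothetical placeholder for any base learner (a callable returning a prediction function):

```python
import numpy as np

def oob_bias_variance(X, y, train_fn, B=200, rng=None):
    """Out-of-bag bias-variance estimate (two-class, noise-free, 0/1 loss).

    train_fn(Xb, yb) must return a function mapping inputs to labels
    in {-1, +1}; it stands in for the actual learning algorithm.
    """
    rng = np.random.default_rng(rng)
    m = len(y)
    # votes[j] collects the predictions f_b(x_j) for replicates
    # in which x_j is out of bag
    votes = [[] for _ in range(m)]
    for _ in range(B):
        idx = rng.integers(0, m, size=m)        # bootstrap replicate S_b
        oob = np.setdiff1d(np.arange(m), idx)   # out-of-bag points T_b
        if len(oob) == 0:
            continue
        predict = train_fn(X[idx], y[idx])
        for j, p in zip(oob, predict(X[oob])):
            votes[j].append(p)
    bias = vu = vb = 0.0
    counted = 0
    for j, vs in enumerate(votes):
        if not vs:                               # never out of bag
            continue
        vs = np.asarray(vs)
        K = len(vs)
        ym = 1 if (vs == 1).sum() >= K / 2 else -1   # main prediction
        Bx = float(ym != y[j])
        Vx = (vs != ym).mean()
        bias += Bx
        vu += Vx if Bx == 0 else 0.0
        vb += Vx if Bx == 1 else 0.0
        counted += 1
    return bias / counted, vu / counted, vb / counted
```

Each point is out of bag in roughly B/3 replicates, so the per-point estimates are averaged over K ≃ B/3 predictions, as discussed above.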
Chapter 4

Bias–Variance Analysis in single SVMs
The bias–variance decomposition of the error represents a powerful tool to analyze learning processes in learning machines. According to the procedures described in the previous chapter, we analyzed bias and variance in SVMs, in order to study their relationships with different kernel types and kernel parameters. To accomplish this task we computed the bias–variance decomposition of the error on different synthetic and "real" data sets.
4.1 Experimental setup
We performed an extended bias–variance analysis of the error in Support Vector Machines, training and testing more than half a million different SVMs on different training and test sets.
4.1.1 Data sets
In the experiments we employed 7 different data sets, both synthetic and "real".
P2 is a synthetic bidimensional two-class data set; each region is delimited by one or more of four simple polynomial and trigonometric functions (Fig. 4.1).
The synthetic data set Waveform is generated from a combination of 2 of 3 "base" waves; we reduced the original three classes of Waveform to two, deleting all samples pertaining to class 0. The other data sets are all from the UCI repository [135].
Tab. 4.1 summarizes the main features of the data sets used in the experiments. The rest
of this section explains in more detail the characteristics of the data sets.
Table 4.1: Data sets used in the experiments.

Data set          # of attr.   # of tr. samples   # of tr. sets   base tr. set   # of test samples
P2                     2             100               400          synthetic         10000
Waveform              21             100               200          synthetic         10000
Grey-Landsat          36             100               200            4425             2000
Letter                16             100               200             614              613
Letter w. noise       16             100               200             614              613
Spam                  57             100               200            2301             2300
Musk                 166             100               200            3299             3299

4.1.1.1 P2
We used a synthetic bidimensional two-class data set (Fig. 4.1). Each region, delimited by one or more of four simple polynomial and trigonometric functions, belongs to one of the two classes, labeled by the Roman numerals I and II. We generated a series of 400 training sets of 100 independent examples each, randomly extracted according to a uniform probability distribution. The test set (10000 examples) was generated from the same distribution. The application gensimple, which we developed to generate the data, is freely available on line at ftp://ftp.disi.unige.it/person/ValentiniG/BV/gensimple.
4.1.1.2 Waveform
It is a synthetic data set from the UCI repository. Each class is generated from a combination of 2 of 3 "base" waves. Using the application waveform we can generate an arbitrary number of samples from the same distribution. We reduced the original three classes of Waveform to two, deleting all samples pertaining to class 0.
4.1.1.3 Grey-Landsat
It is a data set from the UCI repository, modified in order to be available for a dichotomic
classification problem. The attributes represent intensity values for four spectral bands
Figure 4.1: P2 data set, a bidimensional two class synthetic data set.
and nine neighbouring pixels, while the classification refers to the central pixel. Hence we
have 9 data values for each spectral band for a total of 36 data attributes for each pattern.
The data come from a rectangular area approximately five miles wide. The original data set Landsat (available from the UCI repository) is a 6-way classification data set with 36 attributes. Following Scott and Langdon [125], classes 3, 4 and 7 were combined into one (positive gray), while 1, 2 and 5 became the negative examples (not-Gray).
4.1.1.4 Letter-Two
It is a reduced version of the Letter data set from UCI: we consider here only letter B versus letter R, taken from the letter recognition data set. The 16 attributes are integer values that refer to different features of the letters. We also used a version of Letter-Two with 20% added classification noise (the Letter-Two with added noise data set).
4.1.1.5 Spam
This data set from UCI separates "spam" e-mails from "non-spam" e-mails, considering mainly attributes that indicate whether a particular word or character occurs frequently in the e-mail. Of course, the concept of spam is somewhat subjective: in particular, the creators of this data set selected non-spam e-mails from filed work and personal e-mails, while the collection of spam e-mails came from their postmaster and individuals who
had filed spam. However, we have a relatively large data set with 4601 instances and 57 continuous attributes.
4.1.1.6 Musk
The data set (available from UCI) describes a set of 102 molecules of which 39 are judged by human experts to be musks and the remaining 63 molecules are judged to be non-musks. The 166 features that describe these molecules depend upon the exact shape, or conformation, of the molecule. Because bonds can rotate, a single molecule can adopt many different shapes. To generate this data set, all the low-energy conformations of the molecules were generated, producing 6,598 conformations. Then, a feature vector was extracted to describe each conformation.
In these experiments the data set was used as a normal data set, directly considering the different conformations of the same molecule as different instances. As a consequence, each feature vector represents a different example to be classified, and the classifier does not label a molecule as "musk" whenever any of its conformations is classified as a musk. In other words, we used the data set without considering the many-to-one relationship between feature vectors and molecules that characterizes the "multiple instance problem".
4.1.2 Experimental tasks
In order to perform a reliable evaluation of bias and variance we used small training sets and large test sets. For synthetic data we generated the desired number of samples. For real data sets we used bootstrapping to replicate the data. In both cases we computed the main prediction, bias, unbiased and biased variance, and net-variance according to the procedures explained in Sect. 3.2.1. In our experiments, the computation of the variance effect and the systematic effect reduces to the measurement of the net-variance and the bias, as we did not explicitly consider the noise (eq. 3.11 and 3.12).
4.1.2.1 Set up of the data
With synthetic data sets, we generated small training sets of about 100 examples and reasonably large test sets using computer programs. In fact, small samples show bias and variance more clearly than larger samples. We produced 400 different training sets for P2 and 200 training sets for Waveform. The test sets were chosen reasonably large (10000 examples) to obtain reliable estimates of bias and variance.
For real data sets we first divided the data into a training set D and a test set T. If a data set had reasonably large predefined training and test sets, we used them (as in Grey-Landsat and Spam); otherwise we split the data into a training and a test set of equal size. Then we drew bootstrap samples from D. We chose bootstrap samples much smaller than |D| (100 examples). More precisely, we drew 200 data sets from D, each one consisting of 100 examples uniformly drawn with replacement.
Fig. 4.2 outlines the experimental procedure we adopted for setting up the data and Fig. 4.3
the experimental procedure to evaluate bias–variance decomposition of the error.
Procedure Generate samples
Input arguments:
- Data set S
- Number n of samples
- Size s of the samples
Output:
- Set D̄ = {D_i}_{i=1}^n of samples
begin procedure
[D, T ] = Split(S)
D̄ = ∅
for i = 1 to n
begin
Di = Draw with replacement(D, s)
D̄ = D̄ + Di
end
end procedure.
Figure 4.2: Procedure to generate samples to be used for bias–variance analysis with single
SVMs
Samples D_i are drawn with replacement according to a uniform probability distribution from the training set D by the procedure Draw with replacement. This process is repeated n times (procedure Generate samples, Fig. 4.2). Then the procedure Bias–Variance analysis (Fig. 4.3) trains different SVM models, according to the different learning parameters α provided to the procedure svm train. SVM Set(α) is the set of the SVMs trained using the same learning parameter α and the set D̄ of samples generated by the procedure Generate samples.
The bias–variance decomposition of the error is performed on the separated test set T
using the previously trained SVMs (procedure Perform BV analysis).
Procedure Bias–Variance analysis
Input arguments:
- Test set T
- Number of samples n
- Set of learning parameters A
- Set D̄ = {D_i}_{i=1}^n of samples
Output:
- Error, bias, net-variance, unbiased and biased variance BV = {bv(α)}α∈A
of the SVMs with learning parameters α ∈ A.
begin procedure
For each α ∈ A
begin
SVM Set(α) = ∅
for i = 1 to n
begin
svm(α, Di ) = svm train (α, Di )
SVM Set(α) = SVM Set(α) ∪ svm(α, Di )
end
bv(α) = Perform BV analysis(SVM Set (α), T )
BV = BV ∪ bv(α)
end
end procedure.
Figure 4.3: Procedure to perform bias–variance analysis on single SVMs
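The two procedures of Fig. 4.2 and 4.3 can be rendered as a short Python sketch; here svm_train and perform_bv_analysis are hypothetical placeholders for the actual SVM-light training call and the bias–variance analysis step:

```python
import random

def generate_samples(D, n, s, seed=0):
    """Procedure Generate samples: draw n samples of size s
    with replacement from the training set D."""
    rnd = random.Random(seed)
    return [[rnd.choice(D) for _ in range(s)] for _ in range(n)]

def bias_variance_analysis(T, samples, params, svm_train, perform_bv_analysis):
    """Procedure Bias-Variance analysis: for each learning parameter
    alpha, train one SVM per sample, then decompose the error of the
    whole SVM set on the separate test set T."""
    results = {}
    for alpha in params:
        svm_set = [svm_train(alpha, Di) for Di in samples]
        results[alpha] = perform_bv_analysis(svm_set, T)
    return results
```

The design mirrors the pseudocode: the samples are generated once, and the same set D̄ is reused for every learning parameter α so that parameter settings are compared on identical training data.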
4.1.2.2 Tasks
To evaluate bias and variance in SVMs we conducted experiments with different kernels
and different kernel parameters.
In particular we considered 3 different SVM kernels:
1. Gaussian kernels. We evaluated the bias–variance decomposition varying the parameter σ of the kernel and the parameter C that controls the trade-off between the training error and the margin. In particular we analyzed:
(a) The relationships between average error, bias, net–variance, unbiased and biased
variance and the parameter σ of the kernel.
(b) The relationships between average error, bias, net–variance, unbiased and biased
variance and the parameter C (the regularization factor) of the kernel.
(c) The relationships between generalization error, training error, number of support vectors and capacity with respect to σ.
We trained RBF-SVMs with all the combinations of the parameters σ and C, taken from the following two sets:

σ ∈ {0.01, 0.02, 0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50, 100, 200, 300, 400, 500, 1000}
C ∈ {0.01, 0.1, 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000}

evaluating in this way 17 × 12 = 204 different RBF-SVM models for each data set.
2. Polynomial kernels. We evaluated the bias–variance decomposition varying the degree of the kernel and the parameter C that controls the trade-off between the training error and the margin. In particular we analyzed:
(a) The relationships between average error, bias, net–variance, unbiased and biased
variance and the degree of the kernel.
(b) The relationships between average error, bias, net–variance, unbiased and biased
variance and the parameter C (the regularization factor) of the kernel.
We trained polynomial-SVMs with all the combinations of the parameters:

degree ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
C ∈ {0.01, 0.1, 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000}

evaluating in this way 10 × 12 = 120 different polynomial-SVM models for each data set. Following Jaakkola's heuristic, the dot product of the polynomial kernel was divided by the dimension of the input data, to "normalize" it before raising it to the degree of the polynomial.
3. Dot-product kernels. We evaluated the bias–variance decomposition varying the C parameter. We analyzed the relationships between average error, bias, net-variance, unbiased and biased variance and the parameter C (the regularization factor) of the kernel. We trained dot-product-SVMs considering the following values for the C parameter:

C ∈ {0.01, 0.1, 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000}

evaluating in this way 12 different dot-product-SVM models for each data set.
Each SVM model required the training of 200 different SVMs, one for each synthesized or
bootstrapped data set, for a total of (204 + 120 + 12) × 200 = 67200 trained SVMs for each
data set (134400 for the data set P2, as for this data set we used 400 data sets for each
model).
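The model counts follow directly from the parameter grids; as a quick sanity check of the arithmetic (illustrative only):

```python
# Parameter grids used for the three kernels (copied from the lists above)
sigmas  = [0.01, 0.02, 0.1, 0.2, 0.5, 1, 2, 5, 10, 20,
           50, 100, 200, 300, 400, 500, 1000]
Cs      = [0.01, 0.1, 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000]
degrees = list(range(1, 11))

rbf_models  = len(sigmas) * len(Cs)    # 17 * 12 = 204 RBF-SVM models
poly_models = len(degrees) * len(Cs)   # 10 * 12 = 120 polynomial models
dot_models  = len(Cs)                  # 12 dot-product models

models_per_dataset = rbf_models + poly_models + dot_models  # 336 models
svms_per_dataset = models_per_dataset * 200                 # 67200 SVMs
```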
Summarizing, the experiments required the training of more than half a million SVMs, considering all the data sets, and of course the testing of all the SVMs previously trained, in order to evaluate the bias–variance decomposition of the error of the different SVM models. For each SVM model we computed the main prediction, bias, net-variance, biased and unbiased variance and the error on each example of the test set, and the corresponding average quantities over the whole test set.
4.1.3 Software used in the experiments
In all our experiments we used the NEURObjects [184]^1 C++ library and the SVM-light [96] applications. In particular, the synthetic data sets P2 and Waveform were generated respectively through our C++ application gensimple and through the application waveform from the UCI repository. The bootstrapped data for the real data sets were extracted using the NEURObjects C++ application subsample. The data were then normalized using the NEURObjects application convert data format. For some data sets, in order to randomly extract separate training and test sets, we used the NEURObjects application dofold. Training and testing of the SVMs were performed using Joachims' SVM-light software, and in particular the applications svm learn and svm classify. We slightly modified svm learn in order to force convergence of the SVM algorithm when the optimality conditions are not reached in a reasonable time. We developed and used the C++ application analyze BV to perform the bias–variance decomposition of the error^2. This application analyzes the output of a generic learning machine model and computes the main prediction, error, bias, net-variance, unbiased and biased variance using the 0/1 loss function. Other C++ applications have been developed for the automatic analysis of the results, together with Cshell scripts to train, test and analyze the bias–variance decomposition of all the SVM models for a specific data set, considering respectively gaussian, polynomial and dot-product kernels.
4.2 Results
In this section we present the results of the experiments. We analyzed the bias–variance decomposition with respect to the kernel parameters, considering separately gaussian, polynomial and dot-product SVMs, and comparing also the results among different kernels. Here we present the main results. Full results, data and graphics are available by anonymous ftp at:
ftp://ftp.disi.unige.it/person/ValentiniG/papers/bv-svm.ps.gz.
^1 Download web site: http://www.disi.unige.it/person/ValentiniG/NEURObjects.
^2 The source code is available at ftp://ftp.disi.unige.it/person/ValentiniG/BV. Moreover, C++ classes for bias–variance analysis have been developed as part of the NEURObjects library.
4.2.1 Gaussian kernels
Fig. 4.4 depicts the average loss, bias, net-variance, unbiased and biased variance, varying the values of σ and the regularization parameter C in RBF-SVMs on the Grey-Landsat data set. We note that σ is the most important parameter: although for very low values of C the SVM cannot learn, independently of the values of σ (Fig. 4.4 a), the error, the bias and the net-variance depend mostly on the σ parameter. In particular, for low values of σ the bias is very high (Fig. 4.4 b) and the net-variance is 0, as biased and unbiased variance are about equal (Fig. 4.4 d and 4.4 e). Then the bias suddenly goes down (Fig. 4.4 b), lowering the average loss (Fig. 4.4 a), and then stabilizes for higher values of σ. Interestingly enough, in this data set (but also in others, data not shown), we note an increment followed by a decrement of the net-variance, resulting in a sort of "wave shape" of the net-variance graph (Fig. 4.4 c).
Fig. 4.5 shows the bias–variance decomposition on different data sets, varying σ and for a fixed value of C, that is, a sort of "slice" along the σ axis of Fig. 4.4. The plots show that average loss, bias and variance depend significantly on σ for all the considered data sets, confirming the existence of a "high biased region" for low values of σ. In this region, biased and unbiased variance are about equal (the net-variance V_n = V_u − V_b is low). Then the unbiased variance increases while the biased variance decreases (Fig. 4.5 a, b, c and d), and finally both stabilize for relatively high values of σ. Interestingly, the average loss and the bias do not increase for high values of σ, especially if C is high.
Bias and average loss increase with σ only for very small C values. Note that net-variance and bias show opposite trends only for small values of C (Fig. 4.5 c). For larger C values the symmetric trend is limited to σ ≤ 1 (Fig. 4.5 d); otherwise the bias stabilizes and the net-variance slowly decreases.
Fig. 4.6 shows in more detail the effect of the C parameter on the bias–variance decomposition. For C ≥ 1 there are no variations of the average error, bias and variance for a fixed value of σ. Note that for very low values of σ (Fig. 4.6 a and b) there is no learning. In the Letter-Two data set, as in other data sets (figures not shown), only for small C values do we observe variations in the bias and variance values (Fig. 4.6).
4.2.1.1 The discriminant function computed by the SVM-RBF classifier
In order to get insights into the behaviour of the SVM learning algorithm with gaussian kernels, we plotted the real-valued functions computed without considering the discretization step performed through the sign function. The real-valued function computed by a gaussian
Figure 4.4: Grey-Landsat data set. Error (a) and its decomposition in bias (b), net variance
(c), unbiased variance (d), and biased variance (e) in SVM RBF, varying both C and σ.
Figure 4.5: Bias-variance decomposition of the error in bias, net variance, unbiased and biased variance in SVM RBF, varying σ and for fixed C values: (a) Waveform, (b) Grey-Landsat, (c) Letter-Two with C = 0.1, (d) Letter-Two with C = 1, (e) Letter-Two with added noise and (f) Spam.
Figure 4.6: Letter-Two data set. Bias-variance decomposition of error in bias, net variance,
unbiased and biased variance in SVM RBF, while varying C and for some fixed values of
σ: (a) σ = 0.01, (b) σ = 0.1, (c) σ = 1, (d) σ = 5, (e) σ = 20, (f) σ = 100.
Figure 4.7: The discriminant function computed by the SVM on the P2 data set with
σ = 0.01, C = 1.
SVM is the following:

f(x, α, b) = Σ_{i∈SV} y_i α_i exp(−||x_i − x||² / σ²) + b

where the α_i are the Lagrange multipliers found by solving the dual optimization problem, and the x_i ∈ SV are the support vectors, that is, the points for which α_i > 0.
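As an illustration, this real-valued output can be evaluated directly from the support vectors and the dual coefficients of a trained gaussian SVM. The following sketch is our own (variable names are hypothetical, not taken from the thesis software):

```python
import numpy as np

def rbf_svm_decision(x, sv, y_sv, alpha, b, sigma):
    """Real-valued output of a gaussian SVM:
    f(x) = sum over support vectors of y_i * alpha_i *
           exp(-||x_i - x||^2 / sigma^2) + b."""
    sq_dist = ((sv - x) ** 2).sum(axis=1)   # ||x_i - x||^2 for each SV
    return float((y_sv * alpha * np.exp(-sq_dist / sigma ** 2)).sum() + b)

def rbf_svm_classify(x, sv, y_sv, alpha, b, sigma):
    # discretization step through the sign of f(x)
    return 1 if rbf_svm_decision(x, sv, y_sv, alpha, b, sigma) > 0 else 2
```

With a very small σ the exponential terms vanish away from the support vectors, so f(x) is dominated by b elsewhere; this is the source of the "spiky" surfaces discussed in this section.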
We plotted the surface computed by the gaussian SVM on the synthetic data set P2. Indeed, it is the only surface that can be easily visualized, as the data are bidimensional and the resulting real-valued function can be easily represented through a wireframe three-dimensional surface. The SVMs are trained with exactly the same training set, composed of 100 examples. The outputs refer to a test set of 10000 examples, selected uniformly over the whole data domain. In particular, we considered a grid of equispaced data at 0.1 intervals in the two-dimensional 10 × 10 input space. If f(x, α, b) > 0 then the SVM assigns the example x to class 1, otherwise to class 2.
With small values of σ we have "spiky" functions: the response is high around the support vectors, and close to 0 in all the other regions of the input domain (Fig. 4.7). In this case we have overfitting: a large error on the test set (about 46% with σ = 0.01 and 42.5
Figure 4.8: The discriminant function computed by the SVM on the P2 data set, with
σ = 1, C = 1.
% with σ = 0.02), and a training error near 0. If we enlarge the value of σ we obtain a wider response on the input domain and the error decreases (with σ = 0.1 the error is about 37%). With σ = 1 we have a smooth function that fits the data quite well (Fig. 4.8). In this case the error drops to about 13%.
Enlarging σ too much, we obtain an overly smooth function (Fig. 4.9 (a)), and the error increases to about 37%: in this case the high bias is due to an excessive smoothing of the function. Increasing the value of the regularization parameter C (in order to better fit the data), we can reduce the error to about 15%: the shape of the function is now less smooth (Fig. 4.9 (b)).
Finally, using very large values of sigma (e.g. σ = 500), we have a very smooth (in practice a plane) and a very biased function (error about 45%); if we increment C we obtain better results, but always with a large error (about 35%).
Figure 4.9: The discriminant function computed by the SVM on the P2 data set. (a) σ = 20, C = 1, (b) σ = 20, C = 1000.
4.2.1.2 Behavior of SVMs with large σ values
Fig. 4.4 and 4.5 show that the σ parameter has a sort of smoothing effect as its value increases. In particular, with large values of σ we did not observe any increment of the bias or decrement of the variance. In order to get insights into this counter-intuitive behaviour, we tried to answer these two questions:
1. Does the bias increase while the variance decreases with large values of σ, and what is the combined effect of bias and variance on the error?
2. In this situation (large values of σ), what is the effect of the C parameter?
In Fig. 4.5 we do not observe an increment of the bias with large values of σ, but we limited our experiments to values of σ ≤ 100. Here we investigate the effect of larger values of σ (from 100 to 1000).
In most cases, even increasing the value of σ up to 1000, we do not observe an increment of the bias or a substantial decrement of the variance. Only for low values of C, that is C < 1, do the bias and the error increase with large values of σ (Fig. 4.10).
With the P2 data set the situation is different: in this case we observe an increment of the bias and the error with large values of σ, even if with large values of C the increment rate is lower (Fig. 4.11 a and b). Also with the Musk data set we note an increment of the error for very large values of σ, but surprisingly this is due to an increment of the unbiased variance, while the bias is quite stable, at least for values of C > 1 (Fig. 4.11 c and d).
Larger values of C counter-balance the bias introduced by large values of σ. But with some distributions of the data, too large values of σ produce too smooth functions, and even incrementing C it is very difficult to fit the data. Indeed, the real-valued function computed by the RBF-SVM on the P2 data set (that is, the function computed without considering the sign function) is too smooth for large values of σ: for σ = 20 the error is about 37%, due almost entirely to the large bias (Fig. 4.9 a), and for σ = 500 the error is about 45%; even incrementing the C value to 1000 we obtain a surface that fits the data better, but with an error that remains large (about 35%).
Summarizing, with large σ values the bias can increase while the net-variance tends to stabilize, but this effect can be counter-balanced by larger C values.
4.2.1.3 Relationships between generalization error, training error, number of support vectors and capacity
Looking at Fig. 4.4 and 4.5, we see that SVMs do not learn for small values of σ. Moreover, the low-error region is relatively large with respect to σ and C.
Figure 4.10: Grey-Landsat data set. Bias-variance decomposition of error in bias, net
variance, unbiased and biased variance in SVM RBF, while varying σ and for some fixed
values of C: (a) C = 0.1, (b) C = 1, (c) C = 10, (d) C = 100.
Figure 4.11: Bias-variance decomposition of error in bias, net variance, unbiased and biased
variance in SVM RBF, while varying σ and for some fixed values of C: (a) P2, with C = 1,
(b) P2, with C = 1000, (c) Musk, with C = 1, (d) Musk, with C = 1000.
In this section we evaluate the relationships between the estimated generalization error,
the bias, the training error, the number of support vectors and the estimated Vapnik–
Chervonenkis dimension [187], in order to answer the following questions:
1. Why do SVMs not learn for small values of σ?
2. Why is the bias so large for small values of σ?
3. Can we use the variation of the number of support vectors to predict the "low error"
region?
4. Is there any relationship between bias, variance and the VC dimension, and can we
use the latter to identify the "low error" region?
The generalization error, bias, training error, number of support vectors and Vapnik–
Chervonenkis dimension are estimated by averaging over 400 SVMs (P2 data set) or 200
SVMs (other data sets), trained with different bootstrapped training sets of 100 examples
each. The test error and the bias are estimated with respect to an independent and
sufficiently large data set.
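The estimation procedure described above can be sketched in code. The following is an illustrative reconstruction (not the thesis implementation) of the bias–variance decomposition for the 0/1 loss, applied to the matrix of labels predicted by the bootstrap-trained models; the function name and array layout are assumptions:

```python
import numpy as np

def bias_variance_01(predictions, y_test):
    """Domingos-style bias-variance decomposition for the 0/1 loss.

    predictions: (n_models, n_test) array of class labels (non-negative ints)
                 predicted by models trained on different bootstrap samples.
    y_test: (n_test,) true labels of an independent test set.
    """
    # Main prediction: the most-voted class at each test point.
    main = np.array([np.bincount(col).argmax()
                     for col in predictions.T.astype(int)])
    bias = (main != y_test).astype(float)          # per-point bias (0 or 1)
    var = (predictions != main).mean(axis=0)       # per-point variance
    # Unbiased variance acts on points where the main prediction is correct,
    # biased variance where it is wrong; net variance is their difference.
    unbiased_var = np.where(bias == 0, var, 0.0).mean()
    biased_var = np.where(bias == 1, var, 0.0).mean()
    return {"error": (predictions != y_test).mean(),
            "bias": bias.mean(),
            "unbiased_var": unbiased_var,
            "biased_var": biased_var,
            "net_var": unbiased_var - biased_var}
```

For two-class problems the identity error = bias + unbiased variance − biased variance holds exactly, which gives a quick sanity check on any implementation.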
The VC dimension is estimated using Vapnik's bound, based on the radius R of the sphere
that contains all the data (in the feature space), approximated by the sphere centered at
the origin, and on the norm of the weights in the feature space [187]. In this way the VC
dimension is overestimated, but it is easy to compute, and we are mainly interested in
comparing the VC dimension of different SVM models:
VC ≤ R² · ‖w‖² + 1
where [37]
‖w‖² = Σ_{i∈SV} Σ_{j∈SV} α_i α_j y_i y_j K(x_i, x_j)
and
R² = max_i K(x_i, x_i)
The number of support vectors is expressed as the halved ratio %SV of the number of
support vectors to the total number of training data:
%SV = #SV / (2 · #training data)
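These quantities can be computed directly from a trained SVM. A minimal numpy sketch, assuming the kernel matrix K, the dual coefficients α, the labels y ∈ {−1, +1} and the support vector indices are available from any SVM trainer (names are illustrative):

```python
import numpy as np

def vc_bound(K, alpha, y, sv):
    """Upper bound on the VC dimension: VC <= R^2 * ||w||^2 + 1.

    K: (n, n) kernel matrix over the training points.
    alpha, y: dual coefficients and {-1,+1} labels; sv: support vector indices.
    """
    # ||w||^2 = sum_{i,j in SV} alpha_i alpha_j y_i y_j K(x_i, x_j)
    a = alpha[sv] * y[sv]
    w2 = a @ K[np.ix_(sv, sv)] @ a
    # R^2 approximated by the sphere centered at the origin in feature space
    R2 = np.max(np.diag(K))
    return R2 * w2 + 1.0
```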
In the graphs shown in Fig. 4.12 and Fig. 4.13, the left y axis reports the error, training
error, bias, and the halved ratio of support vectors; the right y axis reports the estimated
Vapnik–Chervonenkis dimension.
For very small values of σ the training error is very small (about 0), while the number of
support vectors is very high, and the error and the bias are also high (Fig. 4.12 and 4.13).
These facts support the hypothesis of overfitting problems with small values of σ. Indeed
the real-valued function computed by the SVM (that is, the function computed without
considering the sign function, Sect. 4.2.1.1) is very spiky for small values of σ (Fig. 4.7).
The response of the SVM is high only in small areas around the support vectors, while in all
the other areas "not covered" by the gaussians centered on the support vectors the response
is very low (about 0); that is, the SVM is not able to reach a decision, with a consequent
very high bias. In the same region (small values of σ) the net variance is usually very
small, for either of these reasons: 1) biased and unbiased variance are almost equal
because the SVM performs a sort of random guessing on most of the unknown
data; 2) both biased and unbiased variance are about 0, showing that all the SVMs tend to
Figure 4.12: Letter-Two data set. Error, bias, training error, halved fraction of support
vectors, and estimated VC dimension while varying the σ parameter and for some fixed
values of C: (a) C = 1, (b) C = 10, (c) C = 100, and (d) C = 1000.
Figure 4.13: Grey-Landsat data set. Error, bias, training error, halved fraction of support
vectors, and estimated VC dimension while varying the σ parameter and for some fixed
values of C: (a) C = 1, (b) C = 10, (c) C = 100, and (d) C = 1000.
answer in the same way independently of the particular instance of the training set (Fig. 4.5
a, b and f). Enlarging σ we obtain a wider response on the input domain: the real-valued
function computed by the SVM becomes smoother (Fig. 4.8), as the "bumps" around the
support vectors become wider and the SVM can decide also on unknown examples. At the
same time the number of support vectors decreases (Fig. 4.12 and 4.13).
Considering the variation of the ratio of support vectors with σ, in all data sets the
trend of the halved ratio of support vectors follows the error, with a sigmoid shape that
sometimes becomes a U shape for small values of C (Fig. 4.12 and 4.13). This is not
surprising, because it is known that the support vector ratio offers an approximation of the
generalization error of the SVM [187]. Moreover, on all the data sets the halved ratio of
support vectors decreases in the "stabilized" region, while it remains high in the transition
region. As a consequence, the decrement in the number of support vectors signals that we
are entering the "low error" region, and in principle we can use this information to detect
this region.
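As an illustration of this use of the support vector ratio, the sweep can be sketched with scikit-learn's SVC as a stand-in for the thesis' own SVM implementation; note that sklearn parametrizes the RBF kernel by gamma = 1/(2σ²), and the data set used here is synthetic:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Sweep sigma for an RBF-SVM and track the halved support-vector ratio
# alongside the test error: a drop in %SV signals that we are entering
# the "low error" region.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5, random_state=0)

for sigma in [0.01, 0.1, 1, 5, 20, 100]:
    clf = SVC(C=10, kernel="rbf", gamma=1.0 / (2 * sigma ** 2)).fit(Xtr, ytr)
    half_sv_ratio = clf.n_support_.sum() / (2 * len(Xtr))
    err = 1.0 - clf.score(Xte, yte)
    print(f"sigma={sigma:>6}: %SV/2={half_sv_ratio:.2f}  test error={err:.2f}")
```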
In order to analyze the role of the VC dimension in the generalization ability of learning
machines, recall from Statistical Learning Theory that the bound on the generalization
error E of SVMs has the following form:
E(f(σ, C)_n^k) ≤ Eemp(f(σ, C)_n^k) + Φ(h_k / n)    (4.1)
where f(σ, C)_n^k represents the set of functions computed by an RBF-SVM trained with n
examples and with parameters (σ_k, C_k) taken from a set of parameters S = {(σ_i, C_i), i ∈ N},
Eemp represents the empirical error, and Φ the confidence interval, which depends on the
cardinality n of the data set and on the VC dimension h_k of the set of functions identified
by the actual selection of the parameters (σ_k, C_k). In order to obtain good generalization
capabilities we need to minimize both the empirical risk and the confidence interval.
According to Vapnik's bound (eq. 4.1), in Fig. 4.12 and 4.13 the lowest generalization error
is obtained for a small empirical risk and a small estimated VC dimension.
But sometimes with relatively small values of VC we have a very large error, as the training
error and the number of support vectors increase for very large values of σ (Fig. 4.12 a
and 4.13 a). Moreover, with a very large estimate of the VC dimension and a low empirical
error (Fig. 4.12 and 4.13) we can have a relatively low generalization error.
In conclusion, it seems very difficult to use these estimates of the VC dimension in practice
to infer the generalization abilities of the SVM. In particular, it seems unreliable to use the
VC dimension to identify the "low error" region of the RBF-SVM.
4.2.2 Polynomial and dot-product kernels
In this section we analyze the characteristics of the bias–variance decomposition of the error
in polynomial SVMs, varying the degree of the kernel and the regularization parameter C.
The error shows a U shape with respect to the degree. This shape depends on the unbiased
variance (Fig. 4.14 a and b), or on both the bias and the unbiased variance (Fig. 4.14 c
and d). The U
Figure 4.14: Bias-variance decomposition of error in bias, net variance, unbiased and biased
variance in polynomial SVM, while varying the degree and for some fixed values of C: (a)
Waveform, C = 0.1, (b) Waveform, C = 50, (c) Letter-Two, C = 0.1, (d) Letter-Two,
C = 50.
shape of the error with respect to the degree tends to flatten for increasing values of
C, and net-variance and bias often show opposite trends (Fig. 4.15).
Average error and bias are higher for low C and degree values, but as the degree increases
the error becomes less sensitive to C (Fig. 4.16). Bias is flat (Fig. 4.17 a) or decreasing
Figure 4.15: P2 data set. Error (a) and its decomposition in bias (b) and net variance (c),
varying both C and the polynomial degree.
with respect to the degree (Fig. 4.15 b), or it can be constant or decreasing, depending on
C (Fig. 4.17 b). Unbiased variance shows a U shape (Fig. 4.14 a and b) or increases
(Fig. 4.14 c) with respect to the degree, and the net–variance follows the shape of the
unbiased variance. Note that in the P2 data set (Fig. 4.15) bias and net–variance follow
the classical opposite trends with respect to the degree. This is not the case with other
data sets (see, e.g., Fig. 4.14).
For large values of C, bias and net–variance tend to converge, as a result of the bias
reduction and net–variance increment (Fig. 4.18), or because both stabilize at similar
values (Fig. 4.16).
In dot–product SVMs bias and net–variance show opposite trends: bias decreases, while
Figure 4.16: Letter-Two data set. Bias-variance decomposition of error in bias, net variance,
unbiased and biased variance in polynomial SVM, while varying C and for some
polynomial degrees: (a) degree = 2, (b) degree = 3, (c) degree = 5, (d) degree = 10.
Figure 4.17: Bias in polynomial SVMs with (a) Waveform and (b) Spam data sets, varying
both C and polynomial degree.
Figure 4.18: Bias-variance decomposition of error in bias, net variance, unbiased and biased
variance in polynomial SVM, varying C: (a) P2 data set with degree = 6, (b) Spam data
set with degree = 3.
net–variance and unbiased variance tend to increase with C (Fig. 4.19). On the data set
P2 this trend is not observed, as in this task the bias is very high and the SVM does not
perform better than random guessing (Fig. 4.19 a). The minimum of the average loss for
relatively low values of C results from the decrement of the bias and the increment of the
net–variance: it is usually achieved before the crossover of the bias and net–variance curves,
and before both stabilize for large values of C. The biased variance remains small
independently of C.
4.2.3 Comparing kernels
In this section we compare the bias–variance decomposition of the error with respect to the
C parameter, considering gaussian, polynomial and dot–product kernels. For each kernel
and each data set the best results are selected. Tab. 4.2 shows the best results achieved
by the SVM for each kernel and each data set used in the experiments. Interestingly, in
3 data sets there are no significant differences in the error (Waveform, Letter-Two with
added noise and Spam), but there are differences in bias, net–variance, unbiased or biased
variance. In the other data sets gaussian kernels outperform polynomial and dot–product
kernels, as bias, net–variance or both are lower. In some cases bias and net–variance are
lower for the polynomial or dot–product kernel, showing that different kernels learn in
different ways on different data.
Considering the data set P2 (Fig. 4.20 a, c, e), the RBF-SVM achieves the best results, as
a consequence of a lower bias. Unbiased variance is comparable between the polynomial and
gaussian kernels, while net–variance is lower, as biased variance is higher for the polynomial
SVM. In this task the bias of the dot–product SVM is very high. Also on the data set Musk
(Fig. 4.20 b, d, f) the RBF-SVM obtains the best results, but in this case the unbiased
variance is responsible for this fact, while the bias is similar. With the other data sets the
bias is similar between RBF-SVM and polynomial SVM, but for the dot–product SVM the
bias is often higher (Fig. 4.21 b, d, f).
Interestingly, the RBF-SVM seems to be more sensitive to the C value than both the
polynomial and dot–product SVM: for C < 0.1 in some data sets the bias is much higher
(Fig. 4.21 a, c, e). With respect to C, bias and unbiased variance sometimes show opposite
trends, independently of the kernel: bias decreases while unbiased variance increases,
although this does not occur in some data sets. We also note that the shape of the error,
bias and variance curves is similar across kernels in all the considered data sets: that is,
well-tuned SVMs with different kernels tend to show similar trends of the bias and variance
curves with respect to the C parameter.
Figure 4.19: Bias-variance decomposition of error in bias, net variance, unbiased and biased
variance in dot-product SVM, varying C: (a) P2, (b) Grey-Landsat, (c) Letter-Two, (d)
Letter-Two with added noise, (e) Spam, (f) Musk.
Figure 4.20: Bias-variance decomposition of error in bias, net variance, unbiased and
biased variance with respect to C, considering different kernels. (a) P2, gaussian; (b)
Musk, gaussian; (c) P2, polynomial; (d) Musk, polynomial; (e) P2, dot–product; (f) Musk,
dot–product.
Figure 4.21: Bias-variance decomposition of error in bias, net variance, unbiased and
biased variance with respect to C, considering different kernels. (a) Waveform, gaussian;
(b) Letter-Two, gaussian; (c) Waveform, polynomial; (d) Letter-Two, polynomial; (e)
Waveform, dot–product; (f) Letter-Two, dot–product.
Table 4.2: Compared best results with different kernels and data sets. RBF-SVM stands
for SVM with gaussian kernel; Poly-SVM for SVM with polynomial kernel and D-prod
SVM for SVM with dot-product kernel. Var unb. and Var. bias. stand for unbiased and
biased variance.
Kernel       Parameters            Avg. Error  Bias    Var. unb.  Var. bias.  Net Var.

Data set P2
RBF-SVM      C = 20, σ = 2         0.1516      0.0500  0.1221     0.0205      0.1016
Poly-SVM     C = 10, degree = 5    0.2108      0.1309  0.1261     0.0461      0.0799
D-prod SVM   C = 500               0.4711      0.4504  0.1317     0.1109      0.0207

Data set Waveform
RBF-SVM      C = 1, σ = 50         0.0706      0.0508  0.0356     0.0157      0.0198
Poly-SVM     C = 1, degree = 1     0.0760      0.0509  0.0417     0.0165      0.0251
D-prod SVM   C = 0.1               0.0746      0.0512  0.0397     0.0163      0.0234

Data set Grey-Landsat
RBF-SVM      C = 2, σ = 20         0.0382      0.0315  0.0137     0.0069      0.0068
Poly-SVM     C = 0.1, degree = 5   0.0402      0.0355  0.0116     0.0069      0.0047
D-prod SVM   C = 0.1               0.0450      0.0415  0.0113     0.0078      0.0035

Data set Letter-Two
RBF-SVM      C = 5, σ = 20         0.0743      0.0359  0.0483     0.0098      0.0384
Poly-SVM     C = 2, degree = 2     0.0745      0.0391  0.0465     0.0111      0.0353
D-prod SVM   C = 0.1               0.0908      0.0767  0.0347     0.0205      0.0142

Data set Letter-Two with added noise
RBF-SVM      C = 10, σ = 100       0.3362      0.2799  0.0988     0.0425      0.0563
Poly-SVM     C = 1, degree = 2     0.3432      0.2799  0.1094     0.0461      0.0633
D-prod SVM   C = 0.1               0.3410      0.3109  0.0828     0.0527      0.0301

Data set Spam
RBF-SVM      C = 5, σ = 100        0.1263      0.0987  0.0488     0.0213      0.0275
Poly-SVM     C = 2, degree = 2     0.1292      0.0969  0.0510     0.0188      0.0323
D-prod SVM   C = 0.1               0.1306      0.0965  0.0547     0.0205      0.0341

Data set Musk
RBF-SVM      C = 2, σ = 100        0.0884      0.0800  0.0217     0.0133      0.0084
Poly-SVM     C = 2, degree = 2     0.1163      0.0785  0.0553     0.0175      0.0378
D-prod SVM   C = 0.01              0.1229      0.1118  0.0264     0.0154      0.0110

4.3 Characterization of Bias–Variance Decomposition of the Error
Despite the differences observed in different data sets, common trends of bias and variance
can be identified for each of the kernels considered in this study. Each kernel presents
a specific characterization of bias and variance with respect to its specific parameters, as
explained in the following sections.
4.3.1 Gaussian kernels
Error, bias, net–variance, unbiased and biased variance show a common trend in the 7 data
sets we used in the experiments. Some differences, of course, arise in the different data
sets, but we can distinguish three different regions in the error analysis of the RBF-SVM,
with respect to increasing values of σ (Fig. 4.22):
1. High bias region. For low values of σ the error is high: it is dominated by a high bias.
Net–variance is about 0, as biased and unbiased variance are equivalent. In this region
there are no remarkable fluctuations of bias and variance: both remain constant, with
high values of bias and comparable values of unbiased and biased variance, leading to
net–variance values near 0. In some cases biased and unbiased variance are about
equal but different from 0; in other cases they are equal and near 0.
2. Transition region. Suddenly, for a critical value of σ, the bias decreases rapidly.
This critical value depends also on C: for very low values of C we have no learning,
then for higher values the bias drops. Higher values of C cause the critical value of σ
to decrease (Fig. 4.4 (b) and 4.5). In this region the increase in net–variance is smaller
than the decrease in bias, so the average error decreases. The boundary of this region
can be placed at the point where the error stops decreasing. This region is also
characterized by a particular trend of the net–variance. We can distinguish two
main behaviours:
(a) Wave-shaped net–variance. Net–variance first increases and then decreases,
producing a wave-shaped curve with respect to σ. The initial increment of
the net–variance is due to the simultaneous increment of the unbiased variance
and decrement of the biased variance. In the second part of the transition
region, biased variance stabilizes and unbiased variance decreases, producing a
parallel decrement of the net–variance. The rapid decrement of the error with
σ is due to the rapid decrement of the bias, after which the bias stabilizes and
the further decrement of the error with σ is determined by the net–variance
reduction (Fig. 4.4c, 4.5).
(b) Semi-wave-shaped net–variance. In other cases the net–variance curve with
σ is not so clearly wave-shaped: the descending part is very reduced (Fig. 4.5
e, f). In particular, in the Musk data set we have a continuous increment of the
net–variance (due to the continuous growth of the unbiased variance with σ),
and no wave-shaped curve is observed (at least for C > 10, Fig. 4.11 d).
In both cases the increment of the net–variance is smaller than the decrement of the bias:
so the average error decreases.
3. Stabilized region. This region is characterized by small or no variations in bias
and net–variance. For high values of σ both bias and net–variance stabilize and the
average error is constant (Fig. 4.4, 4.5). In other data sets the error increases
with σ, because of the increment of the bias (Fig. 4.11 a, b) or of the unbiased variance
(Fig. 4.11 c, d).
[Plot annotations: high bias region — comparable unbiased and biased variance; transition
region — bias drops down, wave-shaped net-variance; stabilized region — no bias-variance
variations.]
Figure 4.22: The 3 regions of error in RBF-SVM with respect to σ.
In the first region, bias rules the SVM behavior: in most cases the bias is constant and close
to 0.5, showing that we have a sort of random guessing, without effective learning. It appears
that the area of influence of each support vector is too small (Fig. 4.7), and the learning
machine overfits the data. This is confirmed by the fact that in this region the training
error is about 0 and almost all the training examples are support vectors.
In the transition region, the SVM starts to learn, adapting itself to the data characteristics.
Bias rapidly goes down (at the expense of a growing net–variance), but for higher values
of σ (in the second part of the transition region) sometimes the net–variance also goes down,
working to lower the error (Fig. 4.5).
[Plot annotations: U shape of the error; the error depends on both bias and unbiased
variance; bias and net-variance switch the main contribution to the error.]
Figure 4.23: Behaviour of polynomial SVM with respect to the bias–variance decomposition
of the error.
Even if the third region is characterized by no variations in bias and variance, sometimes
for low values of C the error increases with σ (Fig. 4.10 a, 4.12 a), as a result of the bias
increment; on the whole, RBF-SVMs are sensitive to low values of C: if C is too low, the
bias can grow quickly. High values of C lower the bias (Fig. 4.12 c, d).
4.3.2 Polynomial and dot-product kernels
For polynomial and dot–product SVMs we have also characterized the behavior in terms
of average error, bias, net–variance, unbiased and biased variance, even if we cannot
distinguish clearly defined regions.
However, common trends of the error curves with respect to the polynomial degree can be
noticed, considering bias, net–variance and unbiased and biased variance.
The average loss curve shows in general a U shape with respect to the polynomial degree,
and this shape may depend on both bias and unbiased variance or, in some cases, mostly on
the unbiased variance, according to the characteristics of the data set. From these general
observations we can schematically distinguish two main global pictures of the behaviour of
polynomial SVM with respect to the bias–variance decomposition of the error:
76
1. Error curve shape bias–variance dependent.
In this case the shape of the error curve depends both on the unbiased variance
and on the bias. The trends of bias and net–variance can be symmetric, or they can
have non-coincident paraboloid shapes, depending on the values of the C parameter
(Fig. 4.14 c, d and 4.15). Note that bias and net variance often show opposite trends
(Fig. 4.15).
2. Error curve shape unbiased variance dependent.
In this case the shape of the error curve mainly depends on the unbiased variance.
The bias (and the biased variance) tend to be degree independent, especially for high
values of C (Fig. 4.14 a, b).
Fig. 4.23 schematically summarizes the main characteristics of the bias–variance
decomposition of the error in polynomial SVM. Note however that the error curve depends for the
most part on both variance and bias: the prevalence of the unbiased variance (Fig. 4.14 a,
b) or of the bias seems to depend mostly on the distribution of the data. The increment of
[Plot annotations: minimum of the error due to the large decrement of bias; opposite trends
of bias and net-variance; stabilized region; low biased variance independent of C.]
Figure 4.24: Behaviour of the dot–product SVM with respect to the bias–variance decomposition of the error.
the values of C tends to flatten the U shape of the error curve: in particular, for large C
values the bias becomes independent of the degree (Fig. 4.17). Moreover, the C parameter
also plays a regularization role (Fig. 4.18).
Dot–product SVMs are characterized by opposite trends of bias and net–variance: bias
decreases, while net–variance grows with respect to C; then, for higher values of C, both
stabilize. The combined effect of these symmetric curves produces a minimum of the
error for low values of C, as the initial decrement of bias with C is larger than the initial
increment of net–variance. Then the error slightly increases and stabilizes with C (Fig.
4.19). The shape of the net–variance curve is determined mainly by the unbiased variance:
it increases and then stabilizes with respect to C. On the other hand the biased variance
curve is flat, remaining small for all values of C. A schematic picture of this behaviour is
given in Fig. 4.24.
Chapter 5
Bias–variance analysis in random aggregated and bagged ensembles of SVMs
Methods based on resampling techniques, and in particular on bootstrap aggregating
(bagging) of sets of base learners trained on repeated bootstrap samples drawn from a given
learning set, were introduced in the nineties by Breiman [15, 16, 17].
The effectiveness of this approach, which has been shown to improve the accuracy of a
single predictor [15, 63, 128, 44], lies in its property of reducing the variance
component of the error.
Bagging can be seen as an approximation of random aggregating, that is, a process by
which base learners, trained on samples drawn according to an unknown probability
distribution from the entire universe population, are aggregated through majority voting
(classification) or averaging (regression).
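Bagging by majority voting can be sketched generically; here a toy nearest-centroid classifier stands in for the base learner (an SVM or a decision tree could be substituted), and all names are illustrative:

```python
import numpy as np

class NearestCentroid:
    """Toy base learner used as a stand-in for SVMs or decision trees."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        # Squared distance of each point to each class centroid.
        d = ((X[:, None, :] - self.centroids_[None, :, :]) ** 2).sum(axis=2)
        return self.classes_[d.argmin(axis=1)]

def bagging_predict(X_train, y_train, X_test, n_learners=25, seed=0):
    """Train base learners on bootstrap samples; aggregate by majority vote."""
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(n_learners):
        # Bootstrap sample: draw with replacement, same size as the training set.
        idx = rng.integers(0, len(X_train), size=len(X_train))
        votes.append(NearestCentroid().fit(X_train[idx], y_train[idx]).predict(X_test))
    votes = np.array(votes, dtype=int)
    return np.array([np.bincount(col).argmax() for col in votes.T])
```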
Breiman showed that in regression problems aggregation of predictors always improves the
performance of single predictors, while in classification problems this is not always the
case if poor base predictors are used [15].
The improvement depends on the stability of the base learner: random aggregating and
bagging are effective with unstable learning algorithms, that is when small changes in the
training set can result in large changes in the predictions of the base learners.
Random aggregating always reduces variance in regression, and in classification with
reasonably good base classifiers, while bias remains substantially unchanged. With bagging we
can have a variance reduction, but bias can also slightly increase, as the average sample
size used by each base learner is only about 2/3 of the training set from which the samples
are bootstrapped.
In general, bagging unstable base learners is a good idea. As bagging is substantially a
variance reduction method, we could also select low-bias base learners in order to reduce
both the bias and the variance components of the error.
Random aggregating is only a theoretical ensemble method, as we would need the entire
universe of data from which random samples are drawn according to a usually unknown
probability distribution. But with very large data sets, using a uniform probability
distribution and undersampling techniques, we can simulate random aggregating (assuming
that the very large available data set and the uniform probability distribution are good
approximations, respectively, of the "universe" population and of the unknown probability
distribution).
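This simulation of random aggregating from a very large data set can be sketched as small uniform subsamples, drawn without replacement, feeding a user-supplied base learner; the function names and parameters are illustrative:

```python
import numpy as np

def random_aggregate(pool_X, pool_y, X_test, base_fit_predict,
                     n_learners=50, sample_size=100, seed=0):
    """Approximate random aggregating: each learner sees a small sample drawn
    uniformly (without replacement) from a very large data pool; predictions
    are aggregated by majority vote.

    base_fit_predict(X_train, y_train, X_test) -> predicted labels.
    """
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(n_learners):
        idx = rng.choice(len(pool_X), size=sample_size, replace=False)
        votes.append(base_fit_predict(pool_X[idx], pool_y[idx], X_test))
    votes = np.asarray(votes, dtype=int)
    return np.array([np.bincount(col).argmax() for col in votes.T])
```

The only difference from bagging is the sampling scheme: small samples drawn without replacement from the large pool, instead of bootstrap replicates of one fixed learning set.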
In the next sections we discuss some theoretical issues about the relationships between
random aggregating and bagging. Then we verify the theoretical results on the expected
variance reduction in bagging and random aggregating, performing an extended experimental
bias–variance decomposition of the error in bagged and random aggregated ensembles
of SVMs. Finally, we consider an approximation of random aggregated ensembles for very
large scale data mining problems.
5.1 Random aggregating and bagging
Let D be a set of m points drawn identically and independently from U according to P ,
where U is a population of labeled training data points (xj , tj ), and P (x, t) is the joint
distribution of the data points in U , with x ∈ Rd .
Let L be a learning algorithm, and define fD = L(D) as the predictor produced by L
applied to a training set D. The model produces a prediction fD(x) = y. Suppose that a
sequence of learning sets {Dk} is given, each drawn i.i.d. from the same underlying distribution P.
According to [15] we can aggregate the fD trained on different samples drawn from U to
get a better predictor fA(x, P). For regression problems tj ∈ R and fA(x, P) = ED[fD(x)],
where ED[·] denotes the expected value with respect to the distribution of D, while in
classification problems tj ∈ S ⊂ N and fA(x, P) = arg maxj |{k | fDk(x) = j}|.
As the training sets D are randomly drawn from U , we name the procedure to build fA
random aggregating. In order to simplify the notation, we denote fA (x, P ) as fA (x).
5.1.1 Random aggregating in regression
If T and X are random variables with joint distribution P, the expected squared loss EL for the single predictor f_D(X) is:

EL = E_D[E_{T,X}[(T − f_D(X))^2]]     (5.1)

while the expected squared loss EL_A for the aggregated predictor is:

EL_A = E_{T,X}[(T − f_A(X))^2]     (5.2)
Developing the square in eq. 5.1 we have:

EL = E_D[E_{T,X}[T^2 + f_D^2(X) − 2 T f_D(X)]]
   = E_T[T^2] + E_D[E_X[f_D^2(X)]] − 2 E_T[T] E_D[E_X[f_D(X)]]
   = E_T[T^2] + E_X[E_D[f_D^2(X)]] − 2 E_T[T] E_X[f_A(X)]     (5.3)
In a similar way, developing the square in eq. 5.2 we have:

EL_A = E_{T,X}[T^2 + f_A^2(X) − 2 T f_A(X)]
     = E_T[T^2] + E_X[f_A^2(X)] − 2 E_T[T] E_X[f_A(X)]
     = E_T[T^2] + E_X[E_D[f_D(X)]^2] − 2 E_T[T] E_X[f_A(X)]     (5.4)
Let Z = f_D(X). Using E[Z^2] ≥ E[Z]^2, we have that E_D[f_D^2(X)] ≥ E_D[f_D(X)]^2, and hence, comparing eq. 5.3 and 5.4, EL ≥ EL_A.
The reduction of the error in random aggregated ensembles depends on how much the two terms E_X[E_D[f_D^2(X)]] and E_X[E_D[f_D(X)]^2] of eq. 5.3 and 5.4 differ. As outlined by Breiman, the effect of instability is clear: if f_D(X) does not change too much with replicate data sets D, the two terms will be nearly equal and aggregation will not help. The more variable the f_D(X) are, the more improvement aggregation may produce.
In other words, the reduction of the error depends on the instability of the prediction, that is, on how unequal the two sides of eq. 5.5 are:

E_D[f_D(X)]^2 ≤ E_D[f_D^2(X)]     (5.5)
There is a strict relationship between the instability and the variance of the base predictor. Indeed the variance V(X) of the base predictor is:

V(X) = E_D[(f_D(X) − E_D[f_D(X)])^2]
     = E_D[f_D^2(X) + E_D[f_D(X)]^2 − 2 f_D(X) E_D[f_D(X)]]
     = E_D[f_D^2(X)] − E_D[f_D(X)]^2     (5.6)
Comparing eq. 5.5 and 5.6 we see that the higher the instability of the base classifiers, the higher their variance is. The reduction of the error in random aggregation is due to the elimination of the variance component (eq. 5.6) of the error: V(X) is large exactly when the two sides of eq. 5.5 differ substantially, that is, when the base predictor is unstable (eq. 5.5).
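The variance-removal effect of aggregation can be checked with a small Monte Carlo simulation. The sketch below uses an illustrative setting of our own (not from the thesis): a deliberately unstable base learner that predicts the mean target of a two-point training sample, on the noise-free target t = x. It estimates EL for the single predictor (eq. 5.1) and EL_A for the aggregated one (eq. 5.2); their gap approximates the average variance of eq. 5.6.

```python
import random

random.seed(0)

def train(sample):
    # deliberately unstable base learner: predict the mean target of a
    # tiny training sample, ignoring the input x entirely
    mean_t = sum(t for _, t in sample) / len(sample)
    return lambda x: mean_t

def draw_sample(m):
    # "universe" U: x uniform in [0, 1], noise-free target t = x
    return [(x, x) for x in (random.random() for _ in range(m))]

predictors = [train(draw_sample(2)) for _ in range(2000)]
test_x = [i / 200 for i in range(200)]

# expected squared loss of the single predictor (eq. 5.1)
EL = sum((x - f(x)) ** 2 for f in predictors for x in test_x) \
     / (len(predictors) * len(test_x))

# aggregated predictor f_A(x) = E_D[f_D(x)] and its loss (eq. 5.2)
def f_A(x):
    return sum(f(x) for f in predictors) / len(predictors)

EL_A = sum((x - f_A(x)) ** 2 for x in test_x) / len(test_x)

print(EL > EL_A)   # aggregation removes the variance term, so EL > EL_A
```

With this base learner the gap EL − EL_A is close to the variance of the two-point mean (1/24), in line with eq. 5.3 and 5.4.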
5.1.2 Random aggregating in classification
With random aggregation of base classifiers the same behaviour regarding stability holds, but in this case a more complex situation arises.
Indeed, let f_D(X) be a base classifier that predicts a class label t ∈ C, C = {1, . . . , C}, let X be a random variable as in the previous regression case, and let T be a random variable with values in C.
Then the probability p(D) of correct classification for a fixed data set D, considering a non-deterministic assignment of the class labels, is:

p(D) = P(f_D(X) = T) = Σ_{j=1}^{C} P(f_D(X) = j | T = j) P(T = j)     (5.7)
In order to make the probability p of correct classification independent of the choice of a specific learning set, we average over D:

p = Σ_{j=1}^{C} E_D[P(f_D(X) = j | T = j)] P(T = j)
  = Σ_{j=1}^{C} ∫ P(f_D(X) = j | X = x, T = j) P(T = j | X = x) P_X(dx)     (5.8)
Recalling that f_A(X) = arg max_i P_D(f_D(X) = i), the probability p_A of correct classification for random aggregation is:

p_A = Σ_{j=1}^{C} P(f_A(X) = j | T = j) P(T = j)
    = Σ_{j=1}^{C} ∫ P(f_A(X) = j | T = j) P(T = j | X = x) P_X(dx)
    = Σ_{j=1}^{C} ∫ P(arg max_i [P_D(f_D(X) = i)] = j | T = j) P(T = j | X = x) P_X(dx)
    = Σ_{j=1}^{C} ∫ I(arg max_i [P_D(f_D(X) = i)] = j) P(T = j | X = x) P_X(dx)     (5.9)

where I is the indicator function.
The optimal prediction for a pattern x is the Bayesian prediction:

B*(x) = arg max_j P(T = j | X = x)     (5.10)

We now split the patterns into a set O, corresponding to the optimal predictions performed by the aggregated classifier, and a set O′, corresponding to the non-optimal predictions. The set O of the optimally classified patterns is:

O = {x | arg max_j P(T = j | X = x) = arg max_j P_D(f_D(x) = j)}
According to the proposed partition of the data we can split the probability p_A of correct classification for random aggregation into two terms:

p_A = ∫_{x∈O} max_j P(T = j | X = x) P_X(dx) + ∫_{x∈O′} Σ_{j=1}^{C} I(f_A(x) = j) P(T = j | X = x) P_X(dx)     (5.11)
If x ∈ O we have:

arg max_j P(T = j | X = x) = arg max_j P_D(f_D(x) = j)     (5.12)

In this case, considering eq. 5.8 and 5.9:

Σ_{j=1}^{C} P(f_D(X) = j | T = j) P(T = j | X = x) ≤ max_j P(T = j | X = x)
and hence p_A(x) ≥ p(x). On the contrary, if x ∈ O′ eq. 5.12 does not hold, and it may occur that:

Σ_{j=1}^{C} I(f_A(x) = j) P(T = j | X = x) < Σ_{j=1}^{C} P(f_D(X) = j | T = j) P(T = j | X = x)
As a consequence, if the set O of the optimally predicted patterns is large, that is, if we have relatively good predictors, aggregation improves performance. On the contrary, if the set O′ is large, that is, if we have poor predictors, aggregation can worsen performance.
Summarizing, unlike regression, aggregating poor predictors can lower performance, whereas, as in regression, aggregating relatively good predictors can lead to better performance, as long as the base predictor is unstable [15].
5.1.3 Bagging
In most cases we only have data sets of limited size, and moreover we do not know the probability distribution underlying the data. In these cases we can try to simulate random aggregation by bootstrap replicates of the data [56], successively aggregating the predictors trained on the bootstrapped data.
The bootstrap aggregating method (bagging) [15] can be applied both to regression and classification problems: the only difference lies in the aggregation phase.
Consider, for instance, a classification problem. Let C be the set of class labels. Let {D_j}_{j=1}^{n} be the set of n bootstrapped samples drawn with replacement from the learning set D according to a uniform probability distribution. Let f_{D_j} = L(D_j) be the decision function of the classifier trained by a learning algorithm L using the bootstrapped sample D_j.
Then the classical decision function f_B(x) applied for aggregating the base learners in bagging is [15]:

f_B(x) = arg max_{c∈C} Σ_{j=1}^{n} I(f_{D_j}(x) = c)     (5.13)

where I(z) = 1 if the boolean expression z is true, otherwise I(z) = 0. In words, the bagged ensemble selects the most voted class.
In regression the aggregation is performed by averaging the real values computed by the real-valued base learners g_{D_j} : R^d → R:

f_B(x) = (1/n) Σ_{j=1}^{n} g_{D_j}(x)     (5.14)
Fig. 5.1 shows the pseudo-code for bagging. The learning algorithm L generates a hypothesis h_t : X → Y using a sample D_t bootstrapped from D, and h_fin is the final hypothesis computed by the bagged ensemble, aggregating the base learners through majority voting (Fig. 5.1).
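The pseudo-code of Fig. 5.1 translates directly into a minimal runnable sketch (plain Python; the toy nearest-centroid base learner on 1-D inputs stands in for L, and all names here are illustrative, not the thesis implementation):

```python
import random
from collections import Counter

def bootstrap_replicate(data, rng):
    # draw |data| examples with replacement, uniform distribution
    return [rng.choice(data) for _ in data]

def bagging(data, learn, T, seed=0):
    """Train T base hypotheses on bootstrap replicates of `data`
    and return the majority-vote hypothesis h_fin of Fig. 5.1."""
    rng = random.Random(seed)
    hypotheses = [learn(bootstrap_replicate(data, rng)) for _ in range(T)]

    def h_fin(x):
        votes = Counter(h(x) for h in hypotheses)
        return votes.most_common(1)[0][0]   # the most voted class

    return h_fin

def nearest_centroid(sample):
    # toy base learner: one centroid per class, nearest centroid wins
    groups = {}
    for x, y in sample:
        groups.setdefault(y, []).append(x)
    cents = {y: sum(xs) / len(xs) for y, xs in groups.items()}
    return lambda x: min(cents, key=lambda y: abs(x - cents[y]))

random.seed(0)
data = ([(random.gauss(0, 1), 0) for _ in range(50)]
        + [(random.gauss(3, 1), 1) for _ in range(50)])
h = bagging(data, nearest_centroid, T=25)
print(h(0.0), h(3.0))   # prints: 0 1
```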
Bagging shows the same limits as random aggregating: only if the base learners are unstable can we achieve a reduction of the error with respect to the single base learners. Of course, if the base learner is close to the Bayes error we cannot expect improvements from bagging.
Moreover, bagging is only an approximation of random aggregating, for at least two reasons.
Algorithm Bagging
Input arguments:
- Data set D = {z_i = (x_i, y_i)}_{i=1}^{n}, x_i ∈ X ⊂ R^d, y_i ∈ Y = {1, . . . , k}
- A learning algorithm L
- Number of iterations (base learners) T
Output:
- Final hypothesis h_fin : X → Y computed by the ensemble.
begin
  for t = 1 to T
  begin
    D_t = Bootstrap replicate(D)
    h_t = L(D_t)
  end
  h_fin(x) = arg max_{y∈Y} Σ_{t=1}^{T} I(h_t(x) = y)
end.
Figure 5.1: Bagging for classification problems.
First, bootstrap samples are not real data samples: they are drawn from a data set D that is in turn a sample from the population U. On the contrary, f_A uses samples drawn directly from U.
Second, bootstrap samples are drawn from D through a uniform probability distribution that is only an approximation of the unknown true distribution P.
For these reasons we can only hope that bagging is a good enough approximation of f_A for a considerable variance reduction (eq. 5.6) to result [17].
Moreover, with bagging each base learner uses on average only 63.2% of the available data for training, and so we can expect a larger bias for each base learner, as the effective size of the learning set is reduced. This can also affect the bias of the bagged ensemble, which critically depends on the bias of the component base learners: we may sometimes expect a slight increment of the bias of the bagged ensemble with respect to the unaggregated predictor trained on the entire available training set.
Bagging is a variance reduction method, but we cannot expect as large a decrement of variance as in random aggregating. The intuitive reason is that in random aggregating the base learners use more variable training sets, drawn from U according to the distribution P. In this way random aggregating exploits more information from the population U, while bagging can exploit only the information of a single data set D drawn from U, through bootstrap replicates of the data from D.
5.2 Bias–variance analysis in bagged SVM ensembles
In this section we deal with the problem of understanding the effect of bagging on the bias and variance components of the error in SVMs. Our aim is to gain insight into the way bagged ensembles learn, in order to characterize learning in terms of the bias–variance components of the error.
In particular, we try to verify the theoretical property of the expected variance reduction in bagging through an extended experimental bias–variance decomposition of the error in bagged SVM ensembles.
The plan of the experiments, whose results are summarized in the next sections, is the following. We performed experiments with gaussian, polynomial and dot-product kernels. At first, for each kernel, we evaluated the expected error and its decomposition in bias, net-variance, unbiased and biased variance with respect to the learning parameters of the base learners. Then we analyzed the bias–variance decomposition as a function of the number of base learners employed. Finally, we compared bias and variance with respect to the learning parameters in bagged SVM ensembles and in the corresponding single SVMs, in order to study the effect of bootstrap aggregation on the bias and variance components of the error.
The next sections report only some examples and a summary of the results of the bias–variance analysis in bagged SVM ensembles. Full data, results and graphics of the experiments on bias–variance analysis in bagged ensembles of SVMs are reported in [180].
5.2.1 Experimental setup
To estimate the decomposition of the error in bias, net-variance, unbiased and biased variance with bagged ensembles of SVMs, we performed a bias–variance decomposition of the error on the data sets described in Chap. 4. At first we split the data into a separate learning set D and test set T. Then we drew with replacement from D n samples D_i of size s, according to a uniform probability distribution. From each D_i, 1 ≤ i ≤ n, we generated by bootstrap m replicates D_ij, collecting them in n different sets D̄_i = {D_ij}_{j=1}^{m}.
We used the n sets D̄_i to train n bagged ensembles, each composed of m SVMs, each one trained with different bootstrapped data, repeating this process for all the considered SVM models. In order to properly compare the effect of different choices of the learning parameters on the bias–variance decomposition of the error, each SVM model is represented by a different choice of the kernel type and parameters and is trained with the same sets D̄_i, 1 ≤ i ≤ n, of bootstrapped samples.
test set T, significantly larger than the training sets, using the bagged ensembles trained on the n sets D̄_i.
The experimental procedure we adopted to generate the data and to manage the bias–variance analysis is summarized in Fig. 5.2 and 5.3. For more detailed information on how to compute the bias–variance decomposition of the error see Chap. 3.2.
Procedure Generate samples
Input arguments:
- Data set S
- Number of samples n
- Size of samples s
- Number of bootstrap replicates m
Output:
- Sets D̄_i = {D_ij}_{j=1}^{m}, 1 ≤ i ≤ n, of bootstrapped samples
begin procedure
  [D, T] = Split(S)
  for i = 1 to n
  begin
    D_i = Draw with replacement(D, s)
    D̄_i = ∅
    for j = 1 to m
    begin
      D_ij = Bootstrap replicate(D_i)
      D̄_i = D̄_i + D_ij
    end
  end
end procedure.
Figure 5.2: Procedure to generate samples to be used for bias–variance analysis in bagging.
The procedure Generate samples (Fig. 5.2) generates sets D̄_i of bootstrapped samples, first drawing a sample D_i from the training set D according to a uniform probability distribution (procedure Draw with replacement) and then drawing from D_i m bootstrap replicates (procedure Bootstrap replicate). Note that D̄_i is a set of sets, and the plus sign in Fig. 5.2 indicates that the entire set D_ij is added as a new element of the set of sets D̄_i.
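The procedure of Fig. 5.2 can be sketched in a few lines of Python (an illustrative transcription of our own; the names mirror the pseudo-code, and lists of lists stand in for the sets of sets D̄_i):

```python
import random

def generate_samples(D, n, s, m, seed=0):
    """Fig. 5.2: build n sets, each holding m bootstrap replicates D_ij
    of a sample D_i of size s drawn with replacement from D."""
    rng = random.Random(seed)
    sets = []
    for _ in range(n):
        # Draw_with_replacement(D, s)
        D_i = [rng.choice(D) for _ in range(s)]
        # m calls to Bootstrap_replicate(D_i)
        D_bar_i = [[rng.choice(D_i) for _ in range(s)] for _ in range(m)]
        sets.append(D_bar_i)
    return sets

sets = generate_samples(D=list(range(1000)), n=5, s=100, m=60)
print(len(sets), len(sets[0]), len(sets[0][0]))   # prints: 5 60 100
```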
Procedure Bias–Variance analysis
Input arguments:
- Test set T
- Number of bagged ensembles n
- Number of bootstrap replicates m
- Set of learning parameters A
Output:
- Error, bias, net-variance, unbiased and biased variance BV = {bv(α)}_{α∈A} of the bagged ensembles having base learners with learning parameters α ∈ A.
begin procedure
  for each α ∈ A
  begin
    Ensemble Set(α) = ∅
    for i = 1 to n
    begin
      bag(α, D̄_i) = Ensemble train(α, D̄_i)
      Ensemble Set(α) = Ensemble Set(α) ∪ bag(α, D̄_i)
    end
    bv(α) = Perform BV analysis(Ensemble Set(α), T)
    BV = BV ∪ bv(α)
  end
end procedure.
Figure 5.3: Procedure to perform bias–variance analysis on bagged SVM ensembles.
The procedure Bias–Variance analysis (Fig. 5.3) trains different ensembles of bagged SVMs (procedure Ensemble train) using the same sets of bootstrap samples generated through the procedure Generate samples. Then the bias–variance decomposition of the error is performed on the separate test set T using the previously trained bagged SVM ensembles (procedure Perform BV analysis).
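Procedure Perform BV analysis follows the decomposition of Chap. 3.2. For a classification problem with 0/1 loss (and ignoring noise), its core can be sketched as follows: for each test point the main prediction is the most voted label over the n trained ensembles, bias is 1 when the main prediction misses the target, and variance, the fraction of ensembles disagreeing with the main prediction, is counted as unbiased on unbiased points and biased on biased points. This sketch is our own condensation, not the thesis code:

```python
from collections import Counter

def bias_variance_01(predictions, targets):
    """0/1-loss decomposition (noise-free): `predictions[i]` holds the
    labels predicted for test point i by the n trained models; returns
    average bias, net-variance, unbiased and biased variance."""
    bias = unb = bsd = 0.0
    for preds, t in zip(predictions, targets):
        main = Counter(preds).most_common(1)[0][0]      # main prediction
        var = sum(p != main for p in preds) / len(preds)
        if main == t:
            unb += var    # unbiased point: variance increases the error
        else:
            bias += 1.0
            bsd += var    # biased point: variance decreases the error
    n = len(targets)
    return bias / n, (unb - bsd) / n, unb / n, bsd / n

# toy check: 3 test points, predictions of 4 models each
preds = [[0, 0, 0, 1], [1, 1, 0, 1], [0, 0, 0, 0]]
b, net, unb, bsd = bias_variance_01(preds, targets=[0, 0, 0])
print(b + net)   # equals the average 0/1 error of the models: 1/3
```

As in Chap. 3.2, the expected error equals bias plus net-variance, where net-variance is unbiased minus biased variance.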
In our experiments we employed gaussian, polynomial and dot-product kernels, evaluating 110 different SVM models, considering different combinations of kernel type and learning parameters for each data set. For each model we set s = 100, n = 100, m = 60, training for each data set 110 × 100 = 11000 bagged ensembles and a total of 110 × 100 × 60 = 660000 different SVMs. Considering all the data sets, we trained and tested about 80000 different bagged SVM ensembles and a total of about 5 million single SVMs.
5.2.2 Bagged RBF-SVM ensembles
This section reports the results of the bias–variance analysis in bagged SVMs, using base learners with gaussian kernels.
5.2.2.1 Bias–variance decomposition of the error
The decomposition of the error is represented with respect to different values of σ and for fixed values of C. The error follows a "U" (Fig. 5.4 a and b) or a "sigmoid" trend (Fig. 5.4 c and d). This trend is visible also in other data sets (data not shown). In the P2 data set we can observe opposite trends of bias and net-variance while varying the σ parameter, while in Letter-Two, for large values of σ, both bias and net-variance remain constant. The net-variance, for small σ values, is about 0, as unbiased and biased variance are about equal; then it rapidly increases, as at the same time the unbiased variance increases and the biased variance decreases. Anyway, when the net-variance increases, the error goes down, as the bias decreases more quickly. Then, for slightly larger values of σ, the error diminishes, mainly due to the reduction of the net-variance. For large values of σ (especially if C is also relatively large) both bias and net-variance stabilize at a low level and the error tends to be low, but in other data sets (e.g. P2) the bias increases, inducing a larger error rate.
5.2.2.2 Decomposition with respect to the number of base learners
Considering the bias–variance decomposition of the error with respect to the number of base learners, we can observe that the error reduction arises in the first 10 iterations, especially due to the reduction of the unbiased variance. The bias and the biased variance remain substantially unchanged through all the iterations (Fig. 5.5).
5.2.2.3 Comparison of bias–variance decomposition in single and bagged RBF-SVMs
Here we report the graphs comparing the bias–variance decomposition in single SVMs and bagged ensembles of SVMs. In all graphs of this section the data referring to single SVMs are labeled with crosses, while bagged SVMs are labeled with triangles. The corresponding quantities (e.g. bias, net-variance, etc.) are represented with the same type of line in both single and bagged SVMs.
We analyze the relationships between the bias–variance decomposition of the error in single and bagged RBF-SVMs for each of the different regions that characterize the bias–variance decomposition itself. In bagged SVM ensembles we can also observe the three different regions
Figure 5.4: Bias–variance decomposition of the error in bias, net-variance, unbiased and biased variance in bagged RBF-SVMs, while varying σ and for some fixed values of C. P2 data set: (a) C = 1, (b) C = 100. Letter-Two data set: (c) C = 1, (d) C = 100.
that characterize bias–variance decomposition in single SVMs (Sect. 4.3.1).
High bias region. In this region the error of single and bagged SVMs is about equal, and it is characterized by a very high bias. The net-variance is close to 0, because the biased variance is about equal to the unbiased variance. In some cases they are both close to 0; in other cases they are equal but greater than 0, with slightly larger values in single than in bagged SVMs (Fig. 5.6).
Transition region. In this region the bias goes down very quickly both in single and
bagged SVMs. The net-variance maintains the wave-shape also in bagged SVMs, but it is
slightly lower. The error drops down at about the same rate in single and bagged SVMs
(Fig. 5.6 a and b).
Figure 5.5: Bias–variance decomposition of the error in bias, net-variance, unbiased and biased variance in bagged RBF-SVMs, with respect to the number of iterations. (a) Grey-Landsat data set (σ = 2, C = 100); (b) Spam data set (σ = 50, C = 100).
Stabilized region. For relatively large values of σ the net-variance tends to stabilize (Fig. 5.6). In this region the net-variance of the bagged SVMs is equal to or less than the net-variance of the single SVMs, while the bias remains substantially unchanged in both. With some data sets (Fig. 5.4 a and b) the bias tends to increase with σ, especially for low values of C. As a result, bagged SVMs show an equal or lower average error with respect to single SVMs (Fig. 5.6).
5.2.3 Bagged polynomial SVM ensembles
This section reports the results of the experiments evaluating the decomposition of the error in bagged ensembles of polynomial SVMs.
5.2.3.1 Bias–variance decomposition of the error
The decomposition of the error is represented with respect to different values of the polynomial degree and for fixed values of C. Also in bagged polynomial ensembles the error shows a "U" shape with respect to the degree (Fig. 5.7), as in single polynomial SVMs (see Sect. 4.2.2). This shape depends both on bias and net-variance. The classical trade-off between bias and variance is sometimes noticeable (Fig. 5.7 b), but in other cases both bias and net-variance increase with the degree (Fig. 5.7 c and d). As a general rule, for low
Figure 5.6: Comparison of the bias–variance decomposition between single RBF-SVMs (lines labeled with crosses) and bagged RBF-SVM ensembles (lines labeled with triangles), while varying σ and for some fixed values of C. Letter-Two data set: (a) C = 1, (b) C = 100. Waveform data set: (c) C = 1, (d) C = 100.
degree polynomial kernels the bias is relatively large and the net-variance is low, while the opposite occurs with high degree polynomials (Fig. 5.7 a). The regularization parameter C also plays an important role: large C values tend to decrease the bias also for relatively low degrees (Fig. 5.7 d). Of course, these results also depend on the specific characteristics of the data sets.
Figure 5.7: Bias–variance decomposition of the error in bias, net-variance, unbiased and biased variance in bagged polynomial SVMs, while varying the degree and for some fixed values of C. P2 data set: (a) C = 0.1, (b) C = 100. Letter-Two data set: (c) C = 0.1, (d) C = 100.
5.2.3.2 Decomposition with respect to the number of base learners
This section reports data and graphs about the bias–variance decomposition in bagged SVMs with respect to the number of iterations of bagging, that is, the number of base learners used. The error decreases in the first 10 iterations, due to the reduction of the unbiased variance, while the bias and the biased variance remain substantially unchanged (Fig. 5.8). This behavior is similar to that shown by bagged ensembles of gaussian SVMs (Fig. 5.5).
Figure 5.8: Bias–variance decomposition of the error in bias, net-variance, unbiased and biased variance in bagged polynomial SVMs, with respect to the number of iterations. (a) P2 data set (degree = 6, C = 100); (b) Letter-Two data set (degree = 6, C = 1).
5.2.3.3 Comparison of bias–variance decomposition in single and bagged polynomial SVMs
In bagged SVMs, the trend of the error with respect to the degree shows a "U" shape similar to that of single polynomial SVMs (Fig. 5.9). It depends both on bias and unbiased variance. Bias and biased variance are unchanged with respect to single SVMs, while the net-variance is slightly reduced (due to the reduction of the unbiased variance). As a result we have a slight reduction of the overall error.
5.2.4 Bagged dot-product SVM ensembles
This section reports the results of the bias–variance analysis in bagged SVMs, using base learners with dot-product kernels.
5.2.4.1 Bias–variance decomposition of the error
The decomposition of the error is represented with respect to different values of C. The error seems to be relatively independent of C (Fig. 5.10), and no substantial changes are observed either in the bias or in the variance components of the error. In some data sets the bias slightly decreases with C while the unbiased variance slightly increases.
Figure 5.9: Comparison of the bias–variance decomposition between single polynomial SVMs (lines labeled with crosses) and bagged polynomial SVM ensembles (lines labeled with triangles), while varying the degree and for some fixed values of C. P2 data set: (a) C = 1, (b) C = 100. Grey-Landsat data set: (c) C = 1, (d) C = 100.
5.2.4.2 Decomposition with respect to the number of base learners
Considering the bias–variance decomposition of the error with respect to the number of base learners, we can observe that the error reduction arises in the first 10–20 iterations, especially due to the reduction of the unbiased variance. The bias and the biased variance remain substantially unchanged through all the iterations (Fig. 5.11).
Figure 5.10: Bias–variance decomposition of the error in bias, net-variance, unbiased and biased variance in bagged dot-product SVMs, while varying C. (a) Waveform data set (b) Grey-Landsat (c) Letter-Two with noise (d) Spam.
5.2.4.3 Comparison of bias–variance decomposition in single and bagged dot-product SVMs
Fig. 5.12 shows the comparison of the bias–variance decomposition between single and bagged dot-product SVMs. The reduction of the error in bagged ensembles is due to the reduction of the unbiased variance, while the bias is unchanged or slightly increases in bagged dot-product SVMs. The biased variance also remains substantially unchanged. The shape of the error curve is quite independent of the C values, at least for C ≥ 1.
Figure 5.11: Bias–variance decomposition of the error in bias, net-variance, unbiased and biased variance in bagged dot-product SVMs (C = 100), with respect to the number of iterations. (a) Grey-Landsat data set (b) Letter-Two data set.
5.2.5 Bias–variance characteristics of bagged SVM ensembles
Tab. 5.1 summarizes the compared results of the bias–variance decomposition between single SVMs and bagged SVM ensembles. E_SVM stands for the estimated error of single SVMs, E_bag for the estimated error of bagged ensembles of SVMs, and % Error reduction stands for the percent reduction of the error between single and bagged ensembles, that is:

% Error reduction = (E_SVM − E_bag) / E_SVM

% Bias reduction, % NetVar reduction and % UnbVar reduction correspond respectively to the percent bias, net-variance and unbiased variance reduction between single SVMs and bagged ensembles of SVMs. A negative sign means that the corresponding quantity is larger in the bagged ensemble. Note that sometimes the decrement of the net-variance can be larger than 100%: the net-variance can be negative, when the biased variance is larger than the unbiased variance.
As expected, bagging does not reduce the bias (on the contrary, the bias sometimes slightly increases). The net-variance is not eliminated but only partially reduced, and its decrement ranges from 0 to about 40% with respect to single SVMs. Its reduction is due to the unbiased variance reduction, while the biased variance is unchanged. As a result the error decreases, but its decrement is not so noticeable, as it ranges from 0 to about 15% with respect to single SVMs, depending on the kernel and the data set. The overall shape of
Figure 5.12: Comparison of the bias–variance decomposition between single dot-product SVMs (lines labeled with crosses) and bagged dot-product SVM ensembles (lines labeled with triangles), while varying the values of C. (a) Waveform (b) Grey-Landsat (c) Spam (d) Musk.
the curves of the error, bias and variance are very close to those of single SVMs.
5.3 Bias–variance analysis in random aggregated ensembles of SVMs
This section investigates the effect of random aggregation of SVMs on bias and variance
components of the error.
Table 5.1: Comparison of the results between single and bagged SVMs.

             E_SVM    E_bag   % Error red.  % Bias red.  % NetVar red.  % UnbVar red.
Data set P2
RBF-SVM      0.1517   0.1500       1.14        -2.64          3.18           2.19
Poly-SVM     0.2088   0.1985       4.95         4.85          5.08           5.91
D-prod SVM   0.4715   0.4590       2.65         1.11         34.09          15.28
Data set Waveform
RBF-SVM      0.0707   0.0662       6.30        -1.41         26.03          17.82
Poly-SVM     0.0761   0.0699       8.11         0.36         23.78          17.94
D-prod SVM   0.0886   0.0750      15.37        -0.22         37.00          28.20
Data set Grey-Landsat
RBF-SVM      0.0384   0.0378       1.74         2.94         -7.46           3.94
Poly-SVM     0.0392   0.0388       1.05        -4.76         24.80          12.06
D-prod SVM   0.0450   0.0439       2.58        16.87       -165.72         -62.21
Data set Letter-Two
RBF-SVM      0.0745   0.0736       1.20       -25.00         21.63          12.29
Poly-SVM     0.0745   0.0733       1.55       -15.79         13.92          10.41
D-prod SVM   0.0955   0.0878       8.09         2.22         27.55          23.06
Data set Letter-Two with added noise
RBF-SVM      0.3362   0.3345       0.49         1.75         -5.78           0.40
Poly-SVM     0.3432   0.3429       0.09        -0.58          3.06           0.91
D-prod SVM   0.3486   0.3444       1.21        -0.56         10.23           6.09
Data set Spam
RBF-SVM      0.1292   0.1290       0.14        -0.48          1.57           2.22
Poly-SVM     0.1323   0.1318       0.35         2.11         -5.83          -1.19
D-prod SVM   0.1495   0.1389       7.15        -3.16         19.87          16.38
Data set Musk
RBF-SVM      0.0898   0.0920      -2.36        -6.72         22.91          13.67
Poly-SVM     0.1225   0.1128       7.92       -10.49         38.17          37.26
D-prod SVM   0.1501   0.1261      15.97        -2.41         34.56          29.38
Our aim is to gain insight into the way random aggregated ensembles learn, in order to characterize learning in terms of the bias–variance components of the error.
In particular, an extended experimental bias–variance decomposition of the error in random
aggregated SVM ensembles is performed in order to verify the theoretical property of
canceled variance in random aggregation.
The plan of the experiments replicates the previous one we followed for bagged SVM
ensembles, using dot-product, polynomial and gaussian kernels.
We evaluated for each kernel the expected error and its decomposition in bias, net-variance,
unbiased and biased variance with respect to the learning parameters of the base learners.
Then we analyzed the bias–variance decomposition as a function of the number of the base
learners employed. Finally we compared bias and variance with respect to the learning
parameters in random aggregated SVM ensembles and in the corresponding single SVMs,
in order to study the effect of random aggregation on the bias and variance components of
the error.
Here we report only some examples and a summary of the results of the bias–variance analysis in random aggregated SVM ensembles. The experiments we performed with random aggregated ensembles of SVMs are detailed in [181].
5.3.1 Experimental setup
In order to estimate the decomposition of the error in bias, unbiased and biased variance with random aggregated ensembles of SVMs, we used a bootstrap approximation of the unknown distribution P: that is, we drew samples of relatively small size from a relatively large training set, according to a uniform probability distribution. From this standpoint we approximated random aggregation by a sort of undersampled bagging, drawing data from the universe population U represented by a comfortably large training set. The bias–variance decomposition of the error is computed with respect to a separate test set significantly larger than the undersampled training sets.
To estimate the decomposition of the error in bias, net-variance, unbiased and biased
variance with random aggregated ensembles of SVMs, we performed a bias-variance decomposition of the error on the data sets described in Chap. 4.
We split the data into a separate learning set D and test set T. Then we drew with replacement from D n sets of samples D̄i, according to a uniform probability distribution. Each set D̄i is composed of m samples Dij drawn with replacement from D, and each sample Dij contains s examples. The samples Dij are thus collected in n sets D̄i = {Dij}, 1 ≤ j ≤ m.
We used the n sets D̄i to train n random aggregated ensembles, repeating this process for
all the considered SVM models. In order to properly compare the effect of different choices
of the learning parameters on bias–variance decomposition of the error, each SVM model
is represented by a different choice of the kernel type and parameters and it is trained with
the same sets D̄i , 1 ≤ i ≤ n of samples.
Fig. 5.13 summarizes the experimental procedure we adopted to generate the data and
Fig. 5.14 the experimental procedure to evaluate bias–variance decomposition of the error.
Procedure Generate samples
Input arguments:
- Data set S
- Number n of set of samples
- Size of samples s
- Number m of samples collected in each set
Output:
- Sets D̄i = {Dij}, 1 ≤ j ≤ m, 1 ≤ i ≤ n, of samples
begin procedure
[D, T ] = Split(S)
for i = 1 to n
begin
D̄i = ∅
for j = 1 to m
begin
Dij = Draw with replacement(D, s)
D̄i = D̄i + Dij
end
end
end procedure.
Figure 5.13: Procedure to generate samples to be used for bias–variance analysis in random
aggregation
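As an illustration, the procedure of Fig. 5.13 can be sketched in Python as follows; the function name, the fixed 50/50 split and the use of the standard library random module are illustrative choices of ours, not part of the original experimental code.

```python
import random

def generate_samples(S, n, m, s, seed=0):
    """Split the data set S into a learning set D and a test set T,
    then build n sets D_bar[i], each collecting m samples of size s
    drawn uniformly with replacement from D."""
    rng = random.Random(seed)
    S = list(S)
    rng.shuffle(S)
    half = len(S) // 2                      # illustrative 50/50 split
    D, T = S[:half], S[half:]
    D_bar = [
        [[rng.choice(D) for _ in range(s)]  # one sample D_ij of size s
         for _ in range(m)]                 # m samples per set
        for _ in range(n)                   # n sets of samples
    ]
    return D, T, D_bar
```

With s = 100, n = 100 and m = 60 this reproduces the sampling scheme used in the experiments.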
Sets D̄i of samples are drawn with replacement from the training set D, according to a uniform probability distribution, by the procedure Draw with replacement. This process is repeated n times (procedure Generate samples, Fig. 5.13). Then the procedure Bias–Variance analysis (Fig. 5.14) trains different ensembles of random aggregated SVMs (procedure Ensemble train) using the sets of samples generated by the procedure Generate samples. The bias–variance decomposition of the error is performed on the separate test set T using the previously trained random aggregated SVM ensembles (procedure Perform BV analysis).
We employed gaussian, polynomial and dot-product kernels evaluating 110 different SVM
models, considering different combinations of the type of the kernel and learning parameters
for each data set. For each model we set s = 100, n = 100, m = 60, training for each data
set 110 × 100 = 11000 random aggregated ensembles and a total of 110 × 100 × 60 = 660000
Procedure Bias–Variance analysis
Input arguments:
- Test set T
- Number of random aggregated ensembles n
- Number of bootstrap replicates m
- Set of learning parameters A
Output:
- Error, bias, net-variance, unbiased and biased variance BV = {bv(α)}α∈A
of the random aggregated ensemble having base learners with learning parameters α ∈
A.
begin procedure
For each α ∈ A
begin
Ensemble Set(α) = ∅
for i = 1 to n
begin
rand aggr(α, D̄i ) = Ensemble train (α, D̄i )
Ensemble Set(α) = Ensemble Set(α) ∪ rand aggr(α, D̄i )
end
bv(α) = Perform BV analysis(Ensemble Set (α), T )
BV = BV ∪ bv(α)
end
end procedure.
Figure 5.14: Procedure to perform bias–variance analysis on random aggregated SVM
ensembles
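The driver loop of Fig. 5.14 can be sketched as below; train_base and bv_decompose are placeholders for the base-learner training (Ensemble train) and bias–variance estimation (Perform BV analysis) steps, and all names are illustrative.

```python
def bias_variance_analysis(D_bar, T, params, train_base, bv_decompose):
    """For each learning-parameter setting alpha, train one random
    aggregated ensemble per set of samples D_bar[i], then estimate the
    bias-variance decomposition of the error on the test set T."""
    BV = {}
    for alpha in params:                           # loop over SVM models
        ensemble_set = []
        for samples in D_bar:                      # one ensemble per set D_bar[i]
            ensemble_set.append([train_base(alpha, s) for s in samples])
        BV[alpha] = bv_decompose(ensemble_set, T)  # error, bias, variances
    return BV
```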
different SVMs. Considering all the data sets, we trained and tested about 80000 different random aggregated SVM ensembles and a total of about 5 million single SVMs.
5.3.2 Random aggregated RBF-SVM ensembles
This section reports the results of bias–variance analysis in random aggregated RBF-SVMs.
5.3.2.1 Bias–variance decomposition of the error
The decomposition of the error is represented with respect to different values of σ and for
fixed values of C.
Schematically we can observe the following facts:
• The shape of the error is mostly determined by the bias and is in general sigmoid with respect to σ, but sometimes a "U" shape is observed, as the error can increase with σ (Fig. 5.15).
• In all the data sets the net-variance is about 0 for all the values of σ, and the wave-shape of the net-variance is very reduced or completely absent.
• The net-variance is 0 because both biased and unbiased variance are very low, with values close to 0. Only in the Waveform data set are unbiased and biased variance both quite large in the "high bias" region (and partially also in Letter-Two).
• The net-variance is always 0 in the "stabilized region" in all the considered data sets.
• The error is determined almost entirely by the bias: in Fig. 5.15 it is difficult to distinguish the error and bias curves.
5.3.2.2 Decomposition with respect to the number of base learners
Fig. 5.16 a, b and d refer to bias–variance decomposition of the error with respect to the
number of base learners for random aggregated SVMs of the ”stabilized region”. In these
cases we can observe the following facts:
• Most of the decrement of the error occurs within the first iterations (from 10 to 30,
depending on the data set).
• The bias remains unchanged during all the iterations.
• The decrement of the error is almost entirely due to the decrement of the unbiased
variance, and it is larger than in bagged ensembles of SVMs.
On the contrary, Fig. 5.16 c refers to σ values in the "transition region". Also in this case the bias remains unchanged on average (higher than the bias of SVMs of the "stabilized region"), but it oscillates largely, especially during the first 20 iterations. The unbiased variance also oscillates, but tends to decrease with the iterations, lowering the error. The biased variance oscillates in the same way (that is, with the same phase) as the bias, but with a lower amplitude, while the unbiased variance and in particular the net-variance oscillate in a specular way (opposite phase) with respect to the bias. This is observed also in the other data sets (except in Letter-Two with noise). We have no explanation for this behaviour.
Figure 5.15: Bias-variance decomposition of error in bias, net variance, unbiased and biased
variance in random aggregated gaussian SVMs, while varying σ and for some fixed values
of C. P2 data set: (a) C = 1, (b) C = 100. Letter-Two data set: (c) C = 1, (d) C = 100
5.3.2.3 Comparison of bias–variance decomposition in single and random aggregated RBF-SVMs
In all the graphs of this section the data referring to single SVMs are labeled with crosses,
while random aggregated SVMs are labeled with triangles. The corresponding quantities
(e.g. bias, net-variance, etc.) are represented with the same type of line both in single and
random aggregated SVMs.
In random aggregated ensembles net-variance is very close to 0. As a consequence, the
error is in practice reduced to the bias. As in single SVMs, we can distinguish three main
regions with respect to σ:
Figure 5.16: Bias-variance decomposition of error in bias, net variance, unbiased and
biased variance in random aggregated SVM RBF, with respect to the number of iterations.
P2 dataset: (a) C = 1, σ = 0.2, (b) C = 100, σ = 0.5. Letter-Two data set: (c)
C = 100, σ = 1, (d) C = 100, σ = 2
High bias region. In this region the error of single and random aggregated SVMs is about equal, and it is characterized by a very high bias. The net-variance is close to 0, because the biased variance is about equal to the unbiased variance. In most cases they are both close to 0 (Fig. 5.17 a and b). In some cases they are equal but greater than 0, with significantly larger values in single than in random aggregated SVMs (Fig. 5.17 c and d).
Transition region. The bias decreases in the transition region at about the same rate in
single and random aggregated SVM ensembles. The net-variance maintains the wave-shape
also in random aggregated SVMs, but it is lower. In some data sets (Fig. 5.17 a and b),
the net-variance remains low with no significant variations also for small values of σ. For
these reasons the error decreases more quickly in random aggregated SVMs, and the error
of the ensemble is about equal to the bias.
Stabilized region. The net-variance stabilizes, but at lower values (very close to 0)
compared with net-variance of single SVMs. Hence we have a reduction of the error for
random aggregated SVM ensembles in this region. Note that the reduction of the error
depends heavily on the level of the unbiased variance of single SVMs in the stabilized
region. If it is sufficiently high, we can achieve substantial reduction of the error in random
aggregated SVM ensembles. With some data sets the error increases for large values of σ,
mainly for the increment of the bias (Fig. 5.15 a and b).
5.3.3 Random aggregated polynomial SVM ensembles
5.3.3.1 Bias–variance decomposition of the error
The decomposition of the error is represented with respect to different values of the polynomial degree and for fixed values of C.
Schematically we can observe the following facts:
• In all the data sets the net-variance is about 0 for all the values of the polynomial degree, as both biased and unbiased variance are very low, close to 0. Only in some data sets (e.g. P2), with low values of C, can we observe a certain level of unbiased variance, especially with low degree polynomials (Fig. 5.18 a).
• In almost all the considered data sets the error shows a "U" shape with respect to the degree. This shape tends to a flat line if C is relatively large. With the P2 data set the error decreases with the degree (Fig. 5.18).
• The error is determined almost entirely by the bias: its minimum is reached for
specific values of the degree of the polynomial and depends on the characteristics of
the data set.
5.3.3.2 Decomposition with respect to the number of base learners
This section reports data and graphs about the decomposition of bias–variance in random
aggregated SVMs with respect to the number of iterations, that is the number of base
learners used.
Figure 5.17: Comparison of bias-variance decomposition between single RBF-SVMs (lines
labeled with crosses) and random aggregated ensembles of RBF-SVMs (lines labeled with
triangles), while varying σ and for some fixed values of C. Letter-Two data set: (a) C = 1,
(b) C = 100. Waveform data set: (c) C = 1, (d) C = 100.
Fig. 5.19 shows that the bias remains constant throughout the iterations. Most of the error
decrement is achieved within the first 10-20 iterations, and it is almost entirely due to the
decrement of the unbiased variance. The error is reduced to the bias, when the number
of iterations is sufficiently large. The biased variance is low and slowly decreases at each
iteration, while the unbiased variance continues to decrease at each iteration, but most of
its decrement occurs within the first 20 iterations (Fig. 5.19).
Figure 5.18: Bias-variance decomposition of error in bias, net variance, unbiased and biased
variance in random aggregated polynomial SVM, while varying the degree and for some
fixed values of C. P2 data set: (a) C = 1, (b) C = 100. Letter-Two data set: (c) C = 1,
(d) C = 100
5.3.3.3 Comparison of bias–variance decomposition in single and random aggregated polynomial SVMs
In random aggregated polynomial SVMs the error is due almost entirely to the bias. The
bias component is about equal in random aggregated and single SVMs.
In single SVMs we sometimes observe opposite trends between bias and unbiased variance: the bias decreases, while the unbiased variance increases with the degree (Fig. 5.20 a and b). On the contrary, in random aggregated ensembles the net-variance is very close to 0
and the error is due almost entirely to the bias (Fig. 5.20).
Figure 5.19: Bias-variance decomposition of error in bias, net variance, unbiased and biased
variance in random aggregated polynomial SVMs, with respect to the number of iterations.
P2 dataset: (a) C = 1, degree = 6 (b) C = 100, degree = 9. Letter-Two data set: (c)
C = 1, degree = 3, (d) C = 100, degree = 9.
Hence in random aggregated SVMs the shape of the error with respect to the degree depends on the shape of the bias, and consequently the error curve shape is bias-dependent, while in single SVMs it is variance or bias–variance dependent.
The general shape of the error with respect to the degree resembles a "U" curve, or can be flattened depending on the bias trend, especially with relatively large C values.
Figure 5.20: Comparison of bias-variance decomposition between single polynomial SVMs
(lines labeled with crosses) and random aggregated polynomial SVM ensembles (lines labeled with triangles), while varying the degree and for some fixed values of C. P2 data
set: (a) C = 1, (b) C = 100. Grey-Landsat data set: (c) C = 1, (d) C = 100.
5.3.4 Random aggregated dot-product SVM ensembles
5.3.4.1 Bias–variance decomposition of the error
The decomposition of the error is represented with respect to different values of C.
Schematically we can observe the following facts (Fig. 5.21):
• Net-variance is about 0 for all the values of C, as both biased and unbiased variance are very low, close to 0.
• The error, bias and variance seem to be independent of the values of C. Note, however, that in the experiments we used only values of C ≥ 1.
• The error is determined almost totally by the bias.
Figure 5.21: Bias-variance decomposition of error in bias, net variance, unbiased and biased
variance in random aggregated dot-product SVM, while varying C. (a) Grey-Landsat data
set (b) Letter-Two (c) Letter-Two with noise (d) Spam
5.3.4.2 Decomposition with respect to the number of base learners
Considering the number of iterations, the most important facts with dot-product random
aggregated SVM ensembles are the following (Fig. 5.22):
• The bias remains constant.
• Most of the error decrement is achieved within the first 10-20 iterations.
• The error decrement is due to the decrement of the unbiased variance.
• The error is determined almost totally by the bias.
• The biased variance slowly decreases at each iteration.
5.3.4.3 Comparison of bias–variance decomposition in single and random aggregated dot-product SVMs
In all cases the error is about equal to the bias, which remains unchanged with respect to single SVMs. As a consequence, the error shape is equal to the shape of the bias and is independent of the C values, at least for C ≥ 1. As a result we have a significant reduction of the error due to the decrement of the unbiased variance (Fig. 5.23).
5.3.5 Bias–variance characteristics of random aggregated SVM ensembles
The following tables summarize the compared results of bias–variance decomposition between single SVMs and random aggregated SVM ensembles. ESVM stands for the estimated error of single SVMs, Eagg for the estimated error of random aggregated ensembles of SVMs, and % Error reduction stands for the percent reduction of the error between single SVMs and random aggregated ensembles, that is:

%Error reduction = (ESVM − Eagg) / ESVM

% Bias reduction, % NetVar reduction and % UnbVar reduction correspond respectively to the percent bias, net-variance and unbiased variance reduction between single SVMs and random aggregated ensembles of SVMs. A negative sign means that the corresponding quantity is larger in the random aggregated ensemble. Note that sometimes the decrement of
Figure 5.22: Bias-variance decomposition of error in bias, net variance, unbiased and biased
variance in random aggregated dot-product SVM, with respect to the number of iterations.
(a) Waveform (b) Letter-Two (c) Spam (d) Musk.
the net–variance can be larger than 100 %: recall that net–variance can be negative (when
the biased variance is larger than the unbiased variance).
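For instance, the percent reductions of Tab. 5.2 can be recomputed from the estimated errors as in the following snippet (the helper name is ours; the ratio is multiplied by 100 to obtain a percentage):

```python
def percent_reduction(e_svm, e_agg):
    """Percent reduction of an error component between a single SVM
    and the corresponding random aggregated ensemble; a negative
    value means the component is larger in the ensemble."""
    return 100.0 * (e_svm - e_agg) / e_svm

# P2 data set, RBF-SVM entries of Tab. 5.2:
print(round(percent_reduction(0.1517, 0.0495), 2))  # 67.37
```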
Random aggregated ensembles of SVMs strongly reduce net-variance. Indeed in all data
sets the net-variance is near to 0, with a reduction close to 100 % with respect to single
SVMs, confirming the ideal behavior of random aggregating (Sect. 5.1). Unbiased variance
reduction is responsible for this fact, as in all data sets its decrement amounts to about
90 % with respect to single SVMs (Tab. 5.2). As expected bias remains substantially
unchanged, but with the P2 data set with polynomial and gaussian kernels we register a
Figure 5.23: Comparison of bias-variance decomposition between single dot-product SVMs
(lines labeled with crosses) and random aggregated dot-product SVM ensembles (lines
labeled with triangles), while varying the values of C. (a) Waveform (b) Grey-Landsat (c)
Spam (d) Musk.
not negligible decrement of the bias. As a result the error decreases from 15 to about 70 % with respect to single SVMs, depending on the kernel and on the characteristics of the data set. The overall shape of the error curves resembles that of the bias of single SVMs, with a characteristic sigmoid shape with respect to the σ width values for gaussian kernels (which can also become a "U" shape for certain data sets) (Fig. 5.15 and 5.16), a "U" shape with respect to the degree for polynomial kernels (Fig. 5.18 and 5.19), while it is relatively independent of the C values (at least for sufficiently large values of C) for random aggregated linear SVMs (Fig. 5.21).
Table 5.2: Comparison of the results between single and random aggregated SVMs.

              ESVM     Eagg    % Error    % Bias     % NetVar   % UnbVar
                               reduction  reduction  reduction  reduction
Data set P2
RBF-SVM       0.1517   0.0495    67.37      24.52      99.04      85.26
Poly-SVM      0.2088   0.1030    50.65      19.56      92.26      83.93
D-prod SVM    0.4715   0.4611     2.21       0.89     142.65      91.08
Data set Waveform
RBF-SVM       0.0707   0.0501    29.08       1.14     100.58      89.63
Poly-SVM      0.0761   0.0497    34.59       3.68      97.12      89.44
D-prod SVM    0.0886   0.0498    43.74       3.84      99.12      90.69
Data set Grey-Landsat
RBF-SVM       0.0384   0.0300    21.87       3.22      99.95      85.42
Poly-SVM      0.0392   0.0317    19.13       3.17      83.79      80.95
D-prod SVM    0.0450   0.0345    23.33      19.27      69.88      72.57
Data set Letter-Two
RBF-SVM       0.0745   0.0345    53.69       0.00      95.32      92.48
Poly-SVM      0.0745   0.0346    53.54      -5.26      95.46      92.71
D-prod SVM    0.0955   0.0696    27.11       2.22     109.73      92.31
Data set Letter-Two with added noise
RBF-SVM       0.3362   0.2770    17.55       2.92      90.26      87.04
Poly-SVM      0.3432   0.2775    19.13       1.75      95.96      89.42
D-prod SVM    0.3486   0.2925    16.07      -1.68     106.4       89.97
Data set Spam
RBF-SVM       0.1292   0.0844    34.67       6.75      99.74      90.05
Poly-SVM      0.1323   0.0814    38.47      22.33      95.22      86.03
D-prod SVM    0.1495   0.0804    46.22       6.90      94.91      90.24
Data set Musk
RBF-SVM       0.0898   0.0754    16.02       0.39     106.70      93.85
Poly-SVM      0.1225   0.0758    38.12       1.53      97.52      94.02
D-prod SVM    0.1501   0.0761    49.28       0.80      98.30      93.03

5.4 Undersampled bagging
While bagging has been successfully applied to different classification and regression problems [8, 44, 5, 102, 185], random aggregating remains a largely ideal technique, because in most cases the true distribution P is unknown and we can access only a limited and often small sized data set. From a theoretical standpoint we would need to know the usually unknown true distribution of the data, and we should be able to access the (possibly infinite) universe U of the data.
From a different standpoint, random aggregating (using a bootstrap approximation of P) can be viewed as a form of undersampled bagging if we consider the universe U as a data set from which undersampled data sets, that is data sets whose cardinality is much less than the cardinality of U, are randomly drawn with replacement. For instance, this is the way in which we approximated random aggregating in our experiments described in Sect. 5.3.
In real problems, if we have very large learning sets, or on-line available learning data, we could use undersampled bagging in order to overcome the space complexity problems arising from learning on very large data sets, or to allow on-line learning [34].
Indeed in most data mining problems we have very large data sets, and ordinary learning algorithms cannot directly process the data set as a whole. For instance, most implementations of the SVM learning algorithm have an O(n²) space complexity, where n is the number of examples. If n is relatively large (e.g. n = 100000) we need room for 10¹⁰ elements, too costly a memory requirement for most current computers. In these cases we could use relatively small data sets randomly drawn from the large available data set, using undersampled bagging methods to improve performance.
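A minimal sketch of undersampled bagging, assuming two classes labeled 0 and 1 and using a nearest-centroid rule purely as a stand-in for the SVM base learner (all names are illustrative):

```python
import numpy as np

def undersampled_bagging_predict(X_train, y_train, X_test, m=60, s=100, seed=0):
    """Majority vote of m base learners, each trained on a small sample
    of size s drawn with replacement from a large training set.
    Labels are assumed to be 0/1; a simple nearest-centroid classifier
    stands in for the SVM base learner."""
    rng = np.random.default_rng(seed)
    votes = np.zeros(len(X_test))
    for _ in range(m):
        idx = rng.integers(0, len(X_train), size=s)  # bootstrap sample, s << n
        Xs, ys = X_train[idx], y_train[idx]
        c0 = Xs[ys == 0].mean(axis=0)                # class centroids
        c1 = Xs[ys == 1].mean(axis=0)
        d0 = ((X_test - c0) ** 2).sum(axis=1)
        d1 = ((X_test - c1) ** 2).sum(axis=1)
        votes += (d1 < d0)                           # base learner votes for class 1
    return (votes > m / 2).astype(int)               # majority vote
```

Each base learner sees only s examples at a time, so the memory footprint grows with s rather than with the full training set size.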
This situation is very similar to the ideal random aggregation setting: the only difference
is that we use only limited data and a uniform probability distribution to draw the data.
In these cases we could expect a strong decrement of the variance, while bias should remain
substantially unchanged. Indeed our experiments (Sect. 5.3) reported a reduction of the
net-variance over 90 %, as well as no substantial changes in bias.
Moreover, the inherent parallelism of this process should permit a significant speed-up using, for instance, a simple cluster of workstations with a message passing interface [140]. On the other hand, we could use this approach for incremental learning strategies, collecting on-line samples in small data sets and aggregating the resulting classifiers. Of course this approach holds only if the on-line samples are distributed according to a uniform probability distribution along time.
5.5 Summary of bias–variance analysis results in random aggregated and bagged ensembles of SVMs
We conducted an extensive experimental analysis of bias–variance decomposition of the error in random aggregated and bagged ensembles of SVMs, involving the training and testing of more than 10 million SVMs. In both cases we used relatively small data sets (100 examples) bootstrapped from a relatively large data set, and reasonably large test sets to perform a reliable evaluation of bias and variance.
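The evaluation step itself can be sketched as follows for the two-class 0/1 loss, following Domingos' definitions (the function name is ours, and the optimal prediction is assumed to coincide with the recorded labels, i.e. noise is ignored):

```python
import numpy as np

def bias_variance_0_1(preds, targets):
    """Estimate average bias, net-variance, unbiased and biased variance
    of the 0/1 loss from the predictions of many trained classifiers.
    preds: (n_models, n_points) array of class predictions;
    targets: (n_points,) array of true labels (two-class case)."""
    preds = np.asarray(preds)
    targets = np.asarray(targets)
    n_points = preds.shape[1]
    bias = np.zeros(n_points)
    unb = np.zeros(n_points)   # unbiased variance per point
    bsd = np.zeros(n_points)   # biased variance per point
    for j in range(n_points):
        vals, counts = np.unique(preds[:, j], return_counts=True)
        main = vals[np.argmax(counts)]       # main (majority) prediction
        var = np.mean(preds[:, j] != main)   # variance at point j
        if main == targets[j]:
            unb[j] = var                     # deviations increase the error
        else:
            bias[j], bsd[j] = 1.0, var       # deviations decrease the error
    net = unb - bsd                          # net-variance
    return bias.mean(), net.mean(), unb.mean(), bsd.mean()
```

By construction, in the two-class case the average 0/1 loss equals the bias plus the net-variance.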
Figure 5.24: Comparison of the error between single SVMs, bagged and random aggregated ensembles of SVMs. Results refer to 7 different data sets. (a) Gaussian kernels (b) Polynomial kernels (c) Dot-product kernels.
Figure 5.25: Comparison of the relative error, bias and unbiased variance reduction between bagged and single SVMs (lines labeled with triangles), and between random aggregated and single SVMs (lines labeled with squares). B/S stands for bagged versus single SVMs, and R/S for random aggregated versus single SVMs. Results refer to 7 different data sets. (a) Gaussian kernels (b) Polynomial kernels (c) Dot-product kernels.
Considering random aggregated ensembles, the most important fact we can observe is a very large reduction of the net-variance. It is always reduced close to 0, independently
of the type of kernel used (Fig. 5.15, 5.18, 5.21). This behaviour is due primarily to the
unbiased variance reduction, while the bias remains unchanged with respect to the single
SVMs (Fig. 5.17, 5.20, 5.23).
Comparing bias–variance decomposition of the error between single and random aggregated
ensembles of SVMs, we note that the relative error reduction varies from 10 to about 70
%, depending on the data set (Tab. 5.2). This reduction is slightly larger for high values
of the C parameter (that reduces the bias of the base learners) and is due primarily to the
reduction of the unbiased variance. Indeed in all data sets the relative reduction of the
unbiased variance amounts to about 90 %, while bias remains substantially unchanged.
The error of the ensemble is reduced to the bias of the single SVMs, because net and
unbiased variance are largely reduced and close to 0.
Considering the bias-variance decomposition with respect to the number of base learners,
we can observe that most of the decrement of the error occurs within the first iterations
(from 10 to 30, depending on the data set), while the bias and the biased variance remain
unchanged during all the iterations. The decrement of the error is almost entirely due to
the decrement of the net and unbiased variance (Fig. 5.16, 5.19, 5.22).
With bagging we also obtain a reduction of the error, but not as large as with random aggregated ensembles (Fig. 5.24).
Indeed, unlike random aggregating, net and unbiased variance, although reduced, are not
actually dropped to 0 (Fig. 5.4, 5.7, 5.10).
In particular, in our experiments, we obtained a smaller reduction of the average error
(from 0 to 20 %) due to a lower decrement of the net-variance (about 35% against a
reduction over 90 % with random aggregated ensembles), while bias remains unchanged or
slightly increases (Fig. 5.25).
Random aggregating, approximated through undersampled bagging of sufficiently large
training sets, shows a behavior very close to that predicted by theory (Sect. 5.1.1 and
5.1.2): eliminated variance and bias unchanged with respect to single base SVMs.
On the other hand experimental results confirm that bagging can be interpreted as an
approximation of random aggregating (Sect. 5.1.3), as net-variance is reduced, but not
canceled by bootstrap aggregating techniques, while bias remains unchanged or slightly
increases.
The generalization error reduction provided by bootstrap aggregating techniques depends
critically on the variance component of the error and on the bias proper of the base learner
used. Using base learners with low bias and aggregating them through bootstrap replicates
of the data can potentially reduce both the bias and variance components of the error.
Undersampled bagging, as an approximation of random aggregating, can also provide a very significant reduction of the variance and can in practice be applied to data mining problems where learning algorithms cannot comfortably manage very large data sets.
Chapter 6
SVM ensemble methods based on bias–variance analysis
Bias–variance theory provides a way to analyze the behavior of learning algorithms and
to explain the properties of ensembles of classifiers [66, 48, 94]. Some ensemble methods
increase expressive power of learning algorithms, thereby reducing bias [63, 33]. Other
ensemble methods, such as methods based on random selection of input examples and
input features [19, 24] reduce mainly the variance component of the error. In addition
to providing insights into the behavior of learning algorithms, the analysis of the bias–
variance decomposition of the error can identify the situations in which ensemble methods
might improve base learner performances. Indeed the decomposition of the error into bias
and variance can guide the design of ensemble methods by relating measurable properties
of algorithms to the expected performance of ensembles [182]. In particular, bias–variance
theory can tell us how to tune the individual base classifiers so as to optimize the overall
performance of the ensemble.
The experiments on bias–variance decomposition of the error in SVMs gave us interesting
insights into the way SVMs learn (Chap. 4). Indeed, for single SVMs, we provided
a bias–variance characterization of their learning properties, showing and explaining the
relationships between kernel and SVMs parameters and their bias–variance characteristics
(Sect. 4.3). Moreover bias–variance analysis in random aggregated and bagged ensembles
(Chap. 5) showed how ensemble methods based on resampling techniques influence learning
characteristics and generalization capabilities of single SVMs.
From a general standpoint, considering different kernels and different parameters of the
kernel, we can observe that the minima of the error, the bias and the net–variance (and in
particular the unbiased variance) do not coincide. For instance, with RBF-SVMs we
achieve the minima of the error, bias and net–variance for different values of σ
(see, for instance, Fig. 4.5). Similar considerations also apply to polynomial and
dot–product SVMs. Often, when modifying the parameters of the kernel, a gain in bias is
paid for with a loss in variance and vice versa, even if this is not a strict rule.
Moreover, in our experiments, comparing the bias–variance decomposition of the error
between single SVMs and random aggregated ensembles of SVMs, we showed that the relative
error reduction varies from 10 to about 70%, depending on the data set. This reduction is
due primarily to the reduction of the unbiased variance (about 90%), while bias remains
substantially unchanged.
With bagging we also obtain a reduction of the error, but not as large as with random
aggregated ensembles. In particular, the error of bagged ensembles of SVMs depends
mainly on the bias component but, unlike random aggregating, net- and unbiased variance,
although reduced, are not actually driven to 0. In particular, in our experiments, we
obtained a smaller reduction of the average error (from 0 to 20%), due to a lower decrement
of the net-variance (about 35% on average, against a reduction of over 90% with random
aggregated ensembles), while bias remains unchanged or slightly increases.
Hence in both cases we have significant net-variance reduction (due to the unbiased variance
decrement), while bias remains substantially unchanged.
In the light of the results of our extensive analysis for single SVMs and ensembles of SVMs,
we propose two possible ways of applying bias–variance analysis to develop SVM-based
ensemble methods.
The first approach tries to apply bias–variance analysis to enhance both accuracy and
diversity of the base learners. The second research direction consists in bootstrap
aggregating low-bias base learners in order to lower both bias and variance. Regarding the
first approach, only some very general research lines are sketched. For the second direction,
a specific new method, named Lobag (Low bias bagged SVMs), is introduced, together with
some variants. Lobag applies bias–variance analysis to direct the tuning of Support Vector
Machines in order to optimize the performance of bagged ensembles. Specifically, since
bagging is primarily a variance-reduction method, and since the overall error is (to a first
approximation) the sum of bias and variance, SVMs should be tuned to minimize bias before
being combined by bagging.
Numerical experiments show that Lobag compares favorably with bagging, and some
preliminary results show that Lobag can be successfully applied to gene expression data
analysis.
6.1 Heterogeneous Ensembles of SVMs
The analysis of the bias–variance decomposition of the error in SVMs shows that the minima
of the overall error, bias, net–variance, unbiased and biased variance often occur in
different SVM models. These different behaviors of different SVM models could in
principle be exploited to produce diversity in ensembles of SVMs. Although the diversity
of the base learners by itself does not assure that the error of the ensemble will be reduced [121],
the combination of accuracy and diversity in most cases does [43]. As a consequence, we could
select different SVM models as base learners by evaluating their accuracy and diversity
through the bias-variance decomposition of the error.
For instance, our results show that the "optimal region" (the low average loss region) is quite
large in RBF-SVMs (Fig. 4.4). This means that C and σ do not need to be tuned extremely
carefully. From this point of view, we can avoid time-consuming model selection
by combining RBF-SVMs trained with different σ values, all chosen from within the
"optimal region". For instance, if we know that the error curve looks like the one depicted in
Fig. 4.22, we could fit a sigmoid-like curve using only a few values to estimate where
the stabilized region is located. Then we could train a heterogeneous ensemble of SVMs
with different σ parameters (located in the low bias region) and average them according
to their estimated accuracy.
A high-level algorithm for Heterogeneous Ensembles of SVMs could include the following
steps:
1. Identify the "optimal region" through bias–variance analysis of the error.
2. Select SVMs with parameters chosen from within the optimal region defined by
bias-variance analysis.
3. Combine the selected SVMs by majority or weighted voting according to their estimated
accuracy.
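As an illustration, the combination step (3) can be sketched as accuracy-weighted voting over the predictions of the selected SVMs. This is only a minimal sketch: the `weighted_vote` helper and the stub predictions below are our own illustrative assumptions; in practice each row of `predictions` would come from an RBF-SVM trained with a different σ chosen from within the "optimal region", and the weights would be the estimated accuracies.

```python
import numpy as np

def weighted_vote(predictions, weights):
    """Combine class predictions in {-1, +1} by weighted majority voting.

    predictions: array of shape (n_models, n_points), one row per base SVM.
    weights: one weight per model, e.g. its estimated accuracy.
    """
    predictions = np.asarray(predictions)
    weights = np.asarray(weights, dtype=float)
    # weighted sum of the votes; the sign gives the ensemble decision
    scores = weights @ predictions
    return np.where(scores >= 0, 1, -1)

# Stub predictions of three hypothetical SVMs on three test points,
# weighted by their (hypothetical) estimated accuracies.
preds = [[1, 1, -1],
         [1, -1, -1],
         [-1, 1, -1]]
print(weighted_vote(preds, [0.9, 0.8, 0.7]))  # -> [ 1  1 -1]
```

Majority voting is the special case in which all weights are equal.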
We could use different methods or heuristics to find the "optimal region" (see Sect. 4.2.1.3),
and we also have to define the criterion used to select the SVM models inside the "optimal
region". The combination could also be performed using other approaches, such as the
minimum, maximum, average and OWA aggregating operators [105], the Behavior-Knowledge
Space method [87], fuzzy aggregation rules [188], decision templates [118] or meta-learning
techniques [150]. Bagging and boosting [63] can also be combined with this approach to
further improve the diversity and accuracy of the base learners.
If we apply bootstrap aggregating methods to the previous approach, exploiting the
fact that the most important learning parameter of gaussian kernels is the spread σ
(Sect. 4.2.1), we obtain the following σ-Heterogeneous Ensembles of bagged SVMs:
1. Apply bias-variance analysis to SVMs, in order to identify the low bias region
with respect to the kernel parameter σ.
2. Select a subset of values for σ, chosen from within the optimal region defined by
bias-variance analysis (for instance n values).
3. For each σ value selected in the previous step, train a bagged ensemble, for a total
of n bagged ensembles.
4. Combine the n ensembles by majority or weighted voting according to their estimated
accuracy.
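The steps above can be sketched as follows. This is a structural sketch only: the stand-in "learner" is a trivial midpoint-threshold classifier that ignores σ, so that the example stays self-contained; in a real instantiation `train` would fit an RBF-SVM with spread σ on each bootstrap replicate. All helper names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def bagged_ensemble(train, X, y, sigma, n_bags=30):
    """Step 3: train one bagged ensemble of n_bags learners for a given sigma."""
    models = []
    for _ in range(n_bags):
        idx = rng.integers(0, len(X), size=len(X))  # bootstrap replicate of the data
        models.append(train(X[idx], y[idx], sigma))
    return models

def predict_bag(models, X):
    """Majority vote over the learners of one bagged ensemble ({-1, +1} labels)."""
    return np.where(sum(m(X) for m in models) >= 0, 1, -1)

def sigma_heterogeneous_bagging(train, X, y, sigmas, n_bags=30):
    """Steps 2-4: one bagged ensemble per selected sigma, combined by majority voting."""
    ensembles = [bagged_ensemble(train, X, y, s, n_bags) for s in sigmas]
    return lambda Xt: np.where(sum(predict_bag(e, Xt) for e in ensembles) >= 0, 1, -1)

# Stand-in "learner": a midpoint-threshold classifier that ignores sigma;
# a real instantiation would train an RBF-SVM with spread sigma here.
def stub_train(X, y, sigma):
    t = 0.5 * (X[y == 1].mean() + X[y == -1].mean())
    return lambda Xt: np.where(Xt > t, 1, -1)

X = np.linspace(-1, 1, 40)
y = np.where(X > 0, 1, -1)
predict = sigma_heterogeneous_bagging(stub_train, X, y, sigmas=[0.1, 0.5, 1.0])
print(predict(np.array([-0.5, 0.5])))  # -> [-1  1]
```

The combination in the last line operates at the ensemble level; voting directly over all base learners is the alternative discussed below.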
In Sect. 4.2.1 we showed that with gaussian kernels the optimal region with respect to σ is
quite large, and also that σ is the most relevant parameter affecting the performance of
gaussian SVMs. Hence we could select different σ values in order to improve diversity in
the ensemble, while maintaining a high accuracy. We could also apply explicit measures of
diversity [121] to select appropriate subsets of σ values. The variance of the set of
σ-heterogeneous SVMs is then lowered using bagging. Training multiple bagged SVM ensembles is
computationally feasible, as our experiments showed that the error usually stabilizes
within the first 20-30 iterations (Sect. 5.2.2). These results were also confirmed in other
experimental applications of bagged SVMs, for instance in bioinformatics [185].
This approach presents several open problems. Even if we discuss this point in Chapter 4,
we need to choose an appropriate criterion to define the "optimal region": for instance,
optimal in the sense of minimum overall error or minimum bias? Moreover, we need to
define the relationships between diversity and accuracy in selecting the "optimal" subset of
σ values. Other questions are which diversity measure is most appropriate in this
context, and whether the combination in the last step has to be performed at the base learner
or at the ensemble level.
Another, more general approach, inspired by Breiman's random forests [19], could use
randomness at different levels to improve the performance of ensemble methods. For instance,
besides random selection of input samples, we could consider random selection of features,
or other types of randomness as well. In this context bias–variance analysis could select
"appropriate" subsets of learning parameters, while randomness at different levels could be
used to reduce the variance and/or bias components of the error.
6.2 Bagged Ensemble of Selected Low-Bias SVMs
In chapter 5 we showed that random aggregating removes all variance, leaving only bias
and noise. Hence, if bagging is a good approximation to random aggregating, it will also
remove most of the variance. As a consequence, to minimize the overall error, bagging
should be applied to base learners with minimum bias.
6.2.1 Parameters controlling bias in SVMs
We propose to tune SVMs to minimize the bias and then apply bagging to reduce (if not
eliminate) variance, resulting in an ensemble with very low error. The key challenge, then,
is to find a reasonable way of tuning SVMs to minimize their bias. The bias of SVMs is
typically controlled by two parameters. First, recall that the objective function for (soft
margin) SVMs has the form ||w||^2 + C Σ_i ξ_i, where w is the vector of weights computed
by the SVM and the ξ_i are the margin slacks, which are non-zero for data points that are
not sufficiently separated by the decision boundary. The parameter C controls the tradeoff
between fitting the data (achieved by driving the ξ_i to zero) and maximizing the margin
(achieved by driving ||w|| to zero). Setting C large should tend to minimize bias.
The second parameter that controls bias arises only in SVMs that employ parameterized
kernels, such as the polynomial kernel (where the parameter is the degree d of the
polynomial) and RBF kernels (where the parameter is the width σ of the gaussian kernel). In
Chap. 4 we showed that in gaussian and polynomial SVMs bias depends critically on these
parameters.
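To make the role of C concrete, the toy computation below evaluates the soft-margin objective ||w||^2 + C Σ_i ξ_i for two candidate separators of a 1-D data set. The data and the weight vectors are hypothetical, chosen only to illustrate the tradeoff described above: a small C favors the small-norm (wide-margin) solution, while a large C favors the small-slack, data-fitting solution.

```python
import numpy as np

def svm_objective(w, b, X, y, C):
    """Soft-margin SVM objective ||w||^2 + C * sum_i xi_i,
    with slacks xi_i = max(0, 1 - y_i * (w . x_i + b))."""
    margins = y * (X @ w + b)
    slacks = np.maximum(0.0, 1.0 - margins)
    return float(w @ w + C * slacks.sum())

# Hypothetical 1-D two-class data and two candidate weight vectors.
X = np.array([[-2.0], [-0.5], [0.5], [2.0]])
y = np.array([-1, -1, 1, 1])
w_wide = np.array([0.4])  # small ||w||: wide margin, but some slack
w_fit = np.array([2.0])   # large ||w||: all points beyond the margin, zero slack

for C in (0.1, 10.0):
    print(C, svm_objective(w_wide, 0.0, X, y, C), svm_objective(w_fit, 0.0, X, y, C))
# With C = 0.1 the wide-margin solution has the lower objective;
# with C = 10 the data-fitting (low-bias) solution wins.
```

In a library implementation such as scikit-learn's SVC, these knobs correspond to the `C`, `gamma` and `degree` parameters.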
6.2.2 Aggregating low bias base learners by bootstrap replicates
Bagging is an ensemble method effective for unstable learners. Under the bootstrap assumption, it reduces only variance. From bias-variance decomposition we know that unbiased variance reduces the error, while biased variance increases the error.
In theory, the bagged ensemble having a base learner with the minimum estimated bias will
be the one with the minimum estimated generalization error, as the variance of the single
base learner will be eliminated by the bagged ensemble, and the estimated generalization
error will be reduced to the estimated bias of the single base learner. Indeed the bias
(without noise) is B(x) = L(t, y_m), where L is the loss function, t is the target, and the
main prediction y_m = arg min_y E_D[L(y_D, y)]; for a classification problem y_m is the
most voted class, that is, the class selected by the bagged ensemble.
Hence bagging should be applied to low-bias classifiers, because the biased variance will
be small and bagging is essentially a variance reduction method, especially when well-tuned
low-bias base classifiers are used.
Summarizing, we can schematically consider the following observations:
• We know that bagging lowers net–variance (in particular unbiased variance) but not
bias.
• From Domingos' bias-variance decomposition we know that unbiased variance reduces
the error, while biased variance increases it. Hence bagging should be applied
to low-bias classifiers, because the biased variance will be small.
• For single SVMs, the minimum of the error and the minimum of the bias are often
achieved for different values of the tuning parameters C, d, and σ.
• SVMs are strong, low-biased learners, but this property depends on the proper
selection of the kernel and its parameters.
• If we can identify low-bias base learners with non-negligible unbiased variance,
bagging can lower the error.
• Bias–variance analysis can identify SVMs with low bias.
We could try to exploit the low bias of a base learner to build a bagged ensemble that
combines the variance reduction peculiar to bagging with low bias, in order to reduce the
generalization error. This is the key idea of Lobag (Low bias bagged ensembles), that is,
bagged ensembles of low-bias learning machines:
1. Estimate the bias-variance decomposition of the error for different SVM models.
2. Select the SVM model with the lowest bias.
3. Perform bagging using as base learner the SVM with the lowest estimated bias.
From this algorithmic scheme, a major problem is the selection of a base learner with
minimum estimated bias for a given data set. That is, given a learning set D and a parametric
learning algorithm L(·, α) that generates a model fα = L(·, α), with α representing the
parameters of the learning algorithm L, we need to find:

f*_B = arg min_α B(fα, D)        (6.1)
where B() represents the bias of the model fα estimated using the learning data set D.
This in turn requires an efficient way to estimate the bias–variance decomposition of the
error.
6.2.3 Measuring Bias and Variance
To estimate bias and variance, we could use cross-validation in conjunction with bootstrap,
or out-of-bag estimates (especially if we have small training sets), or hold-out techniques
in conjunction with bootstrap techniques if we have sufficiently large training sets.
We propose to apply out-of-bag procedures [19] to estimate the bias and variance of SVMs
trained with various parameter settings (see also Sect. 3.2.2). The procedure works as
follows. First, we construct B bootstrap replicates of the available training data set D
(e.g., B = 200): D1, ..., DB. Then we apply a learning algorithm L to each replicate Db
to obtain a hypothesis fb = L(Db). For each bootstrap replicate Db, let Tb = D\Db be
the ("out-of-bag") data points that do not appear in Db. We apply hypothesis fb to the
examples in Tb and collect the results.
Consider a particular training example (x, t). On average, this point will appear in 63.2%
of the bootstrap replicates Db and hence in about 36.8% of the out-of-bag sets Tb. Let
K be the number of times that (x, t) was out-of-bag; K will be approximately 0.368B.
The optimal prediction at x is just t. The main prediction ym is the class that is most
frequently predicted among the K predictions for x. Hence, the bias is 0 if ym = t and 1
otherwise. The variance V(x) is the fraction of times that fb(x) ≠ ym. Once the bias and
variance have been computed for each individual point x, they can be aggregated to give
B, Vu, Vb, and Vn for the entire data set D.
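A minimal numpy sketch of this per-point computation follows; the function name and the tiny vote vectors are our own illustrative assumptions. The short simulation at the end checks the 63.2% / 36.8% bootstrap argument mentioned above.

```python
import numpy as np

def oob_bias_variance(predictions, oob_mask, t):
    """Out-of-bag bias and variance at one point (x, t), as described above.

    predictions[b]: f_b(x) in {-1, +1}; oob_mask[b]: True when x is in T_b.
    """
    p = np.asarray(predictions)[np.asarray(oob_mask)]
    K = len(p)                                              # ~0.368 * B on average
    y_m = 1 if (p == 1).sum() >= (p == -1).sum() else -1    # main prediction
    bias = 0 if y_m == t else 1                             # 0/1 bias at x
    variance = float(np.mean(p != y_m))                     # fraction disagreeing with y_m
    return bias, variance

# Five hypothetical out-of-bag predictions at a point with target t = 1:
print(oob_bias_variance([1, 1, -1, 1, -1], [True] * 5, t=1))  # -> (0, 0.4)

# Sanity check of the 63.2% / 36.8% bootstrap argument:
rng = np.random.default_rng(0)
n, B = 100, 5000
oob = sum(0 not in rng.integers(0, n, size=n) for _ in range(B)) / B
print(round(oob, 3))  # close to 0.368
```

Averaging these per-point quantities over all x then yields the data-set-level B, Vu, Vb and Vn.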
6.2.4 Selecting low-bias base learners
Considering the second step of the Lobag algorithm (Sect. 6.2.2), that is the selection of
the low-bias SVM model, different variants can be provided, depending on the type of kernel
and parameters considered and on the way the bias is estimated for the different SVM
models:
1. Selecting the RBF-SVM with the lowest bias with respect to the C and σ parameters.
2. Selecting the polynomial-SVM with the lowest bias with respect to the C and degree
parameters.
3. Selecting the dot–prod-SVM with the lowest bias with respect to the C parameter.
4. Selecting the SVM with the lowest bias with respect to both the kernel and the kernel
parameters.
Note that here we propose SVMs as base learners, but other low-bias base learners (for
instance MLPs) could in principle be used, as long as an analysis of their bias-variance
characteristics suggests applying them with bootstrap aggregating techniques. Of course,
we cannot expect a large error reduction if the bias–variance analysis shows that the base
learner has a high bias and a low unbiased variance.
A problem left uncovered in this work is the estimate of the noise in real data sets. A
straightforward approach simply consists in disregarding it, but in this way we could
overestimate the bias. Some heuristics are proposed in [94], but the problem remains
substantially unresolved.
6.2.5 Previous related work
Lobag can be interpreted as a variant of bagging: it estimates the bias of the SVM classifiers,
selects low-bias classifiers, and then combines them by bootstrap aggregating.
Previous work with other classifiers is consistent with this approach. For example, several
studies have reported that bagged ensembles of decision trees often give better results
when the trees are not pruned [8, 41]. Unpruned trees have low bias and high variance.
Similarly, studies with neural networks have found that they should be trained with lower
weight decay and/or larger numbers of epochs before bagging to maximize accuracy of the
bagged ensemble [5].
Unlike most learning algorithms, support vector machines have a built-in mechanism for
variance reduction: from among all possible linear separators, they seek the maximum
margin classifier. Hence, one might expect that bagging would not be very effective with
SVMs. Previous work has produced varying results. On several real-world problems,
bagged SVM ensembles are reported to give improvements over single SVMs [102, 185].
But for face detection, Buciu et al. [23] report negative results for bagged SVMs.
A few other authors have explored methods for tuning SVMs in ensembles. Collobert et
al. [34] proposed solving very large scale classification problems by using meta-learning
techniques combined with bagging. Derbeko et al. [40] applied an optimization technique
from mathematical finance to reduce the variance of SVMs.
6.3 The Lobag algorithm
The Lobag algorithm accepts the following inputs: (a) a data set D = {(xi, yi), i = 1, ..., n},
with xi ∈ R and yi ∈ {−1, 1}, (b) a learning algorithm L(·, α), with tuning parameters α, and
(c) a set A of possible settings of the α parameters to try. Lobag estimates the bias of
each parameter setting α ∈ A, chooses the setting that minimizes the estimated bias, and
applies the standard bagging algorithm to construct a bag of classifiers using L(·, α) with
the chosen α value.
Unlike bagging, Lobag selects the hypothesis with the lowest estimated bias to build the
bootstrap aggregated classifier. As a consequence, the core of the algorithm consists in
evaluating the bias–variance decomposition of the error while varying the learning
parameters α. The remainder of this section provides the pseudo-code for Lobag.
6.3.1 The Bias–variance decomposition procedure
This procedure estimates the bias–variance decomposition of the error for a given learning
algorithm L and learning parameters α.
The learning algorithm L returns a hypothesis fα = L(D, α) using a learning set D, and it
is applied to multiple bootstrap replicates Db of the learning set D in order to generate a
set Fα = {fα^b, b = 1, ..., B} of hypotheses fα^b. The procedure returns the models Fα and
the estimates of their loss and bias. For each learning parameter it calls Evaluate BV, a
procedure that provides an out-of-bag estimate of the bias–variance decomposition.
Procedure [V, F] BV decomposition (L, A, D, B)
Input:
- Learning algorithm L
- Set of algorithm parameters A
- Data set D
- Number of bootstrap samples B
Output:
- Set V of triplets (α, loss, bias), where loss and bias are the estimated loss and bias of
the model trained through the learning algorithm L with algorithm parameters α.
- Set of ensembles F = {Fα}α∈A, with Fα = {fα^b, b = 1, ..., B}
begin procedure
V =∅
F =∅
for each α ∈ A
begin
Fα = ∅
Tα = ∅
for each b from 1 to B
begin
Db = Bootstrap replicate(D)
fαb = L(Db , α)
Tb = D\Db
Fα = Fα ∪ fαb
Tα = Tα ∪ Tb
end
F = F ∪ Fα
[loss, bias, variance] = Evaluate BV (Fα , Tα )
V = V ∪ (α, loss, bias)
end
end procedure.
The following procedure Evaluate BV provides an out-of-bag estimate of the bias–variance
decomposition of the error for a given model. The function ||z|| is equal to 1 if z is true, and
0 otherwise. Ex [Q(x)] represents the expected value of Q(x) with respect to the random
variable x.
Procedure [ls, bs, var] Evaluate BV (F, T)
Input:
- Set F = {fb, b = 1, ..., B} of models trained on bootstrapped data
- Set T = {Tb, b = 1, ..., B} of out-of-bag data sets
Output:
- Out-of-bag estimate of the loss ls of model F
- Out-of-bag estimate of the bias bs of model F
- Out-of-bag estimate of the net variance var of model F
begin procedure
for each x ∈ ∪b Tb
begin
K = |{Tb | x ∈ Tb, 1 ≤ b ≤ B}|
p1(x) = (1/K) Σ_{b=1}^{B} ||(x ∈ Tb) and (fb(x) = 1)||
p−1(x) = (1/K) Σ_{b=1}^{B} ||(x ∈ Tb) and (fb(x) = −1)||
ym = arg max(p1, p−1)
B(x) = |ym − t| / 2
Vu(x) = (1/K) Σ_{b=1}^{B} ||(x ∈ Tb) and (B(x) = 0) and (ym ≠ fb(x))||
Vb(x) = (1/K) Σ_{b=1}^{B} ||(x ∈ Tb) and (B(x) = 1) and (ym ≠ fb(x))||
Vn(x) = Vu(x) − Vb(x)
Err(x) = B(x) + Vn(x)
end
ls = Ex[Err(x)]
bs = Ex[B(x)]
var = Ex[Vn(x)]
end procedure.
Even if the bias–variance decomposition of the error could, in principle, be evaluated
using other methods, such as multiple hold-out sets or cross-validation, the out-of-bag
estimate is cheaper and allows us to exploit all the available data without separating
the learning set into a training and a validation set. Moreover, the bias estimated for a
single learning machine corresponds to the estimated error of the bagged ensemble having
the same learning machine as base learner.
6.3.2 The Model selection procedure
After estimating the bias–variance decomposition of the error for the different models, we
need to select the model with the lowest bias. This procedure straightforwardly chooses
the learning parameters corresponding to the models with the lowest estimated bias
and loss.
Procedure [αB , αL , Bmin , Lmin , BLmin ] Select model (V )
Input:
- Set V of triplets (α, loss, bias), where loss and bias are the estimated loss and bias of
the model trained through the learning algorithm L with algorithm parameters α.
Output:
- Learning parameter αB corresponding to the model with the estimated minimum bias
- Learning parameter αL corresponding to the model with the estimated minimum loss
- Minimum Bmin of the bias values collected in V
- Minimum Lmin of the loss values collected in V
- Bias BLmin corresponding to the minimum loss Lmin
begin procedure
Lmin = minv∈V v.loss
Bmin = minv∈V v.bias
αL = v.α s.t. v.loss = Lmin
αB = v.α s.t. v.bias = Bmin
BLmin = v.bias s.t. v.loss = Lmin
end procedure.
6.3.3 The overall Lobag algorithm
Using the procedure BV decomposition we can implement a version of the Lobag algorithm
that exhaustively explores a given set of learning parameters in order to build a low bias
bagged ensemble.
Using the out-of-bag estimate of the bias–variance decomposition of the error, the procedure
Select model selects the model with the minimum bias and/or minimum loss and returns
the parameter values αB and αL that correspond respectively to the model with minimum
bias and the model with minimum loss. Then the Lobag and bagged ensembles are chosen
through the procedure Select ensemble: the Lobag ensemble has base learners with the
minimum estimated bias, while the bagged ensemble has base learners with the minimum
estimated loss.
Algorithm Lobag exhaustive
Input:
- Learning algorithm L
- Set of algorithm parameters A
- Data set D
- Number of bootstrap samples B
Output:
- Selected Lobag ensemble: FLob = {fαB^b, b = 1, ..., B}
- Selected bagged ensemble: FBag = {fαL^b, b = 1, ..., B}
- Oob error of the Lobag ensemble : Bmin
- Oob error of the bagged ensemble : BLmin
- Oob error of the single model : Lmin
begin algorithm
V =∅
F =∅
[V, F] = BV decomposition (L, A, D, B)
[αB , αL , Bmin , Lmin , BLmin ] = Select model (V )
FLob = {fαB^b, b = 1, ..., B} = Select ensemble (F, αB)
FBag = {fαL^b, b = 1, ..., B} = Select ensemble (F, αL)
end algorithm.
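Translating the exhaustive scheme above into runnable form, the sketch below wires together an out-of-bag bias estimate and the final bagging step. It is only a sketch under stated assumptions: the stand-in "learner" ignores its training sample and simply thresholds at alpha (standing in for an SVM trained with parameters α), and the helper names are ours, not the thesis pseudocode's.

```python
import numpy as np

rng = np.random.default_rng(0)

def oob_loss_bias(train, X, y, alpha, B=50):
    """Out-of-bag estimates of the 0/1 loss and bias for parameters alpha."""
    n = len(X)
    preds = np.full((B, n), np.nan)
    for b in range(B):
        idx = rng.integers(0, n, size=n)        # bootstrap replicate D_b
        oob = np.setdiff1d(np.arange(n), idx)   # out-of-bag set T_b = D \ D_b
        preds[b, oob] = train(X[idx], y[idx], alpha)(X[oob])
    loss = bias = 0.0
    for i in range(n):
        p = preds[~np.isnan(preds[:, i]), i]
        y_m = 1 if (p == 1).sum() >= (p == -1).sum() else -1  # main prediction
        bias += float(y_m != y[i])
        loss += float(np.mean(p != y[i]))
    return loss / n, bias / n

def lobag_exhaustive(train, X, y, alphas, B=50):
    """Pick the alpha with minimum estimated bias, then bag with that alpha."""
    stats = {a: oob_loss_bias(train, X, y, a, B) for a in alphas}
    alpha_B = min(alphas, key=lambda a: stats[a][1])
    models = []
    for _ in range(B):
        idx = rng.integers(0, len(X), size=len(X))
        models.append(train(X[idx], y[idx], alpha_B))
    predict = lambda Xt: np.where(sum(m(Xt) for m in models) >= 0, 1, -1)
    return predict, alpha_B

# Stand-in for L(D, alpha): a threshold classifier at alpha (ignores the sample).
def stub_train(X, y, alpha):
    return lambda Xt: np.where(Xt > alpha, 1, -1)

X = np.linspace(-1, 1, 30)
y = np.where(X > 0, 1, -1)
predict, alpha_B = lobag_exhaustive(stub_train, X, y, alphas=[-0.5, 0.0, 0.5])
print(alpha_B)  # -> 0.0, the zero-bias setting
```

The exhaustive search over `alphas` mirrors the loop over A in BV decomposition; the multidimensional search variants mentioned below would replace that loop.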
In order to speed up the computation, we could design variants of the exhaustive Lobag
algorithm. For example, we could apply multidimensional search methods, such as Powell's
method [149], to select the tuning values that minimize the bias.
Lobag presents several limitations. Like classical bagging, it is only an approximation of
random aggregating: there is no guarantee that the net-variance is canceled. Moreover, if
the variance is small, we cannot expect a significant decrement of the error. For data sets
where the minima of the bias and of the loss are achieved for the same learning parameters,
Lobag cannot improve on bagging.
6.3.4 Multiple hold-out Lobag algorithm
This procedure shows how to apply Lobag in a multiple hold-out experimental setting,
that is, when multiple random splits of the data into separate training and test sets are
provided, in order to reduce the effect of a particular split of the data on the evaluation of
the generalization performance of the learning machines.
Algorithm Multiple hold-out Lobag
Input:
- Learning algorithm L
- Set of algorithm parameters A
- Data set S
- Number of bootstrap samples B
- Number of splits n
Output:
- Oob estimate of the error of the Lobag ensemble : Bmin
- Oob estimate of the error of the bagged ensemble : BLmin
- Oob estimate of the error of the single model : Lmin
- Hold-out estimate of the error of the Lobag ensemble : Llobag
- Hold-out estimate of the error of the bagged ensemble : Lbag
- Hold-out estimate of the error of the single model : Lsingle
begin algorithm
for each i from 1 to n
begin
Vi = ∅
Fi = ∅
[Di , Ti ] = Split(S)
[Vi , Fi ] = BV decomposition (L, A, Di , B)
end
for each α ∈ A
begin
loss = (1/n) Σ_{i=1}^{n} (v.loss | v ∈ Vi, v.α = α)
bias = (1/n) Σ_{i=1}^{n} (v.bias | v ∈ Vi, v.α = α)
V = V ∪ (α, loss, bias)
end
[αB , αL , Bmin , Lmin , BLmin ] = Select model (V )
for each i from 1 to n
begin
FLob^i = {fαB^(i,b), b = 1, ..., B} = Select ensemble (Fi, αB)
FBag^i = {fαL^(i,b), b = 1, ..., B} = Select ensemble (Fi, αL)
end
Lsingle = Calc avg loss ({fαL^i, i = 1, ..., n}, {Ti, i = 1, ..., n})
Lbag = Calc avg loss ({FBag^i, i = 1, ..., n}, {Ti, i = 1, ..., n})
Llobag = Calc avg loss ({FLob^i, i = 1, ..., n}, {Ti, i = 1, ..., n})
end algorithm
The algorithm generates the Lobag ensemble using multiple splits of a given data set S
(procedure Split) into separate learning sets Di and test sets Ti. On each learning set Di
an out-of-bag estimate of the bias–variance decomposition of the error is performed
(procedure BV decomposition, Sect. 6.3.1). The bias and the loss of each model are
evaluated by averaging the estimated bias and loss over the training sets Di. Then the
SVMs with the parameter αB that corresponds to the minimum estimated bias are selected
as base learners for the Lobag ensemble (procedure Select model). Lobag and bagged
ensembles are built through the procedure Select ensemble, and the loss Llobag of the
Lobag ensemble is estimated by averaging the error of the n ensembles FLob^i over the
test sets Ti, where FLob^i = {fαB^(i,b), b = 1, ..., B} and fαB^(i,b) is the SVM trained on
the b-th bootstrap sample obtained from the i-th training set Di using the learning
parameter αB.
The algorithm provides a hold-out estimate of the generalization error of the Lobag and
bagged ensembles, averaging the resulting losses over the different test sets Ti. The
procedure Calc avg loss simply returns the average of the losses of the ensemble tested on
the different test sets:
Procedure [Err] Calc avg loss ({fi, i = 1, ..., n}, {Ti, i = 1, ..., n})
Input arguments:
- Set {fi, i = 1, ..., n} of the models trained on the different S\Ti learning sets
- Set {Ti, i = 1, ..., n} of the multiple hold-out test sets Ti
Output:
- Estimated average loss Err
begin procedure
Err = 0
for each i from 1 to n
begin
e = fi (Ti )
Err = Err + e
end
Err = Err/n
end procedure.
6.3.5 Cross-validated Lobag algorithm
This procedure applies the Lobag algorithm in the experimental framework of cross-validation.
Algorithm Cross-validated Lobag
Input:
- Learning algorithm L
- Set of algorithm parameters A
- Data set D
- Number of bootstrap samples B
- Number of folds n
Output:
- Oob estimate of the error of the Lobag ensemble : Bmin
- Oob estimate of the error of the bagged ensemble : BLmin
- Oob estimate of the error of the single model : Lmin
- Cross-validated estimate of the error of the Lobag ensemble : Llobag
- Cross-validated estimate of the error of the bagged ensemble : Lbag
- Cross-validated estimate of the error of the single model : Lsingle
begin algorithm
{Di, i = 1, ..., n} = Generate folds (D, n)
for each i from 1 to n
begin
Vi = ∅
Fi = ∅
[Vi, Fi] = BV decomposition (L, A, D\Di, B)
end
for each α ∈ A
begin
loss = (1/n) Σ_{i=1}^{n} (loss of the element v ∈ Vi s.t. v.α = α)
bias = (1/n) Σ_{i=1}^{n} (bias of the element v ∈ Vi s.t. v.α = α)
V = V ∪ (α, loss, bias)
end
[αB , αL , Bmin , Lmin , BLmin ] = Select model (V )
for each i from 1 to n
begin
FLob^i = {fαB^(i,b), b = 1, ..., B} = Select ensemble (Fi, αB)
FBag^i = {fαL^(i,b), b = 1, ..., B} = Select ensemble (Fi, αL)
end
Lsingle = Calc avg loss ({fαL^i, i = 1, ..., n}, {Di, i = 1, ..., n})
Lbag = Calc avg loss ({FBag^i, i = 1, ..., n}, {Di, i = 1, ..., n})
Llobag = Calc avg loss ({FLob^i, i = 1, ..., n}, {Di, i = 1, ..., n})
end algorithm
The selection of the Lobag ensemble is performed through a cross-validated out-of-bag
estimate of the bias–variance decomposition of the error. The data set is divided into n
separate folds through the procedure Generate folds. The oob estimate of the bias–variance
decomposition of the error is performed on each fold, and the overall estimates of bias and
loss are computed by averaging over the different folds. The algorithm also provides a
cross-validated estimate of the generalization error of the resulting Lobag and bagged
ensembles (procedure Calc avg loss).
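The fold handling and the per-parameter averaging can be sketched as follows; `generate_folds` and the averaging helper are our own minimal stand-ins for the corresponding steps of the pseudocode (Generate folds and the loop over α), not the thesis implementation.

```python
import numpy as np

def generate_folds(n, k, seed=0):
    """Split the indices 0..n-1 into k disjoint folds (cf. procedure Generate folds)."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

def average_over_folds(per_fold):
    """Average the (loss, bias) estimates of one parameter setting over the folds."""
    return np.asarray(per_fold, dtype=float).mean(axis=0)

folds = generate_folds(10, 3)
print([len(f) for f in folds])                           # -> [4, 3, 3]
print(average_over_folds([(0.20, 0.10), (0.40, 0.30)]))  # -> [0.3 0.2]
```

The model with the lowest averaged bias would then be selected exactly as in Select model.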
6.3.6 A heterogeneous Lobag approach
Using cross-validation or multiple hold-out techniques to evaluate the error, instead of
using the "best" low-bias model obtained by averaging the bias over all the folds or
splits of the data, we could select the model with the lowest bias for each fold/split.
In this way we could in principle obtain different models, each one well-tuned for a
specific fold/split.
Then we could combine them by majority or weighted voting, or by multiple bootstrap
aggregating in order to lower the variance. According to this second approach, we could
bag each model selected on each fold/split, combining the different ensembles by majority
or weighted voting. We could also introduce a second-level meta-learner to combine the
base learners and the ensembles. This general approach could introduce diversity in the
ensemble, while at the same time preserving the accuracy of the different "heterogeneous"
base learners.
6.4 Experiments with Lobag
We performed numerical experiments on different data sets to test the Lobag ensemble
method using SVMs as base learners. We compared the results with single SVMs and
classical bagged SVM ensembles.
6.4.1 Experimental setup
We employed 7 different two-class data sets, both synthetic and "real". We selected two
synthetic data sets (P2 and a two-class version of Waveform) and 5 "real" data sets
(Grey-Landsat; Letter, reduced to the two-class problem of discriminating between the
letters B and R; Letter with 20% added noise; Spam; and Musk). Most of them are from the
UCI repository [135].
We applied two different experimental settings, using the same data sets, in order to
compare Lobag, classical bagging and single SVMs.
At first, we employed small D training sets and large test T sets in order to obtain a reliable
136
Table 6.1: Results of the experiments using pairs of training (D) and test (T) sets. Elobag, Ebag
and ESVM stand respectively for the estimated error of lobag, bagged and single SVMs on
the test set T. The three last columns show the confidence level according to the McNemar
test. L/B, L/S and B/S stand respectively for the comparisons Lobag/Bagging,
Lobag/Single SVM and Bagging/Single SVM. If the confidence level is equal to 1, no
significant difference is registered.

Data set           Kernel   Elobag   Ebag     ESVM     L/B     L/S     B/S
P2                 Polyn.   0.1735   0.2008   0.2097   0.001   0.001   0.001
P2                 Gauss.   0.1375   0.1530   0.1703   0.001   0.001   0.001
Waveform           Linear   0.0740   0.0726   0.0939   1       1       0.001
Waveform           Polyn.   0.0693   0.0707   0.0724   0.001   0.1     0.001
Waveform           Gauss.   0.0601   0.0652   0.0692   0.001   0.1     0.001
Grey-Landsat       Linear   0.0540   0.0540   0.0650   1       1       0.1
Grey-Landsat       Polyn.   0.0400   0.0440   0.0480   0.001   0.1     0.1
Grey-Landsat       Gauss.   0.0435   0.0470   0.0475   0.001   1       1
Letter-Two         Linear   0.0881   0.0929   0.1011   1       0.025   1
Letter-Two         Polyn.   0.0701   0.0717   0.0831   0.05    1       1
Letter-Two         Gauss.   0.0668   0.0717   0.0799   1       1       1
Letter-Two+noise   Linear   0.3535   0.3518   0.3747   0.05    0.05    0.025
Letter-Two+noise   Polyn.   0.3404   0.3715   0.3993   0.05    0.1     1
Letter-Two+noise   Gauss.   0.3338   0.3764   0.3829   0.1     0.1     1
Spam               Linear   0.1408   0.1352   0.1760   0.05    0.1     0.005
Spam               Polyn.   0.0960   0.1034   0.1069   0.001   0.025   0.001
Spam               Gauss.   0.1130   0.1256   0.1282   0.001   1       1
Musk               Linear   0.1291   0.1291   0.1458   1       0.001   0.05
Musk               Polyn.   0.1018   0.1157   0.1154   0.001   0.001   1
Musk               Gauss.   0.0985   0.1036   0.0936   0.001   1       0.05
estimate of the generalization error: the number of examples in D was set to 100, while
the size of T ranged from a few thousand for the "real" data sets to ten thousand for the
synthetic data sets. Then we applied the Lobag algorithm described in Sect. 6.3, setting the
number of samples bootstrapped from D to 100, and performing an out-of-bag estimate of
the bias–variance decomposition of the error. The selected lobag, bagged and single SVMs
were finally tested on the separated test set T .
Then, using a different experimental set-up, we divided the data into separated training
D and test T sets. We drew 30 data sets Di from D, each consisting of 100 examples
drawn uniformly with replacement. We then applied the lobag algorithm described in
Sect. 6.3 to each of the Di, setting the number of examples bootstrapped from each Di to
100, and averaging both the out-of-bag estimate of the error and the error estimated on
the separated test set T.
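The out-of-bag error estimate used in this set-up can be sketched as follows; the data representation and the `train` routine are hypothetical placeholders, not the actual NEURObjects implementation:

```python
import random

def oob_error(data, train, n_boot=100, seed=0):
    # Out-of-bag estimate: each bootstrap replicate is trained on the drawn
    # sample and evaluated only on the examples left out of that sample.
    rng = random.Random(seed)
    errors = []
    for _ in range(n_boot):
        sample_ids = [rng.randrange(len(data)) for _ in data]
        oob_ids = [i for i in range(len(data)) if i not in set(sample_ids)]
        if not oob_ids:
            continue  # this replicate left no example out
        model = train([data[i] for i in sample_ids])
        wrong = sum(model(x) != y for x, y in (data[i] for i in oob_ids))
        errors.append(wrong / len(oob_ids))
    return sum(errors) / len(errors)
```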
We developed new C++ classes and applications using the NEURObjects library [184] to
implement the lobag algorithm and to analyze the results.
6.4.2
Results
Table 6.1 shows the results of the experiments with small training sets D and large test
sets T. We measured 20 outcomes for each method: 7 data sets and 3 kernels (gaussian,
polynomial, and dot-product) applied to each data set, except P2, for which we did not apply
the dot-product kernel (because it was obviously inappropriate). For each pair of methods,
we applied the McNemar test [42] to determine whether there was a significant difference in
predictive accuracy on the test set.
On nearly all the data sets, both bagging and Lobag outperform the single SVMs independently of the kernel used. The null hypothesis that Lobag has the same error rate as a
single SVM is rejected at or below the 0.1 significance level in 17 of the 20 cases. Similarly,
the null hypothesis that bagging has the same error rate as a single SVM is rejected at or
below the 0.1 level in 13 of the 20 cases.
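For reference, the McNemar statistic on paired test-set predictions can be computed as in the following sketch (the counts n01 and n10 are the examples misclassified by exactly one of the two classifiers; 3.841 is the chi-square critical value with one degree of freedom at the 0.05 level):

```python
def mcnemar_statistic(n01, n10):
    # Chi-square statistic with continuity correction; n01 and n10 count the
    # examples misclassified by only the first or only the second classifier.
    if n01 + n10 == 0:
        return 0.0
    return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)

def differ_significantly(n01, n10, critical=3.841):
    # Reject the null hypothesis of equal error rates at the 0.05 level.
    return mcnemar_statistic(n01, n10) > critical
```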
Most importantly, Lobag generally outperforms standard bagging. Lobag is statistically
significantly better than bagging in 9 of the 20 cases, and significantly inferior only once.
These experiments are also shown graphically in Fig. 6.1. In this figure, each pair of points
(joined by a line) corresponds to one of the 20 cases. The x coordinate of the point is the
error rate of Lobag, the y coordinate is the error rate of either a single SVM (for the “star”
shapes) or of standard bagging (for the “+” shapes). The line y = x is plotted as well.
Points above the line correspond to cases where Lobag had a smaller error rate. In most
cases, the “star” is above the “+”, which indicates that bagging had lower error than a
single SVM.
Tab. 6.2 summarizes the results of the comparison between bagging, lobag and single SVMs,
according to the second experimental set-up (Sect. 6.4.1), using directly the out-of-bag
estimate of the generalization error, averaged over the 30 different splits of the data. On
Figure 6.1: Graphical comparison of Lobag, bagging, and single SVM (x axis: lobag error;
y axis: bagging error, "+" marks, and single SVM error, star marks).
all the data sets both bagging and lobag outperform the single SVM, independently of
the kernel used. The null hypothesis (no difference between the considered classifiers) is
rejected at the 0.01 confidence level according to the resampled paired t test.
Moreover, Lobag compares favorably with bagging. The average relative error reduction
with respect to single SVMs is about 23% for lobag and 18% for bagged ensembles
of SVMs. Using SVMs with gaussian kernels as base learners, the difference in accuracy
between lobag and bagging is significant at the 0.01 confidence level on all 7 data sets.
We achieve the same results with polynomial kernels, except for the Grey-Landsat data
set, where the difference is significant only at the 0.05 level. With linear kernels there is no
statistically significant difference on the Waveform, Grey-Landsat and Musk data sets. Using
the separated test sets to evaluate the generalization error, the differences between bagging,
lobag and single SVMs become less significant, but also in this case lobag tends to slightly
outperform bagging.
The outcomes of the second experimental approach confirm the results of the first one,
even if they must be considered with caution, as the resampled t test suffers from a relatively
large type I error and can consequently detect a difference where none
exists [42].
The results show that despite the ability of SVMs to manage the bias–variance tradeoff,
Table 6.2: Comparison of the results of lobag, bagging and single SVMs. Elobag, Ebag
and ESVM stand respectively for the average error of lobag, bagged and single SVMs. r.e.r.
stands for the relative error reduction of lobag with respect to single SVMs (L/S) and of
bagging with respect to single SVMs (B/S).

Data set           Kernel       Elobag            Ebag              ESVM              r.e.r. L/S   r.e.r. B/S
P2                 Polynomial   0.1593 ± 0.0293   0.1753 ± 0.0323   0.2161 ± 0.0321   26.28        18.88
P2                 Gaussian     0.1313 ± 0.0337   0.1400 ± 0.0367   0.1887 ± 0.0282   30.41        25.80
Waveform           Linear       0.0713 ± 0.0312   0.0716 ± 0.0318   0.0956 ± 0.0307   25.41        25.10
Waveform           Polynomial   0.0520 ± 0.0210   0.0597 ± 0.0214   0.0695 ± 0.0200   25.17        14.10
Waveform           Gaussian     0.0496 ± 0.0193   0.0553 ± 0.0204   0.0668 ± 0.0198   25.74        17.21
Grey-Landsat       Linear       0.0483 ± 0.0252   0.0487 ± 0.0252   0.0570 ± 0.0261   15.26        14.56
Grey-Landsat       Polynomial   0.0413 ± 0.0252   0.0430 ± 0.0257   0.0472 ± 0.0257   12.50        8.89
Grey-Landsat       Gaussian     0.0360 ± 0.0209   0.0390 ± 0.0229   0.0449 ± 0.0221   19.82        13.14
Letter-Two         Linear       0.0890 ± 0.0302   0.0930 ± 0.0310   0.1183 ± 0.0281   24.76        21.38
Letter-Two         Polynomial   0.0616 ± 0.0221   0.0656 ± 0.0247   0.0914 ± 0.0233   32.60        28.22
Letter-Two         Gaussian     0.0553 ± 0.0213   0.0597 ± 0.0238   0.0875 ± 0.0244   36.80        31.77
Letter-Two+noise   Linear       0.2880 ± 0.0586   0.2993 ± 0.0604   0.3362 ± 0.0519   14.34        10.97
Letter-Two+noise   Polynomial   0.2576 ± 0.0549   0.2756 ± 0.0633   0.3122 ± 0.0502   17.49        11.72
Letter-Two+noise   Gaussian     0.2580 ± 0.0560   0.2706 ± 0.0607   0.3064 ± 0.0512   15.79        11.68
Spam               Linear       0.1273 ± 0.0374   0.1353 ± 0.0400   0.1704 ± 0.0423   25.29        20.59
Spam               Polynomial   0.1073 ± 0.0379   0.1163 ± 0.0400   0.1407 ± 0.0369   23.74        17.34
Spam               Gaussian     0.1120 ± 0.0352   0.1190 ± 0.0380   0.1392 ± 0.0375   19.54        14.51
Musk               Linear       0.1250 ± 0.0447   0.1250 ± 0.0447   0.1612 ± 0.0446   22.45        22.45
Musk               Polynomial   0.0960 ± 0.0331   0.1070 ± 0.0364   0.1295 ± 0.0357   25.87        17.37
Musk               Gaussian     0.0756 ± 0.0252   0.0793 ± 0.0253   0.0948 ± 0.0247   20.25        16.35
SVM performance can generally be improved by bagging, at least for small training sets.
Furthermore, the best way to tune the SVM parameters is to adjust them to minimize bias
and then allow bagging to reduce variance.
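A minimal sketch of this tuning-then-bagging recipe follows; the out-of-bag bias estimate and the base-learner training routine are left as hypothetical placeholders (the actual algorithm is the Lobag procedure of Sect. 6.3):

```python
import random
from collections import Counter

def lobag_train(data, param_grid, oob_bias, train, n_bags=100, seed=0):
    # Pick the parameters that minimize the out-of-bag bias estimate,
    # then bag the corresponding low-bias model to reduce variance.
    rng = random.Random(seed)
    best = min(param_grid, key=lambda p: oob_bias(p, data))
    ensemble = []
    for _ in range(n_bags):
        sample = [rng.choice(data) for _ in data]  # bootstrap replicate
        ensemble.append(train(best, sample))
    return ensemble

def predict(ensemble, x):
    # Majority vote of the bagged low-bias base learners.
    return Counter(model(x) for model in ensemble).most_common(1)[0][0]
```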
6.5
Application of lobag to DNA microarray data analysis
As an application of Lobag to "real world" problems, we consider a challenging classification
problem in functional bioinformatics. In particular, we applied Lobag to the analysis of
DNA microarray data, in order to carry out a preliminary evaluation of the effectiveness
of the proposed ensemble method on small-sized, high-dimensional data that are also
characterized by a large biological variability.
DNA hybridization microarrays [57, 127] supply information about gene expression through
measurements of the mRNA levels of a large number of genes in a cell. After extracting mRNA
samples from the cells, preparing and marking the targets with fluorescent dyes, hybridizing
with the probes printed on the microarrays and scanning the microarrays with a laser
beam, the obtained TIFF images are processed with image analysis computer programs to
translate the images into sets of fluorescent intensities proportional to the mRNA levels of
the analyzed samples. After preprocessing and normalization stages, gene expression data
of different cells or different experimental/functional conditions are collected in matrices
for numerical processing: each row corresponds to the gene expression levels of a specific
gene relative to all the examples, and each column corresponds to the expression data of
all the considered genes relative to a specific cell example. Typically thousands of genes
are used and analyzed for each microarray experiment.
Several supervised methods have been applied to the analysis of cDNA microarrays and
high density oligonucleotide chips. These methods include decision trees, Fisher linear
discriminant, multi-layer perceptrons (MLPs), nearest-neighbor classifiers, linear discriminant
analysis, Parzen windows and others [22, 53, 75, 101, 146]. In particular, Support
Vector Machines are well suited to manage and classify high-dimensional data [187], as
microarray data usually are, and have recently been applied to the classification of normal
and malignant tissues, using dot-product (linear) kernels [67] or polynomial and gaussian
kernels [179]. These types of kernels have
also been successfully applied to the separation of functional classes of yeast genes using
microarray expression data [22].
Furthermore, ensembles of learning machines are well-suited for gene expression data analysis, as they can reduce the variance due to the low cardinality of the available training
sets, and the bias due to specific characteristics of the learning algorithm [43]. Indeed,
in recent works, combinations of binary classifiers (one-versus-all and all-pairs) and Error
Correcting Output Coding (ECOC) ensembles of MLP, as well as ensemble methods based
on resampling techniques, such as bagging and boosting, have been applied to the analysis
of DNA microarray data [192, 158, 54, 178, 185].
6.5.1
Data set and experimental set-up.
We used DNA microarray data available on-line. In particular we used the GCM data set
obtained from the Whitehead Institute, Massachusetts Institute of Technology Center for
Genome Research [158]. It consists of 300 human normal and tumor tissue specimens
spanning 14 different malignant classes: 190 tumor samples belonging to the 14 classes,
plus 20 poorly differentiated tumor samples and 90 normal samples.
We grouped together the 14 different tumor classes and the poorly differentiated tumor
samples, reducing the multi-class classification problem to a dichotomy that separates
normal from malignant tissues. The 300 samples, sequentially hybridized to oligonucleotide
microarrays, contain a total of 16063 probe sets (genes or ESTs), and we performed a
stratified random splitting of these data into a training set and a test set of equal size. We
preprocessed the raw data using thresholding, filtering and normalization methods, as
suggested in [158]. The performances of Lobag ensembles of SVMs were compared with a
standard bagging approach and with single SVMs, using subsets of genes selected through
a simple feature-filtering method.
6.5.2
Gene selection.
We used a simple filter method (a gene selection method applied before, and independently
of, the induction algorithm), originally proposed in [75]. The mean gene expression
values across all the positive (µ+) and negative (µ−) examples are computed separately for
each gene, together with the corresponding standard deviations (σ+ and σ−). Then the
following statistic (a sort of signal-to-noise ratio) ci is computed:
ci = (µ+ − µ−) / (σ+ + σ−)    (6.2)
The larger the distance between the mean values with respect to the sum of the spreads of
the corresponding values, the more the gene is related to the discrimination of the positive
and negative classes. The genes are then ranked according to their ci value, and the first and
last m genes are selected. The main problem of this approach is the underlying assumption
that the expression patterns of the genes are independent: indeed, it fails to detect the role
of coordinately expressed genes in carcinogenic processes. Eq. 6.2 can also be used to compute
the weights for weighted gene voting [75], a minor variant of diagonal linear discriminant
analysis [54].
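This filter can be sketched in a few lines; the matrix layout (genes as rows, examples as columns) follows the description above, and the sketch assumes non-constant expression values within each class so that the denominator of eq. 6.2 never vanishes:

```python
from statistics import mean, pstdev

def signal_to_noise(pos, neg):
    # Eq. 6.2: ci = (mu+ - mu-) / (sigma+ + sigma-) for one gene.
    return (mean(pos) - mean(neg)) / (pstdev(pos) + pstdev(neg))

def select_genes(expression, labels, m):
    # Rank genes by ci and keep the m highest- and m lowest-ranked ones.
    scores = []
    for i, row in enumerate(expression):
        pos = [v for v, y in zip(row, labels) if y > 0]
        neg = [v for v, y in zip(row, labels) if y <= 0]
        scores.append((signal_to_noise(pos, neg), i))
    scores.sort(reverse=True)
    return [i for _, i in scores[:m]] + [i for _, i in scores[-m:]]
```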
With the GCM data set we applied a permutation test to automatically select a set of
marker genes. It is a gene-specific variant of the neighborhood analysis proposed in [75]:
1. Calculate for each gene the signal-to-noise ratio (eq. 6.2)
2. Perform a gene-specific random permutation test:
(a) Generate n random permutations of the class labels computing each time the
signal-to-noise ratio for each gene.
(b) Select a p significance level (e.g. 0 < p < 0.1)
(c) If the randomized signal-to-noise ratio is larger than the actual one in fewer than
p · n random permutations, select that gene as significant for discrimination at
the p significance level.
This is a simple method to estimate the significance of the match between a given phenotype
and a particular set of marker genes: its time complexity is O(nd), where n is the number
of examples and d the number of features (genes). Moreover, the permutation test is
distribution-independent: no assumptions are made about the functional form of the gene
distribution.
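The two steps above can be sketched as follows (a pure-Python illustration; the variable names are ours, and distinct expression values are assumed so that the standard deviations never vanish under permutation):

```python
import random
from statistics import mean, pstdev

def s2n(values, labels):
    # Signal-to-noise ratio of eq. 6.2 for a single gene.
    pos = [v for v, y in zip(values, labels) if y > 0]
    neg = [v for v, y in zip(values, labels) if y <= 0]
    return (mean(pos) - mean(neg)) / (pstdev(pos) + pstdev(neg))

def gene_is_significant(values, labels, p=0.01, n=1000, seed=0):
    # Step 1: the actual signal-to-noise ratio of the gene.
    actual = s2n(values, labels)
    # Step 2: compare it against n random permutations of the class labels;
    # the gene is selected if fewer than p*n permutations exceed the actual ratio.
    rng = random.Random(seed)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n):
        rng.shuffle(shuffled)
        if s2n(values, shuffled) > actual:
            exceed += 1
    return exceed < p * n
```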
6.5.3
Results.
Using the above gene-specific neighborhood analysis, we selected 592 genes correlated with
tumoral examples (p = 0.01) (set A) and about 3000 genes correlated with normal examples
(p = 0.01) (set B). We then used the genes of set A and the 592 genes of set B with the
highest signal-to-noise ratio values to assemble a selected set composed of 1184 genes. The
results of the classification with single SVMs, with and without gene selection, are
summarized in Tab. 6.3.
Table 6.3: GCM data set: results with single SVMs

Kernel type and parameters   Err. all genes   Err. sel. genes   Relative err. red.
Dot-product, C=20            0.2600           0.2279            12.31 %
Polynomial, deg=6, C=5       0.7000           0.2275            —
Polynomial, deg=2, C=10      0.6900           0.2282            —
Gaussian, σ=2, C=50          0.3000           0.2185            27.33 %
There is a significant increase in accuracy when using only a selected subset of genes for
classification. According to the McNemar test [44], in all cases there is a statistically
significant difference at the 0.05 confidence level between SVMs trained with and without
feature selection. Polynomial kernels without feature selection fail to separate normal from
malignant tissues.
Tab. 6.4 summarizes the results of bagged SVMs on the GCM data set.

Table 6.4: GCM data set: compared results of single and bagged SVMs

Kernel type and parameters   Error SVMs   Error bagged   Relative err. red.
Dot-product, C=10            0.2293       0.2200         4.06 %
Dot-product, C=20            0.2279       0.2133         6.41 %
Polynomial, degree=6, C=5    0.2275       0.2000         12.09 %
Polynomial, degree=2, C=10   0.2282       0.2133         6.53 %
Gaussian, sigma=2, C=50      0.2185       0.2067         5.40 %
Gaussian, sigma=10, C=200    0.2233       0.2067         7.44 %

Even if there is not always a statistically significant difference (according to the McNemar
test) between single and bagged SVMs, in all cases bagged ensembles of SVMs outperform
single SVMs. The degree of improvement depends heavily on the possibility of reducing the
variance component of the error, as bagging is mainly a variance-reduction ensemble method.
Indeed, performing a bias–variance analysis of the error of single SVMs on the GCM
data set, we note that bias largely dominates the variance components of the error, and
in this case we cannot expect a very large reduction of the error with bagging (Fig. 6.2).
Nonetheless, with linear (Fig. 6.2 a), polynomial (Fig. 6.2 b) and gaussian (Fig. 6.2 c)
SVMs, the minimum of the estimated error and the minimum of the estimated bias are
achieved for different learning parameters, showing that in this case Lobag could improve
the performance, even if a large reduction of the overall error cannot be expected.
Indeed, with Lobag the error is lowered with respect to both single and bagged SVMs
(Tab. 6.5). As expected, both bagged and lobag ensembles of SVMs outperform single
SVMs, but with lobag the error reduction is significant at the 0.05 confidence level,
according to the McNemar test, for all the applied kernels, while for bagging it is significant
only for the polynomial kernel. Moreover, Lobag always outperforms bagging, even if the
error reduction is significant only when linear or polynomial kernels are used. Summarizing,
Lobag achieves significant improvements with respect to single SVMs in analyzing DNA
microarray data, and also lowers the error with respect to classical bagging.
Even if these results seem quite encouraging, they must be considered only as preliminary:
we need more experiments, using different data sets and more reliable cross-validated
estimates of the error, in order to evaluate more carefully the applicability of the lobag
method to DNA microarray data analysis. Moreover, we also need to assess the quality of
the classifiers using, for instance, ROC curves or appropriate quality measures, as shown
in [76].

Figure 6.2: GCM data set: bias–variance decomposition of the error into bias, net-variance,
unbiased and biased variance, while varying the regularization parameter C with linear
SVMs (a), the polynomial degree with polynomial kernels (b), and the kernel parameter σ
with gaussian SVMs (c, with C=20). Each panel plots the average error, bias, net variance,
unbiased and biased variance.
Table 6.5: GCM data set: compared results of single, bagged and Lobag SVMs on gene
expression data. An asterisk in the last three columns indicates that a statistically
significant difference is registered (p = 0.05) according to the McNemar test.

Kernel type   Error SVMs   Error bagged   Error Lobag   Err. red. SVM → bag   Err. red. SVM → Lobag   Err. red. bag → Lobag
Dot-product   0.2279       0.2133         0.1933        6.41 %                15.18 % *               9.38 % *
Polynomial    0.2275       0.2000         0.1867        12.09 % *             17.93 % *               6.65 % *
Gaussian      0.2185       0.2067         0.1933        5.40 %                11.53 % *               6.48 %
Conclusions
What I wanted to say, I don't know,
but I'm right, and the facts,
they thing me.
Palmiro Cangini
Research on ensemble methods has focused on the combination/aggregation of learning
machines, while the specific characteristics of the base learners that build them up have been
only partially considered. We, on the contrary, started from the learning properties and the
behavior of the learning algorithms used to generate the base predictors, in order to build
around them ensemble methods well tuned to their learning characteristics.
To this purpose we showed that bias–variance theory provides a way to analyze the behavior
of learning algorithms and to explain the properties of ensembles of classifiers. Moreover
we showed that the analysis of the bias–variance decomposition of the error can identify
the situations in which ensemble methods might improve base learner performances.
We conducted an extended bias–variance analysis of the error in single SVMs (Chap. 4)
and in bagged and random aggregated ensembles of SVMs (Chap. 5), involving the training
and testing of over 10 million SVMs, in order to gain insights into the way single SVMs and
ensembles of SVMs learn. To this purpose we developed procedures to measure bias and
variance in classification problems according to Domingos' bias–variance theory.
In particular, we performed an analysis of bias and variance in single SVMs, considering
gaussian, polynomial and dot-product kernels. The relationships between the parameters of
the kernel and bias, net-variance, unbiased and biased variance were studied, discovering
regular patterns and specific trends. We provided a characterization of the bias–variance
decomposition of the error, showing that with gaussian kernels we can identify at least three
different regions with respect to the σ (spread) parameter, while with polynomial kernels the
U shape of the error can be determined by the combined effects of bias and unbiased
variance. The analysis also revealed that the expected trade-off between bias and variance
holds only for dot-product kernels, while the other kernels showed more complex
relationships. We discovered that the minima of bias, variance and overall error are often
achieved for different values of the regularization and kernel parameters, as a result of the
different learning behaviors of the trained SVMs.
In line with Breiman's theoretical results, we showed that bagging can be interpreted as
an approximation of random aggregating, that is, a process by which base learners, trained
on samples drawn according to an unknown probability distribution from the entire
universe population, are aggregated through majority voting or by averaging their outputs.
Our experiments showed that the theoretical property of a very large variance reduction
holds for random aggregating, while for bagging we registered a smaller reduction of the
variance, rather than the near-total elimination observed with random aggregating.
Bias–variance analysis in random aggregated SVM ensembles also suggested aggregating
ensembles of SVMs for very large scale data mining problems using undersampled bagging.
Unfortunately, there was not enough time to pursue this promising research line.
On the basis of the information supplied by bias–variance analysis, we proposed two research
lines for designing ensembles of SVMs. The first applies bias–variance analysis to construct
a heterogeneous, diverse set of low-bias classifiers. The second presents an ensemble
method, Lobag, that selects low-bias base learners (well-tuned, low-bias SVMs)
and then combines them through bagging. The key bias–variance evaluation is
performed through an efficient out-of-bag estimate of the bias–variance decomposition of
the error. This approach acts on both bias, through the selection of low-bias base learners,
and variance, through bootstrap aggregation of the selected low-bias base learners.
Numerical experiments showed that low-bias bagged ensembles of SVMs compare favorably
both to single SVMs and to bagged SVM ensembles, and preliminary experiments with DNA
microarray data suggested that this approach might be effective with high-dimensional,
low-sized data, as gene expression data usually are.
Open questions, related to some topics only partially developed in this thesis, delineate
possible future works and developments.
In our research plan we intended to carry out a bias–variance analysis for ensemble
methods based on resampling techniques. However, we performed only a bias–variance
analysis of bagged SVMs; we plan to perform the same analysis on boosted ensembles
of SVMs, in order to gain insights into the behavior of boosting with "strong" well-tuned
SVMs, comparing them with "weak" not-optimally-tuned SVMs.
We showed that bias–variance analysis is an effective tool for designing new ensemble
methods tuned to the specific bias–variance characteristics of the base learners. In
particular, "strong" base learners such as SVMs work well with lobag. We expect this to
hold for base learners that exhibit relatively large variance and low bias, especially with
relatively small data sets. Hence we plan to experiment with other low-bias base learners
(e.g. multi-layer perceptrons), in order to gain insights into their learning behavior and to
evaluate whether we can apply them with Lobag, or whether we can design other
base-learner-specific ensemble methods.
In order to speed up the computation, we plan to implement variants of the basic Lobag
algorithm. For instance, we could apply multidimensional search methods, such as Powell's
method, to select the tuning values that minimize the bias.
In our experiments we did not consider noise, but it is present in most real data sets. As a
result, noise is embodied into the bias, and the bias itself is overestimated. Even if the
evaluation of noise in real data sets is an open problem, we plan to evaluate the role of noise
in synthetic and real data sets, in order to develop variants of lobag specific to noisy data.
The peculiar characteristics of Lobag, and the preliminary results of its application to DNA
microarray data, encourage us to continue along this research line (Sect. 6.5). In particular,
we plan to perform an extended experimental analysis with high-dimensional, low-sized gene
expression data, evaluating Lobag with respect to single SVMs (largely applied in
bioinformatics) and to other ensemble methods (for instance, bagging and boosting),
carefully assessing the quality and reliability of the classifiers.
We provided only high-level algorithmic schemes for heterogeneous ensembles of SVMs
(Sect. 6.1). We plan to design and implement these algorithms, possibly integrating this
approach with an explicit evaluation of the diversity of the base learners, using measures
and approaches similar to those proposed by Kuncheva [121].
In our experiments with bagged and random aggregated ensembles of SVMs we used
relatively small, fixed-size bootstrap samples. A natural development of these experiments
could be to explicitly consider the cardinality of the data, setting up a series of experiments
with an increasing number of examples for each randomly drawn data set, in order to
evaluate the effect of the sample size on the bias, variance and instability of the base
learners.
Experiments with random aggregated ensembles of SVMs showed that we could use
undersampled bagging with large data sets in order to obtain a large reduction of the
unbiased variance without a significant increase in bias (Sect. 5.4). We plan to develop this
approach, also in relation to the above research on the effect of the cardinality of the data
in random aggregating. The main goal of this research line is the development of ensemble
methods for very large data mining problems.
In our experiments we did not explicitly consider the characteristics of the data.
Nonetheless, as we expected and as our experiments suggested, different data characteristics
influence the bias–variance patterns of learning machines. To this purpose we plan to
explicitly analyze the relationships between the bias–variance decomposition of the error
and data characteristics, using data complexity measures based on geometrical and
topological characteristics of the data [126, 84].
Bibliography
[1] D. Aha and R. Bankert. Cloud classification using error-correcting output codes. In
Artificial Intelligence Applications: Natural Science, Agriculture and Environmental
Science, volume 11, pages 13–28. 1997.
[2] E.L. Allwein, R.E. Schapire, and Y. Singer. Reducing multiclass to binary: a unifying
approach for margin classifiers. Journal of Machine Learning Research, 1:113–141,
2000.
[3] E. Alpaydin and E. Mayoraz. Learning error-correcting output codes from data. In
ICANN’99, pages 743–748, Edinburgh, UK, 1999.
[4] R. Anand, G. Mehrotra, C.K. Mohan, and S. Ranka. Efficient classification for
multiclass problems using modular neural networks. IEEE Transactions on Neural
Networks, 6:117–124, 1995.
[5] T. Andersen, M. Rimer, and T. R. Martinez. Optimal artificial neural network
architecture selection for voting. In Proceedings of the IEEE International Joint
Conference on Neural Networks IJCNN’01, pages 790–795. IEEE, 2001.
[6] G. Bakiri and T.G. Dietterich. Achieving high accuracy text-to-speech with machine
learning. In Data mining in speech synthesis. 1999.
[7] R. Battiti and A.M. Colla. Democracy in neural nets: Voting schemes for classification. Neural Networks, 7:691–707, 1994.
[8] E. Bauer and R. Kohavi. An empirical comparison of voting classification algorithms:
Bagging, boosting and variants. Machine Learning, 36(1/2):525–536, 1999.
[9] J. Benediktsson, J. Sveinsson, O. Ersoy, and P. Swain. Parallel consensual neural
networks. IEEE Transactions on Neural Networks, 8:54–65, 1997.
[10] J. Benediktsson and P. Swain. Consensus theoretic classification methods. IEEE
Transactions on Systems, Man and Cybernetics, 22:688–704, 1992.
[11] A. Berger. Error correcting output coding for text classification. In IJCAI’99: Workshop on machine learning for information filtering, 1999.
[12] C. M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, Oxford,
1995.
[13] A. Blum and R.L. Rivest. Training a 3-node neural network is NP-complete. In Proc.
of the 1988 Workshop on Computational Learning Theory, pages 9–18, San
Francisco, CA, 1988. Morgan Kaufmann.
[14] R.C. Bose and D.K. Ray-Chauduri. On a class of error correcting binary group codes.
Information and Control, (3):68–79, 1960.
[15] L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
[16] L. Breiman. Bias, variance and arcing classifiers. Technical Report TR 460, Statistics
Department, University of California, Berkeley, CA, 1996.
[17] L. Breiman. Arcing classifiers. The Annals of Statistics, 26(3):801–849, 1998.
[18] L. Breiman. Prediction games and arcing classifiers. Neural Computation,
11(7):1493–1517, 1999.
[19] L. Breiman. Random Forests. Machine Learning, 45(1):5–32, 2001.
[20] M. van Breukelen, R.P.W. Duin, D. Tax, and J.E. den Hartog. Combining classifiers
for the recognition of handwritten digits. In 1st IAPR TC1 Workshop on Statistical
Techniques in Pattern Recognition, pages 13–18, Prague, Czech Republic, 1997.
[21] G.J. Briem, J.A. Benediktsson, and J.R. Sveinsson. Boosting, Bagging and Consensus Based Classification of Multisource Remote Sensing Data. In J. Kittler and
F. Roli, editors, Multiple Classifier Systems. Second International Workshop, MCS
2001, Cambridge, UK, volume 2096 of Lecture Notes in Computer Science, pages
279–288. Springer-Verlag, 2001.
[22] M. Brown et al. Knowledge-based analysis of microarray gene expression data by using
support vector machines. PNAS, 97(1):262–267, 2000.
[23] I. Buciu, C. Kotropoulos, and I. Pitas. Combining Support Vector Machines for
Accurate Face Detection. In Proc. of ICIP’01, volume 1, pages 1054–1057, 2001.
[24] P. Buhlmann and B. Yu. Analyzing bagging. Annals of Statistics, 30:927–961, 2002.
[25] P. Chan and S. Stolfo. Meta-learning for multistrategy and parallel learning. In Proc.
2nd International Workshop on Multistrategy Learning, pages 150–165, 1993.
[26] P. Chan and S. Stolfo. A comparative evaluation of voting and meta-learning on
partitioned data. In Proc. 12th ICML, pages 90–98, 1995.
[27] D. Chen. Statistical estimates for Kleinberg's method of Stochastic Discrimination.
PhD thesis, The State University of New York, Buffalo, USA, 1998.
[28] K.J. Cherkauker. Human expert-level performance on a scientific image analysis task
by a system using combined artificial neural networks. In Chan P., editor, Working
notes of the AAAI Workshop on Integrating Multiple Learned Models, pages 15–21.
1996.
[29] S. Cho and J. Kim. Combining multiple neural networks by fuzzy integral for robust
classification. IEEE Transactions on Systems, Man and Cybernetics, 25:380–384,
1995.
[30] S. Cho and J. Kim. Multiple network fusion using fuzzy logic. IEEE Transactions
on Neural Networks, 6:497–501, 1995.
[31] S. Cohen and N. Intrator. A Hybrid Projection Based and Radial Basis Function
Architecture. In J. Kittler and F. Roli, editors, Multiple Classifier Systems. First
International Workshop, MCS 2000, Cagliari, Italy, volume 1857 of Lecture Notes in
Computer Science, pages 147–156. Springer-Verlag, 2000.
[32] S. Cohen and N. Intrator. Automatic Model Selection in a Hybrid Perceptron/Radial
Network. In Multiple Classifier Systems. Second International Workshop, MCS 2001,
Cambridge, UK, volume 2096 of Lecture Notes in Computer Science, pages 349–358.
Springer-Verlag, 2001.
[33] S. Cohen and N. Intrator. Hybrid Projection-based and Radial Basis Function Architecture: Initial Values and Global Optimisation. Pattern Analysis and Applications,
5(2):113–120, 2002.
[34] R. Collobert, S. Bengio, and Y. Bengio. A Parallel Mixture of SVMs for Very Large
Scale Problems. Neural Computation, 14(5):1105–1114, 2002.
[35] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273–297,
1995.
[36] K. Crammer and Y. Singer. On the learnability and design of output codes for
multiclass problems. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, pages 35–46, 2000.
[37] N. Cristianini and J. Shawe-Taylor. An introduction to Support Vector Machines and
other kernel-based learning methods. Cambridge University Press, Cambridge, UK,
2000.
[38] N.C. de Condorcet. Essai sur l’application de l’analyse à la probabilité des décisions
rendues à la pluralité des voix. Imprimerie Royale, Paris, 1785.
[39] A. Demiriz, K.P. Bennett, and J. Shawe-Taylor. Linear programming boosting via
column generation. Machine Learning, 46(1-3):225–254, 2002.
[40] P. Derbeko, R. El-Yaniv, and R. Meir. Variance Optimized Bagging. In Machine
Learning: ECML 2002, volume 2430 of Lecture Notes in Computer Science, pages
60–71. Springer-Verlag, 2002.
[41] T. G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning,
40(2):139–158, 2000.
[42] T.G. Dietterich. Approximate statistical tests for comparing supervised classification
learning algorithms. Neural Computation, 10(7):1895–1924, 1998.
[43] T.G. Dietterich. Ensemble methods in machine learning. In J. Kittler and F. Roli, editors, Multiple Classifier Systems. First International Workshop, MCS 2000, Cagliari,
Italy, volume 1857 of Lecture Notes in Computer Science, pages 1–15. Springer-Verlag, 2000.
[44] T.G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting and randomization. Machine Learning,
40(2):139–158, 2000.
[45] T.G. Dietterich and G. Bakiri. Error-correcting output codes: A general method
for improving multiclass inductive learning programs. In Proceedings of AAAI-91,
pages 572–577. AAAI Press / MIT Press, 1991.
[46] T.G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, (2):263–286,
1995.
[47] P. Domingos. A unified bias–variance decomposition. Technical report, Department
of Computer Science and Engineering, University of Washington, Seattle, WA, 2000.
[48] P. Domingos. A Unified Bias-Variance Decomposition and its Applications. In Proceedings of the Seventeenth International Conference on Machine Learning, pages
231–238, Stanford, CA, 2000. Morgan Kaufmann.
[49] P. Domingos. A Unified Bias-Variance Decomposition for Zero-One and Squared
Loss. In Proceedings of the Seventeenth National Conference on Artificial Intelligence,
pages 564–569, Austin, TX, 2000. AAAI Press.
[50] H. Drucker and C. Cortes. Boosting decision trees. In Advances in Neural Information
Processing Systems, volume 8. 1996.
[51] H. Drucker, C. Cortes, L. Jackel, Y. LeCun, and V. Vapnik. Boosting and other
ensemble methods. Neural Computation, 6(6):1289–1301, 1994.
[52] R.O. Duda and P.E. Hart. Pattern classification and scene analysis. Wiley & Sons,
New York, 1973.
[53] S. Dudoit, J. Fridlyand, and T. Speed. Comparison of Discrimination Methods for
the Classification of Tumors Using Gene Expression Data. Technical Report 576,
Department of Statistics, University of California, Berkeley, 2000.
[54] S. Dudoit, J. Fridlyand, and T. Speed. Comparison of discrimination methods for
the classification of tumors using gene expression data. JASA, 97(457):77–87, 2002.
[55] R.P.W. Duin and D.M.J. Tax. Experiments with Classifier Combination Rules. In
J. Kittler and F. Roli, editors, Multiple Classifier Systems. First International Workshop, MCS 2000, Cagliari, Italy, volume 1857 of Lecture Notes in Computer Science,
pages 16–29. Springer-Verlag, 2000.
[56] B. Efron and R. Tibshirani. An introduction to the Bootstrap. Chapman and Hall,
New York, 1993.
[57] M. Eisen and P. Brown. DNA arrays for analysis of gene expression. Methods
Enzymol., 303:179–205, 1999.
[58] S.E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. In D.S.
Touretzky, editor, Advances in Neural Information Processing Systems, volume 2,
pages 524–532. Morgan Kaufmann, San Mateo, CA, 1990.
[59] U.M. Fayyad, C. Reina, and P.S. Bradley. Initialization of iterative refinement clustering algorithms. In Proc. 14th ICML, pages 194–198, 1998.
[60] E. Filippi, M. Costa, and E. Pasero. Multi-layer perceptron ensembles for increased
performance and fault-tolerance in pattern recognition tasks. In IEEE International
Conference on Neural Networks, pages 2901–2906, Orlando, Florida, 1994.
[61] Y. Freund. Boosting a weak learning algorithm by majority. Information and Computation, 121(2):256–285, 1995.
[62] Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and
an application to boosting. Journal of Computer and Systems Sciences, 55(1):119–
139, 1997.
[63] Y. Freund and R.E. Schapire. Experiments with a new boosting algorithm. In
Proceedings of the 13th International Conference on Machine Learning, pages 148–
156. Morgan Kaufmann, 1996.
[64] J. Friedman. Greedy function approximation: A gradient boosting machine. The
Annals of Statistics, 29(5):1189–1232, 2001.
[65] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: A statistical
view of boosting. The Annals of Statistics, 28(2):337–374, 2000.
[66] J.H. Friedman. On bias, variance, 0/1 loss and the curse of dimensionality. Data
Mining and Knowledge Discovery, 1:55–77, 1997.
[67] T.S. Furey, N. Cristianini, N. Duffy, D. Bednarski, M. Schummer, and D. Haussler.
Support vector machine classification and validation of cancer tissue samples using
microarray expression data. Bioinformatics, 16(10):906–914, 2000.
[68] S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias-variance
dilemma. Neural Computation, 4(1):1–58, 1992.
[69] R. Ghani. Using error correcting output codes for text classification. In ICML 2000:
Proceedings of the 17th International Conference on Machine Learning, pages 303–
310, San Francisco, US, 2000. Morgan Kaufmann Publishers.
[70] J. Ghosh. Multiclassifier systems: Back to the future. In Multiple Classifier Systems.
Third International Workshop, MCS2002, Cagliari, Italy, volume 2364 of Lecture
Notes in Computer Science, pages 1–15. Springer-Verlag, 2002.
[71] G. Giacinto and F. Roli. Dynamic Classifier Fusion. In J. Kittler and F. Roli, editors, Multiple Classifier Systems. First International Workshop, MCS 2000, Cagliari,
Italy, volume 1857 of Lecture Notes in Computer Science, pages 177–189. Springer-Verlag, 2000.
[72] G. Giacinto and F. Roli. An approach to the automatic design of multiple classifier
systems. Pattern Recognition Letters, 22(1):25–33, 2001.
[73] G. Giacinto and F. Roli. Dynamic classifier selection based on multiple classifier
behaviour. Pattern Recognition, 34(9):1879–1881, 2001.
[74] G. Giacinto, F. Roli, and G. Fumera. Selection of classifiers based on multiple
classifier behaviour. In SSPR/SPR, pages 87–93, 2000.
[75] T.R. Golub et al. Molecular Classification of Cancer: Class Discovery and Class
Prediction by Gene Expression Monitoring. Science, 286:531–537, 1999.
[76] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene Selection for Cancer Classification using Support Vector Machines. Machine Learning, 46(1/3):389–422, 2002.
[77] L. Hansen and P. Salamon. Neural network ensembles. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 12(10):993–1001, 1990.
[78] T. Hastie and R. Tibshirani. Generalized Additive Models. Chapman and Hall,
London, 1990.
[79] T. Hastie and R. Tibshirani. Classification by pairwise coupling. The Annals of
Statistics, 26(1):451–471, 1998.
[80] T. Heskes. Bias/Variance Decomposition for Likelihood-Based Estimators. Neural
Computation, 10:1425–1433, 1998.
[81] T.K. Ho. The random subspace method for constructing decision forests. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 20(8):832–844, 1998.
[82] T.K. Ho. Complexity of Classification Problems and Comparative Advantages of
Combined Classifiers. In J. Kittler and F. Roli, editors, Multiple Classifier Systems.
First International Workshop, MCS 2000, Cagliari, Italy, volume 1857 of Lecture
Notes in Computer Science, pages 97–106. Springer-Verlag, 2000.
[83] T.K. Ho. Data Complexity Analysis for Classifiers Combination. In J. Kittler and
F. Roli, editors, Multiple Classifier Systems. Second International Workshop, MCS
2001, Cambridge, UK, volume 2096 of Lecture Notes in Computer Science, pages
53–67, Berlin, 2001. Springer-Verlag.
[84] T.K. Ho and M. Basu. Complexity measures of supervised classification problems.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3):289–300,
2002.
[85] T.K. Ho, J.J. Hull, and S.N. Srihari. Decision combination in multiple classifiers.
IEEE Trans. on Pattern Analysis and Machine Intelligence, 19(4):405–410, 1997.
[86] K. Hornik. Approximation capabilities of multilayer feedforward networks. Neural
Networks, 4:251–257, 1991.
[87] Y.S. Huang and C.Y. Suen. Combination of multiple experts for the recognition of
unconstrained handwritten numerals. IEEE Trans. on Pattern Analysis and Machine
Intelligence, 17:90–94, 1995.
[88] L. Hyafil and R.L. Rivest. Constructing optimal binary decision trees is NP-complete.
Information Processing Letters, 5(1):15–17, 1976.
[89] S. Impedovo and A. Salzo. A New Evaluation Method for Expert Combination in
Multi-expert System Designing. In J. Kittler and F. Roli, editors, Multiple Classifier
Systems. First International Workshop, MCS 2000, Cagliari, Italy, volume 1857 of
Lecture Notes in Computer Science, pages 230–239. Springer-Verlag, 2000.
[90] R.A. Jacobs. Methods for combining experts’ probability assessments. Neural Computation, 7:867–888, 1995.
[91] R.A. Jacobs, M.I. Jordan, S.J. Nowlan, and G.E. Hinton. Adaptive mixtures of local
experts. Neural Computation, 3(1):125–130, 1991.
[92] A. Jain, R. Duin, and J. Mao. Statistical pattern recognition: a review. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 22:4–37, 2000.
[93] G. James. Majority vote classifiers: theory and applications. PhD thesis, Department
of Statistics - Stanford University, Stanford, CA, 1998.
[94] G. James. Variance and bias for general loss functions. Machine Learning, 2003. (in
press).
[95] C. Ji and S. Ma. Combinations of weak classifiers. IEEE Trans. Neural Networks,
8(1):32–42, 1997.
[96] T. Joachims. Making large scale SVM learning practical. In B. Scholkopf, C. Burges,
and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages
169–184. MIT Press, Cambridge, MA, 1999.
[97] M. Jordan and R. Jacobs. Hierarchies of adaptive experts. In Advances in Neural
Information Processing Systems, volume 4, pages 985–992. Morgan Kaufmann, San
Mateo, CA, 1992.
[98] M.I. Jordan and R.A. Jacobs. Hierarchical mixtures of experts and the EM algorithm.
Neural Computation, 6:181–214, 1994.
[99] H. Kargupta, B. Park, D. Hershberger, and E. Johnson. Collective data mining: a new
perspective toward distributed data mining. In H. Kargupta and P. Chan, editors,
Advances in Distributed and Parallel Knowledge Discovery. MIT/AAAI Press,
1999.
[100] J.M. Keller, P. Gader, H. Tahani, J. Chiang, and M. Mohamed. Advances in fuzzy
integration for pattern recognition. Fuzzy Sets and Systems, 65:273–283, 1994.
[101] J. Khan et al. Classification and diagnostic prediction of cancers using gene expression
profiling and artificial neural networks. Nature Medicine, 7(6):673–679, 2001.
[102] H.C. Kim, S. Pang, H.M. Je, D. Kim, and S.Y. Bang. Pattern Classification Using
Support Vector Machine Ensemble. In Proc. of ICPR’02, volume 2, pages 20160–
20163. IEEE, 2002.
[103] F. Kimura and M. Shridar. Handwritten Numerical Recognition Based on Multiple
Algorithms. Pattern Recognition, 24(10):969–983, 1991.
[104] J. Kittler. Combining classifiers: a theoretical framework. Pattern Analysis and
Applications, (1):18–27, 1998.
[105] J. Kittler, M. Hatef, R.P.W. Duin, and J. Matas. On combining classifiers. IEEE
Trans. on Pattern Analysis and Machine Intelligence, 20(3):226–239, 1998.
[106] J. Kittler and F. Roli. Multiple Classifier Systems, First International Workshop,
MCS 2000, Cagliari, Italy, volume 1857 of Lecture Notes in Computer Science.
Springer-Verlag, Berlin, 2000.
[107] J. Kittler and F. Roli. Multiple Classifier Systems, Second International Workshop, MCS2001, Cambridge, UK, volume 2096 of Lecture Notes in Computer Science.
Springer-Verlag, Berlin, 2001.
[108] E.M. Kleinberg. On the Algorithmic Implementation of Stochastic Discrimination.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(5):473–490, 2000.
[109] E.M. Kleinberg. Stochastic Discrimination. Annals of Mathematics and Artificial
Intelligence, pages 207–239, 1990.
[110] E.M. Kleinberg. An overtraining-resistant stochastic modeling method for pattern
recognition. Annals of Statistics, 4(6):2319–2349, 1996.
[111] E.M. Kleinberg. A Mathematically Rigorous Foundation for Supervised Learning.
In J. Kittler and F. Roli, editors, Multiple Classifier Systems. First International
Workshop, MCS 2000, Cagliari, Italy, volume 1857 of Lecture Notes in Computer
Science, pages 67–76. Springer-Verlag, 2000.
[112] R. Kohavi and D.H. Wolpert. Bias plus variance decomposition for zero-one loss
functions. In Proc. of the Thirteenth International Conference on Machine Learning, pages 275–283,
Bari, Italy, 1996. Morgan Kaufmann.
[113] J. Kolen and J. Pollack. Back propagation is sensitive to initial conditions. In Advances in Neural Information Processing Systems, volume 3, pages 860–867. Morgan
Kaufmann, San Francisco, CA, 1991.
[114] E. Kong and T.G. Dietterich. Error-correcting output coding corrects bias and
variance. In The XII International Conference on Machine Learning, pages 313–321,
San Francisco, CA, 1995. Morgan Kaufmann.
[115] A. Krogh and J. Vedelsby. Neural network ensembles, cross validation and active
learning. In D.S. Touretzky, G. Tesauro, and T.K. Leen, editors, Advances in Neural
Information Processing Systems, volume 7, pages 107–115. MIT Press, Cambridge,
MA, 1995.
[116] L.I. Kuncheva. Genetic algorithm for feature selection for parallel classifiers. Information Processing Letters, 46:163–168, 1993.
[117] L.I. Kuncheva. An application of OWA operators to the aggregation of multiple
classification decisions. In The Ordered Weighted Averaging operators. Theory and
Applications, pages 330–343. Kluwer Academic Publisher, USA, 1997.
[118] L.I. Kuncheva, J.C. Bezdek, and R.P.W. Duin. Decision templates for multiple
classifier fusion: an experimental comparison. Pattern Recognition, 34(2):299–314,
2001.
[119] L.I. Kuncheva, F. Roli, G.L. Marcialis, and C.A. Shipp. Complexity of Data Subsets
Generated by the Random Subspace Method: An Experimental Investigation. In
J. Kittler and F. Roli, editors, Multiple Classifier Systems. Second International
Workshop, MCS 2001, Cambridge, UK, volume 2096 of Lecture Notes in Computer
Science, pages 349–358. Springer-Verlag, 2001.
[120] L.I. Kuncheva and C.J. Whitaker. Feature Subsets for Classifier Combination: An
Enumerative Experiment. In J. Kittler and F. Roli, editors, Multiple Classifier Systems. Second International Workshop, MCS 2001, Cambridge, UK, volume 2096 of
Lecture Notes in Computer Science, pages 228–237. Springer-Verlag, 2001.
[121] L.I. Kuncheva and C.J. Whitaker. Measures of diversity in classifier ensembles.
Machine Learning, 2003. (in press).
[122] L. Lam. Classifier combinations: Implementations and theoretical issues. In Multiple
Classifier Systems. First International Workshop, MCS 2000, Cagliari, Italy, volume
1857 of Lecture Notes in Computer Science, pages 77–86. Springer-Verlag, 2000.
[123] L. Lam and C. Sue. Optimal combination of pattern classifiers. Pattern Recognition
Letters, 16:945–954, 1995.
[124] L. Lam and C. Sue. Application of majority voting to pattern recognition: an
analysis of its behavior and performance. IEEE Transactions on Systems, Man and
Cybernetics, 27(5):553–568, 1997.
[125] W.B. Langdon and B.F. Buxton. Genetic programming for improved receiver operating characteristics. In J. Kittler and F. Roli, editors, Second International Conference
on Multiple Classifier System, volume 2096 of LNCS, pages 68–77, Cambridge, 2001.
Springer Verlag.
[126] M. Li and P. Vitanyi. An Introduction to Kolmogorov Complexity and Its Applications.
Springer-Verlag, Berlin, 1993.
[127] D.J. Lockhart and E.A. Winzeler. Genomics, gene expression and DNA arrays.
Nature, 405:827–836, 2000.
[128] R. Maclin and D. Opitz. An empirical evaluation of bagging and boosting. In
Fourteenth National Conference on Artificial Intelligence, pages 546–551, Providence,
USA, 1997. AAAI-Press.
[129] L. Mason, P. Bartlett, and J. Baxter. Improved generalization through explicit
optimization of margins. Machine Learning, 2000.
[130] F. Masulli and G. Valentini. Comparing decomposition methods for classification.
In R.J. Howlett and L.C. Jain, editors, KES’2000, Fourth International Conference
on Knowledge-Based Intelligent Engineering Systems & Allied Technologies, pages
788–791, Piscataway, NJ, 2000. IEEE.
[131] F. Masulli and G. Valentini. Effectiveness of error correcting output codes in multiclass learning problems. In Lecture Notes in Computer Science, volume 1857, pages
107–116. Springer-Verlag, Berlin, Heidelberg, 2000.
[132] F. Masulli and G. Valentini. Dependence among Codeword Bits Errors in ECOC
Learning Machines: an Experimental Analysis. In Lecture Notes in Computer Science, volume 2096, pages 158–167. Springer-Verlag, Berlin, 2001.
[133] F. Masulli and G. Valentini. Quantitative Evaluation of Dependence among Outputs in ECOC Classifiers Using Mutual Information Based Measures. In K. Marko
and P. Webos, editors, Proceedings of the International Joint Conference on Neural
Networks IJCNN’01, volume 2, pages 784–789, Piscataway, NJ, USA, 2001. IEEE.
[134] E. Mayoraz and M. Moreira. On the decomposition of polychotomies into dichotomies. In The XIV International Conference on Machine Learning, pages 219–
226, Nashville, TN, July 1997.
[135] C.J. Merz and P.M. Murphy. UCI repository of machine learning databases, 1998.
www.ics.uci.edu/mlearn/MLRepository.html.
[136] D. Modha and W.S. Spangler. Clustering hypertext with application to web searching.
In Proc. of the ACM Hypertext 2000 Conference, San Antonio, TX, 2000.
[137] M. Moreira and E. Mayoraz. Improved pairwise coupling classifiers with correcting classifiers. In C. Nedellec and C. Rouveirol, editors, Lecture Notes in Artificial
Intelligence, Vol. 1398, pages 160–171, Berlin, Heidelberg, New York, 1998.
[138] D.W. Opitz and J.W. Shavlik. Actively searching for an effective neural network
ensemble. Connection Science, 8(3/4):337–353, 1996.
[139] N.C. Oza and K. Tumer. Input Decimation Ensembles: Decorrelation through Dimensionality Reduction. In J. Kittler and F. Roli, editors, Multiple Classifier Systems. Second International Workshop, MCS 2001, Cambridge, UK, volume 2096 of
Lecture Notes in Computer Science, pages 238–247. Springer-Verlag, 2001.
[140] P.S. Pacheco. Parallel Programming with MPI. Morgan Kaufmann, San Francisco,
CA, 1997.
[141] H.S. Park and S.W. Lee. Off-line recognition of large-set handwritten characters
with multiple hidden Markov models. Pattern Recognition, 29(2):231–244, 1996.
[142] J. Park and I.W. Sandberg. Approximation and radial basis function networks.
Neural Computation, 5(2):305–316, 1993.
[143] B. Parmanto, P. Munro, and H. Doyle. Improving committee diagnosis with resampling techniques. In D.S. Touretzky, M. Mozer, and M. Hesselmo, editors, Advances
in Neural Information Processing Systems, volume 8, pages 882–888. MIT Press,
Cambridge, MA, 1996.
[144] B. Parmanto, P. Munro, and H. Doyle. Reducing variance of committee prediction
with resampling techniques. Connection Science, 8(3/4):405–416, 1996.
[145] D. Partridge and W.B. Yates. Engineering multiversion neural-net systems. Neural
Computation, 8:869–893, 1996.
[146] P. Pavlidis, J. Weston, J. Cai, and W.N. Grundy. Gene functional classification from
heterogenous data. In Fifth International Conference on Computational Molecular
Biology, 2001.
[147] M.P. Perrone and L.N. Cooper. When networks disagree: ensemble methods for
hybrid neural networks. In Mammone R.J., editor, Artificial Neural Networks for
Speech and Vision, pages 126–142. Chapman & Hall, London, 1993.
[148] W.W. Peterson and E.J. Weldon, Jr. Error correcting codes. MIT Press, Cambridge,
MA, 1972.
[149] W.H. Press, S.A. Teukolsky, W.T. Vetterling, and B.P. Flannery. Numerical Recipes
in C: The Art of Scientific Computing. Cambridge University Press, 1992.
[150] A. Prodromidis, P. Chan, and S. Stolfo. Meta-Learning in Distributed Data Mining
Systems: Issues and Approaches. In H. Kargupta and P. Chan, editors, Advances in
Distributed Data Mining, pages 81–113. AAAI Press, 1999.
[151] J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[152] G. Ratsch, B. Scholkopf, A.J. Smola, K-R. Muller, T. Onoda, and S. Mika. ν-Arc:
ensemble learning in the presence of outliers. In S. A. Solla, T.K. Leen, and K-R.
Muller, editors, Advances in Neural Information Processing Systems, volume 12. MIT
Press, Cambridge, MA, 2000.
[153] Y. Raviv and N. Intrator. Bootstrapping with noise: An effective regularization
technique. Connection Science, 8(3/4):355–372, 1996.
[154] G. Rogova. Combining the results of several neural network classifiers. Neural
Networks, 7:777–781, 1994.
[155] F. Roli and G. Giacinto. Analysis of linear and order statistics combiners for fusion
of imbalanced classifiers. In Multiple Classifier Systems. Third International Workshop, MCS2002, Cagliari, Italy, volume 2364 of Lecture Notes in Computer Science.
Springer-Verlag, 2002.
[156] F. Roli, G. Giacinto, and G. Vernazza. Methods for Designing Multiple Classifier
Systems. In J. Kittler and F. Roli, editors, Multiple Classifier Systems. Second
International Workshop, MCS 2001, Cambridge, UK, volume 2096 of Lecture Notes
in Computer Science, pages 78–87. Springer-Verlag, 2001.
[157] F. Roli and J. Kittler. Multiple Classifier Systems, Third International Workshop, MCS2002, Cagliari, Italy, volume 2364 of Lecture Notes in Computer Science.
Springer-Verlag, Berlin, 2002.
[158] S. Ramaswamy et al. Multiclass cancer diagnosis using tumor gene expression signatures. PNAS, 98(26):15149–15154, 2001.
[159] R. Schapire and Y. Singer. Boostexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135–168, 2000.
[160] R.E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197–227,
1990.
[161] R.E. Schapire. A brief introduction to boosting. In Thomas Dean, editor, 16th
International Joint Conference on Artificial Intelligence, pages 1401–1406. Morgan
Kaufmann, 1999.
[162] R.E. Schapire, Y. Freund, P. Bartlett, and W. Lee. Boosting the margin: A
new explanation for the effectiveness of voting methods. The Annals of Statistics,
26(5):1651–1686, 1998.
[163] R.E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated
predictions. Machine Learning, 37(3):297–336, 1999.
[164] H. Schwenk and Y. Bengio. Training methods for adaptive boosting of neural networks. In Advances in Neural Information Processing Systems, volume 10, pages
647–653. 1998.
[165] A. Sharkey, N. Sharkey, and G. Chandroth. Diverse neural net solutions to a fault
diagnosis problem. Neural Computing and Applications, 4:218–227, 1996.
[166] A. Sharkey, N. Sharkey, U. Gerecke, and G. Chandroth. The test and select approach to ensemble combination. In J. Kittler and F. Roli, editors, Multiple Classifier
Systems. First International Workshop, MCS 2000, Cagliari, Italy, volume 1857 of
Lecture Notes in Computer Science, pages 30–44. Springer-Verlag, 2000.
[167] A. Sharkey, editor. Combining Artificial Neural Nets: Ensemble and Modular Multi-Net Systems. Springer-Verlag, London, 1999.
[168] M. Skurichina and R.P.W. Duin. Bagging, boosting and the random subspace method
for linear classifiers. Pattern Analysis and Applications. (in press).
[169] M. Skurichina and R.P.W. Duin. Bagging for linear classifiers. Pattern Recognition,
31(7):909–930, 1998.
[170] M. Skurichina and R.P.W. Duin. Bagging and the Random Subspace Method for
Redundant Feature Spaces. In Multiple Classifier Systems. Second International
Workshop, MCS 2001, Cambridge, UK, volume 2096 of Lecture Notes in Computer
Science, pages 1–10. Springer-Verlag, 2001.
[171] A. Strehl and J. Ghosh. Cluster Ensembles - A Knowledge Reuse Framework for
Combining Multiple Partitions. Journal of Machine Learning Research, 3:583–617,
2002.
[172] C. Suen and L. Lam. Multiple classifier combination methodologies for different
output levels. In Multiple Classifier Systems. First International Workshop, MCS
2000, Cagliari, Italy, volume 1857 of Lecture Notes in Computer Science, pages 52–
66. Springer-Verlag, 2000.
[173] R. Tibshirani. Bias, variance and prediction error for classification rules. Technical
report, Department of Preventive Medicine and Biostatistics and Department of
Statistics, University of Toronto, Toronto, Canada, 1996.
[174] K. Tumer and J. Ghosh. Error correlation and error reduction in ensemble classifiers.
Connection Science, 8(3/4):385–404, 1996.
[175] K. Tumer and N.C. Oza. Decimated input ensembles for improved generalization. In
IJCNN-99, The IEEE-INNS-ENNS International Joint Conference on Neural Networks, 1999.
[176] G. Valentini. Upper bounds on the training error of ECOC-SVM ensembles. Technical
Report TR-00-17, DISI - Dipartimento di Informatica e Scienze dell’ Informazione
- Università di Genova, 2000. ftp://ftp.disi.unige.it/person/ValentiniG/papers/TR00-17.ps.gz.
[177] G. Valentini. Classification of human malignancies by machine learning methods using DNA microarray gene expression data. In G.M. Papadourakis, editor, Fourth International Conference Neural Networks and Expert Systems in Medicine and HealthCare, pages 399–408, Milos island, Greece, 2001. Technological Educational Institute
of Crete.
[178] G. Valentini. Gene expression data analysis of human lymphoma using support
vector machines and output coding ensembles. Artificial Intelligence in Medicine,
26(3):283–306, 2002.
[179] G. Valentini. Supervised gene expression data analysis using Support Vector Machines and Multi-Layer perceptrons. In Proc. of KES’2002, the Sixth International
Conference on Knowledge-Based Intelligent Information & Engineering Systems, special session Machine Learning in Bioinformatics, Amsterdam, the Netherlands, 2002.
IOS Press.
[180] G. Valentini. Bias–variance analysis in bagged svm ensembles: data and graphics.
DISI, Dipartimento di Informatica e Scienze dell’ Informazione, Università di Genova,
Italy, 2003. ftp://ftp.disi.unige.it/person/ValentiniG/papers/bv-svm-bagging.ps.gz.
[181] G. Valentini. Bias–variance analysis in random aggregated svm ensembles: data
and graphics. DISI, Dipartimento di Informatica e Scienze dell’ Informazione, Università di Genova, Italy, 2003. ftp://ftp.disi.unige.it/person/ValentiniG/papers/bv-svm-undersampling.ps.gz.
[182] G. Valentini and T.G. Dietterich. Bias–variance analysis and ensembles of SVM.
In Multiple Classifier Systems. Third International Workshop, MCS2002, Cagliari,
Italy, volume 2364 of Lecture Notes in Computer Science, pages 222–231. Springer-Verlag, 2002.
[183] G. Valentini and F. Masulli. Ensembles of learning machines. In Neural Nets WIRN02, volume 2486 of Lecture Notes in Computer Science, pages 3–19. Springer-Verlag,
2002.
[184] G. Valentini and F. Masulli. NEURObjects: an object-oriented library for neural
network development. Neurocomputing, 48(1–4):623–646, 2002.
[185] G. Valentini, M. Muselli, and F. Ruffino. Bagged Ensembles of SVMs for Gene
Expression Data Analysis. In IJCNN2003, The IEEE-INNS-ENNS International
Joint Conference on Neural Networks, Portland, USA, 2003.
[186] J. van Lint. Coding theory. Springer-Verlag, Berlin, 1971.
[187] V. N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
[188] D. Wang, J.M. Keller, C.A. Carson, K.K. McAdoo-Edwards, and C.W. Bailey. Use of
fuzzy logic inspired features to improve bacterial recognition through classifier fusion.
IEEE Transactions on Systems, Man and Cybernetics, 28B(4):583–591, 1998.
[189] D.H. Wolpert. Stacked Generalization. Neural Networks, 5:241–259, 1992.
[190] K. Woods, W.P. Kegelmeyer, and K. Bowyer. Combination of multiple classifiers
using local accuracy estimates. IEEE Trans. on Pattern Analysis and Machine Intelligence, 19(4):405–410, 1997.
[191] L. Xu, A. Krzyzak, and C. Suen. Methods of combining multiple classifiers and their
applications to handwriting recognition. IEEE Transactions on Systems, Man and
Cybernetics, 22(3):418–435, 1992.
[192] C. Yeang et al. Molecular classification of multiple tumor types. In ISMB 2001,
Proceedings of the 9th International Conference on Intelligent Systems for Molecular
Biology, pages 316–322, Copenhagen, Denmark, 2001. Oxford University Press.