Margin and Radius Based Multiple Kernel
Learning
Huyen Do, Alexandros Kalousis, Adam Woznica, and Melanie Hilario
University of Geneva, Computer Science Department
7, route de Drize, Battelle batiment A, 1227 Carouge, Switzerland
{Huyen.Do,Alexandros.Kalousis,Adam.Woznica,Melanie.Hilario}@unige.ch
Abstract. A serious drawback of kernel methods, and Support Vector
Machines (SVM) in particular, is the difficulty in choosing a suitable
kernel function for a given dataset. One of the approaches proposed to
address this problem is Multiple Kernel Learning (MKL) in which several kernels are combined adaptively for a given dataset. Many of the
existing MKL methods use the SVM objective function and try to find a
linear combination of basic kernels such that the separating margin between the classes is maximized. However, these methods ignore the fact
that the theoretical error bound depends not only on the margin, but
also on the radius of the smallest sphere that contains all the training
instances. We present a novel MKL algorithm that optimizes the error
bound taking account of both the margin and the radius. The empirical
results show that the proposed method compares favorably with other
state-of-the-art MKL methods.
Keywords: Learning Kernel Combination, Support Vector Machines,
convex optimization.
1 Introduction
Over the last few years kernel methods [1,2], such as Support Vector Machines
(SVM), have proved to be efficient machine learning tools. They work in a feature
space implicitly defined by a positive semi-definite kernel function, which allows
the computation of inner products in feature spaces using only the objects in
the input space.
The main limitation of kernel methods stems from the fact that in general
it is difficult to select a kernel function, and hence a feature mapping, that
is suitable for a given problem. To address this problem several attempts have recently been made to learn kernel operators directly from the
data [3,4,5,6,7,8,9,10,11,12]. The proposed methods differ in the objective functions (e.g. CV risk, margin based, alignment, etc.) as well as in the classes of
kernels that they consider (e.g. combination of finite or infinite set of basic
kernels).
The most popular approach in the context of kernel learning considers a finite set of predefined basic kernels which are combined so that the margin-based
W. Buntine et al. (Eds.): ECML PKDD 2009, Part I, LNAI 5781, pp. 330–343, 2009.
© Springer-Verlag Berlin Heidelberg 2009
objective function of SVM is optimized. The learned kernel K is a linear combination of basic kernels K_i, i.e. K(x, x') = \sum_{i=1}^{M} \mu_i K_i(x, x'), \mu_i \ge 0, where M is the number of basic kernels, and x and x' are input objects. The weights µ_i of the kernels are included in the margin-based objective function. This setting is commonly referred to as Multiple Kernel Learning (MKL).
The MKL formulation was introduced in [3] as a semi-definite programming problem, which scaled well only to small problems. [7] extended that work and proposed a faster method based on the conic duality of MKL, solving the problem with Sequential Minimal Optimization (SMO). [5] reformulated the MKL problem as a semi-infinite linear program. In [6] the authors proposed an adjustment to the cost function of [5] to improve predictive performance. Although
the MKL approach to kernel learning has some limitations (e.g. one has to choose
the basic kernels), it is widely used because of its simplicity, interpretability and
good performance.
The MKL methods that use the SVM objective function do not exploit the
fact that the error bound of SVM depends not only on the separating margin,
but also on the radius of the smallest sphere that encloses the data. In fact
even the standard SVM algorithms do not exploit the latter, because for a given
feature space the radius is fixed. However in the context of MKL the radius is
not fixed but is a function of the weights of the basic kernels.
In this paper we propose a novel MKL method that takes account of both the radius and the margin to optimize the error bound. Following a number of transformations, the problem is cast in a form that can be solved by the two-step optimization algorithm given in [6].
The paper is organized as follows. In Section 2 we introduce the general MKL
framework. Next, in Section 3 we discuss the various error bounds that motivate
the use of the radius. The main contribution of the work is presented in Section 4 where we propose a new method for multiple kernel learning that aims to
optimize the margin- and radius-dependent error bound. In Section 5 we present
the empirical results on several benchmark datasets. Finally, we conclude with
Section 6 where we also present pointers to future work.
2 Multiple Kernel Learning Problem
Consider a mapping of instances x ∈ X to a new feature space H_i:

x → Φ_i(x) ∈ H_i    (1)

This mapping can be performed by a kernel function K_i(x, x'), which is defined as the inner product of the images of two instances x and x' in H_i, i.e. K_i(x, x') = ⟨Φ_i(x), Φ_i(x')⟩; H_i may even have infinite dimensionality. Typically, the computation of the inner product in H_i is done implicitly, i.e. without having to compute explicitly the images Φ_i(x) and Φ_i(x').
2.1 Original Problem Formulation of MKL
Given a set of training examples S = {(x_1, y_1), ..., (x_l, y_l)} and a set of basic kernel functions Z = {K_i(x, x') | i = 1, ..., M}, the goal of MKL is to optimize a cost function Q(f(Z, µ)(x, x'), S), where f(Z, µ)(x, x') is some positive semi-definite function of the set of basic kernels, parametrized by µ; most often a linear combination of the form:

f(Z, \mu)(x, x') = \sum_{i=1}^{M} \mu_i K_i(x, x'), \quad \mu_i \ge 0, \quad \sum_{i=1}^{M} \mu_i = 1    (2)
To simplify notation we will denote f (Z, µ) by Kµ . In the remaining part of
this work we will only focus on the normalized versions of Ki , defined as:
Ki (x, x′ )
.
Ki (x, x′ ) :=
Ki (x, x) · Ki (x′ , x′ )
(3)
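As an illustration (our own sketch, not the authors' code), the normalization in equation 3 can be applied to a Gram matrix as follows; after it, every kernel has a unit diagonal:

```python
import numpy as np

def normalize_gram(K):
    """Normalize a Gram matrix as in equation (3):
    K_ij / sqrt(K_ii * K_jj), so the diagonal becomes 1."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

# A toy inhomogeneous polynomial kernel of degree 2 on three 2-D points.
X = np.array([[1.0, 2.0], [3.0, 0.5], [0.2, 1.1]])
K = (X @ X.T + 1.0) ** 2
Kn = normalize_gram(K)
print(np.allclose(np.diag(Kn), 1.0))  # the normalized kernel has unit diagonal
```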
If Kµ is a linear combination of kernels then its feature space Hµ is given by
the mapping:
x → \Phi_\mu(x) = (\sqrt{\mu_1}\,\Phi_1(x), ..., \sqrt{\mu_M}\,\Phi_M(x))^T \in H_\mu    (4)
where Φi (x) is the mapping to the Hi feature space associated with the Ki
kernel, as this was given in Formula 1.
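For kernels with explicit feature maps, the correspondence between Formula 2 and Formula 4 can be verified directly. A small numpy sketch of ours, using a linear and a homogeneous quadratic kernel as the two basic kernels:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))  # five 2-D input points

def phi1(x):   # explicit map of the linear kernel: Phi_1(x) = x
    return x

def phi2(x):   # explicit map of the homogeneous quadratic kernel (x.x')^2
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

mu = np.array([0.3, 0.7])

# Combined kernel as in Formula 2.
K1 = X @ X.T
K2 = (X @ X.T) ** 2
K_mu = mu[0] * K1 + mu[1] * K2

# Combined feature map as in Formula 4: concatenate sqrt(mu_i)-scaled maps.
F = np.array([np.concatenate([np.sqrt(mu[0]) * phi1(x),
                              np.sqrt(mu[1]) * phi2(x)]) for x in X])
print(np.allclose(F @ F.T, K_mu))  # True: the two constructions agree
```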
In previous work within the MKL context the cost function, Q, has taken
different forms, such as the Kernel Target Alignment, which measures the "goodness" of a kernel for a given learning task [9], or the typical SVM cost function
combining classification error and the margin [3,5,6], or as in [4] any of the above
with an added regularization term for the complexity of the combined kernel.
3 Margin and Radius Based Error Bounds
There are a number of theorems in statistical learning that bound the expected
classification error of the thresholded linear classifier, that corresponds to the
maximum margin hyperplane, by quantities that are related to the margin and
the radius of the smallest sphere that encloses the data. Below we give two of
them that are applicable to linearly separable and non-separable training sets,
respectively.
Theorem 1 ([10]). Given a training set S = {(x_1, y_1), ..., (x_l, y_l)} of size l, a feature space H and a hyperplane (w, b), the margin γ(w, b, S) and the radius R(S) are defined by

\gamma(w, b, S) = \min_{(x_i, y_i) \in S} \frac{y_i(\langle w, \Phi(x_i)\rangle + b)}{\|w\|}

R(S) = \min_{a} \max_{i} \|\Phi(x_i) - a\|

The maximum margin algorithm L_l : (X × Y)^l → H × R takes as input a training set of size l and returns a hyperplane in feature space such that the margin γ(w, b, S) is maximized. Note that assuming the training set is separable means that γ > 0. Under this assumption, for all probability measures P underlying the data S, the expectation of the misclassification probability

p_{err}(w, b) = P(\mathrm{sign}(\langle w, \Phi(X)\rangle + b) \ne Y)

has the bound

E\{p_{err}(L_{l-1}(Z))\} \le \frac{1}{l} E\left\{\frac{R^2(Z)}{\gamma^2(L_l(Z), Z)}\right\}

The expectation is taken over the random draw of a training set Z of size l − 1 for the left hand side and of size l for the right hand side.
The following theorem gives a similar result for the error bound in the linearly non-separable case.

Theorem 2 ([13]). Consider thresholding real-valued linear functions L with unit weight vectors on an inner product space H and fix γ ∈ R⁺. There is a constant c such that, for any probability distribution D on H × {−1, 1} with support in a ball of radius R around the origin, with probability 1 − δ over l random examples S, any hypothesis f ∈ L has error no more than:

err(f)_D \le \frac{c}{l}\left(\frac{R^2 + \|\xi\|_2^2}{\gamma^2}\log^2 l + \log\frac{1}{\delta}\right)    (5)

where ξ = ξ(f, S, γ) is the margin slack vector with respect to f and γ.
It is clear from both theorems that the bound on the expected error depends not
only on the margin but also on the radius of the data, being a function of the
R2 /γ 2 ratio. Nevertheless standard SVM algorithms can ignore the dependency
of the error bound on the radius because for a fixed feature space the radius
is constant and can be simply ignored in the optimization procedure. However
in the MKL scenario where the Hµ feature space is not fixed but depends on
the parameter vector µ the radius is no longer fixed but it is a function of µ
and thus should not be ignored in the optimization procedure. The radius of the
smallest sphere that contains all instances in the H feature space defined by the
Φ(x) mapping is computed by the following formula [14]:
\min_{R, \Phi(x_0)} R^2    (6)

s.t.\ \|\Phi(x_i) - \Phi(x_0)\|^2 \le R^2,\ \forall i
It can be shown that if Kµ is a linear combination of kernels, of the form given
in Formula 2, then for the Rµ radius of its Hµ feature space the following
inequalities hold:
\max_i(\mu_i R_i^2) \le R_\mu^2 \le \sum_{i=1}^{M} \mu_i R_i^2 \le \max_i(R_i^2), \quad s.t.\ \sum_{i=1}^{M} \mu_i = 1    (7)
where Ri is the radius of the component feature space Hi associated with the
Ki kernel. The proof of the above statement is given in the appendix.
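Both the radius (via the dual enclosing-ball problem, stated in the appendix) and the inequalities (7) can be checked numerically. The following sketch is ours and uses a generic scipy solver rather than a dedicated QP code:

```python
import numpy as np
from scipy.optimize import minimize

def radius2(K):
    """Squared radius of the smallest enclosing ball in feature space:
    max over the simplex of  sum_i beta_i K_ii - beta^T K beta."""
    l = K.shape[0]
    obj = lambda b: -(b @ np.diag(K) - b @ K @ b)  # minimize the negative
    cons = [{"type": "eq", "fun": lambda b: b.sum() - 1.0}]
    res = minimize(obj, np.ones(l) / l, bounds=[(0.0, 1.0)] * l,
                   constraints=cons, method="SLSQP")
    return -res.fun

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))
G = X @ X.T
# Two normalized basic kernels (linear and quadratic), as in equation (3).
Ks = []
for K in (G, G ** 2):
    d = np.sqrt(np.diag(K))
    Ks.append(K / np.outer(d, d))
mu = np.array([0.4, 0.6])
K_mu = mu[0] * Ks[0] + mu[1] * Ks[1]

R2_mu = radius2(K_mu)
R2_k = np.array([radius2(K) for K in Ks])
lower, upper = np.max(mu * R2_k), mu @ R2_k
print(f"max(mu_i R_i^2)={lower:.4f}  R_mu^2={R2_mu:.4f}  sum mu_i R_i^2={upper:.4f}")
```

On this example the sandwich max_i(µ_i R_i²) ≤ R_µ² ≤ Σ_i µ_i R_i² holds, as inequality (7) predicts.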
4 MKL with Margin and Radius Optimization
In the next sections we will show how we can make direct use of the dependency
of the error bound both on the margin and the radius in the context of the MKL
problem, in an effort to decrease the error bound further than is possible by optimizing over the margin alone.
4.1 Soft Margin MKL
The standard l2-soft margin SVM is based on Theorem 2: it learns maximal margin hyperplanes while controlling the l2 norm of the slack vector, in an effort to optimize the error bound given in equation 5; as already mentioned, although the radius appears in the error bound it is not considered in the optimization problem, because for a given feature space it is fixed. The exact optimization problem solved by the l2-soft margin SVM is [13]:

\min_{w,b,\xi} \frac{1}{2}\langle w, w\rangle + \frac{C}{2}\sum_{i=1}^{l}\xi_i^2    (8)

s.t.\ y_i(\langle w, \Phi(x_i)\rangle + b) \ge 1 - \xi_i,\ \forall i

The solution hyperplane (w*, b*) of this problem realizes the maximum margin classifier with geometric margin γ = 1/‖w*‖.
When instead of a single kernel we learn with a combination of kernels Kµ
then the radius of the resulting feature space Hµ depends on the parameters
µ which are also learned. We can profit from this additional dependency and
optimize not only for the margin but also for the radius, as Theorems 1 and 2
suggest, in the hope of reducing even more the error bounds than what would
be possible by just focusing on the margin.
A straightforward way to do so is to alter the cost function of the above
optimization problem so that it also includes the radius. Thus we define the primal form of the soft margin MKL optimization problem as follows:

\min_{w,b,\xi,\mu} \frac{1}{2}\langle w, w\rangle R_\mu^2 + \frac{C}{2}\sum_{i=1}^{l}\xi_i^2    (9)

s.t.\ y_i(\langle w, \Phi(x_i)\rangle + b) \ge 1 - \xi_i,\ \forall i
Accounting for the form Φµ of the feature space Hµ , as it is given in equation 4,
this optimization problem can be rewritten as:
\min_{w,b,\xi,\mu} \frac{1}{2}\sum_{k=1}^{M}\langle w_k, w_k\rangle R_\mu^2 + \frac{C}{2}\sum_{i=1}^{l}\xi_i^2    (10)

s.t.\ y_i\Big(\sum_{k=1}^{M}\langle w_k, \sqrt{\mu_k}\,\Phi_k(x_i)\rangle + b\Big) \ge 1 - \xi_i,\ \forall i
where w is the same as that of Formula 9, i.e. equal to (w_1, ..., w_M), and R_µ² can be computed by equation 6. By letting w := √µ.w, i.e. w_k := √µ_k w_k, we can rewrite equation 10 as¹:

\min_{w,b,\xi,\mu} \frac{1}{2}\sum_{k=1}^{M}\frac{\langle w_k, w_k\rangle}{\mu_k} R_\mu^2 + \frac{C}{2}\sum_{i=1}^{l}\xi_i^2    (11)

s.t.\ y_i\Big(\sum_{k=1}^{M}\langle w_k, \Phi_k(x_i)\rangle + b\Big) \ge 1 - \xi_i, \quad \sum_{k=1}^{M}\mu_k = 1,\ \mu_k \ge 0,\ \forall k

The non-negativity of µ is required to guarantee that the kernel combination is a valid kernel function; the constraint Σ_{k=1}^{M} µ_k = 1 is added to make the solution interpretable (a kernel with a bigger weight can be interpreted as a more important one) and to obtain a specific solution (note that if µ is a solution of 11 without the constraint Σ_{k=1}^{M} µ_k = 1, then λµ, λ ∈ R⁺, is also a solution).
We will denote the cost function of equation 11 by F(w, b, ξ, µ). F is not a convex function; this is probably the main reason why in current MKL algorithms the radius is simply removed from the original cost function, and therefore they do not really optimize the generalization error bound.
From the set of inequalities given in equation 7 we have R_\mu^2 \le \sum_{k=1}^{M}\mu_k R_k^2, and from this we can get:

F(w, b, \xi, \mu) = \frac{1}{2}\sum_{k=1}^{M}\frac{\langle w_k, w_k\rangle}{\mu_k} R_\mu^2 + \frac{C}{2}\sum_{i=1}^{l}\xi_i^2
\le \frac{1}{2}\sum_{k=1}^{M}\frac{\langle w_k, w_k\rangle}{\mu_k}\sum_{k=1}^{M}\mu_k R_k^2 + \frac{C}{2}\sum_{i=1}^{l}\xi_i^2    (12)
= \Big(\frac{1}{2}\sum_{k=1}^{M}\frac{\langle w_k, w_k\rangle}{\mu_k} + \frac{C}{2\sum_{k=1}^{M}\mu_k R_k^2}\sum_{i=1}^{l}\xi_i^2\Big)\sum_{k=1}^{M}\mu_k R_k^2
\le \frac{1}{2}\sum_{k=1}^{M}\frac{\langle w_k, w_k\rangle}{\mu_k} + \frac{C}{2\sum_{k=1}^{M}\mu_k R_k^2}\sum_{i=1}^{l}\xi_i^2
= \hat{F}(w, b, \xi, \mu)

The last inequality holds because, in the context we examine, Σ_{k=1}^{M} µ_k R_k² ≤ 1. This is a result of the fact that we work with normalized feature spaces, using the normalized kernels as defined in equation 3; thus we have R_k² ≤ 1 and, since Σ_{k=1}^{M} µ_k = 1, it holds that Σ_{k=1}^{M} µ_k R_k² ≤ 1. Since F̂ is an upper bound of F and moreover it is convex², we are going to use it as our objective function. As a result, we propose to solve, instead of the original soft margin optimization problem given in equation 11, the following upper bounding convex optimization problem:
¹ Note that if µ_k = 0 then from the dual form we have w_k = 0. In this case, we use the convention that 0/0 = 0.
² The convexity of this new function can be easily proved by showing that the Hessian matrix is positive semi-definite.
\min_{w,b,\xi,\mu} \frac{1}{2}\sum_{k=1}^{M}\frac{\langle w_k, w_k\rangle}{\mu_k} + \frac{C}{2\sum_{k=1}^{M}\mu_k R_k^2}\sum_{i=1}^{l}\xi_i^2    (13)

s.t.\ y_i\Big(\sum_{k=1}^{M}\langle w_k, \Phi_k(x_i)\rangle + b\Big) \ge 1 - \xi_i, \quad \sum_{k=1}^{M}\mu_k = 1,\ \mu_k \ge 0,\ \forall k
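The convexification rests on the chain of inequalities (12): a rewriting identity followed by dropping the factor Σ_k µ_k R_k² ≤ 1. Both steps can be sanity-checked numerically with randomly chosen quantities (a purely illustrative sketch of ours):

```python
import numpy as np

rng = np.random.default_rng(2)
M, l, C = 4, 6, 10.0
mu = rng.dirichlet(np.ones(M))            # simplex weights
w_norm2 = rng.uniform(0.1, 2.0, size=M)   # stands in for <w_k, w_k>
xi2 = rng.uniform(0.0, 1.0, size=l)       # stands in for xi_i^2
R2 = rng.uniform(0.1, 1.0, size=M)        # R_k^2 <= 1, as for normalized kernels

A = 0.5 * np.sum(w_norm2 / mu)            # the margin part of the objective
S = mu @ R2                               # sum_k mu_k R_k^2 (here <= 1)
middle = (A + C / (2 * S) * xi2.sum()) * S
F_hat = A + C / (2 * S) * xi2.sum()

# The rewriting in (12) is an exact identity...
print(np.isclose(A * S + 0.5 * C * xi2.sum(), middle))  # True
# ...and dropping the factor S <= 1 can only increase the value.
print(middle <= F_hat)  # True
```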
The dual function of this optimization problem is:
W_s(\alpha, \mu) = -\frac{1}{2}\sum_{ij}^{l}\alpha_i\alpha_j y_i y_j \sum_{k=1}^{M}\mu_k K_k(x_i, x_j) + \sum_{i=1}^{l}\alpha_i - \frac{\sum_{k=1}^{M}\mu_k R_k^2}{2C}\langle\alpha, \alpha\rangle    (14)
= -\frac{1}{2}\sum_{ij}^{l}\alpha_i\alpha_j y_i y_j \Big(\sum_{k=1}^{M}\mu_k K_k(x_i, x_j) + \frac{\sum_{k=1}^{M}\mu_k R_k^2}{C}\delta_{ij}\Big) + \sum_{i=1}^{l}\alpha_i
where δij is the Kronecker δ defined to be 1 if i = j and 0 otherwise. The dual
optimization problem is given as:
\max_{\alpha,\mu} W_s(\alpha, \mu)    (15)

s.t.\ \sum_{i}\alpha_i y_i = 0, \quad \alpha_i \ge 0,\ \forall i

\sum_{ij}\alpha_i\alpha_j y_i y_j K_k(x_i, x_j) = \frac{R_k^2\, C \sum_{i}^{l}\xi_i^2}{(\sum_k \mu_k R_k^2)^2},\ \forall k
In the next section we will show how we can solve this new optimization problem.
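The equality of the two expressions for the dual function (14), the explicit ⟨α, α⟩ term versus the Kronecker-δ form folded into the quadratic, can be checked numerically (an illustrative sketch of ours):

```python
import numpy as np

rng = np.random.default_rng(3)
l, M, C = 7, 3, 1.0
y = rng.choice([-1.0, 1.0], size=l)
alpha = rng.uniform(0.0, 1.0, size=l)
mu = rng.dirichlet(np.ones(M))
R2 = rng.uniform(0.1, 1.0, size=M)
X = rng.normal(size=(l, 2))
Ks = np.stack([(X @ X.T + 1.0) ** d for d in (1, 2, 3)])  # three basic kernels
K_mu = np.tensordot(mu, Ks, axes=1)
s = mu @ R2

ay = alpha * y
form1 = -0.5 * ay @ K_mu @ ay + alpha.sum() - s / (2 * C) * (alpha @ alpha)
form2 = -0.5 * ay @ (K_mu + (s / C) * np.eye(l)) @ ay + alpha.sum()
print(np.isclose(form1, form2))  # True: y_i^2 = 1 makes the delta term <alpha, alpha>
```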
4.2 Algorithm
The dual function 14 is quadratic with respect to α and linear with respect to
µ. One way to solve the optimization problem 13 is to use a two step iterative
algorithm such as the ones described in [6], [10]. Following such a two step
approach, in the first step we will solve a quadratic problem that optimizes over
(w, b), while keeping µ fixed; as a consequence the resulting dual function is a
simple quadratic function of α which can be optimized easily. In the second step
we will solve a linear problem that optimizes over µ.
More precisely, the formulation of the optimization problem with the two-step
approach takes the following form:
\min_{\mu} J(\mu)    (16)

s.t.\ \sum_{k=1}^{M}\mu_k = 1,\ \mu_k \ge 0,\ \forall k

where

J(\mu) = \begin{cases} \min_{w,b} \frac{1}{2}\sum_{k=1}^{M}\frac{\langle w_k, w_k\rangle}{\mu_k} + \frac{C}{2\sum_{k=1}^{M}\mu_k R_k^2}\sum_{i=1}^{l}\xi_i^2 \\ s.t.\ y_i\big(\sum_{k=1}^{M}\langle w_k, \Phi_k(x_i)\rangle + b\big) \ge 1 - \xi_i \end{cases}    (17)
To solve the outer optimization problem, i.e. min_µ J(µ), we use a gradient descent method. At each iteration, we fix µ, compute the value of J(µ) and then compute
the gradient of J(µ) with respect to µ. The dual function of Formula 17 is the
Ws (α, µ) function already given in Formula 14. Since µ is fixed we now optimize
only over α (the resulting dual optimization problem is much simpler compared
to the original soft margin dual optimization problem given in Formula 15):
\max_{\alpha} W_s(\alpha, \mu)

s.t.\ \sum_{i}\alpha_i y_i = 0, \quad \alpha_i \ge 0,\ \forall i

which has the same form as the SVM quadratic optimization problem; the only difference is that the C parameter here is equal to C / \sum_{k=1}^{M}\mu_k R_k^2.
By strong duality, at the optimal solution α*, the values of the dual cost function and the primal cost function are equal. Thus the value of W_s(α, µ), and hence the J(µ) value, is given by:

W_s(\alpha^*, \mu) = -\frac{1}{2}\sum_{ij}^{l}\alpha_i^*\alpha_j^* y_i y_j \Big(\sum_{k=1}^{M}\mu_k K_k(x_i, x_j) + \frac{\sum_{k=1}^{M}\mu_k R_k^2}{C}\delta_{ij}\Big) + \sum_{i=1}^{l}\alpha_i^*
The last step of the algorithm is to compute the gradient of the J(µ) function, Formula 17, with respect to µ. As [6] have pointed out, we can use the theorem of Bonnans and Shapiro [15] to compute gradients of such functions. Hence, the gradient has the following form:

\frac{\partial J(\mu)}{\partial \mu_k} = -\frac{1}{2}\sum_{ij}^{l}\alpha_i^*\alpha_j^* y_i y_j \Big(K_k(x_i, x_j) + \frac{R_k^2}{C}\delta_{ij}\Big)
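Since W_s is linear in µ for fixed α, the gradient formula can be verified exactly against a central finite difference; a small sketch of ours:

```python
import numpy as np

rng = np.random.default_rng(4)
l, M, C = 6, 3, 1.0
y = rng.choice([-1.0, 1.0], size=l)
alpha = rng.uniform(0.0, 1.0, size=l)
X = rng.normal(size=(l, 2))
Ks = np.stack([(X @ X.T + 1.0) ** d for d in (1, 2, 3)])
R2 = rng.uniform(0.1, 1.0, size=M)

def Ws(mu):
    # Dual function (14) at fixed alpha, in the Kronecker-delta form.
    K_eff = np.tensordot(mu, Ks, axes=1) + (mu @ R2 / C) * np.eye(l)
    ay = alpha * y
    return -0.5 * ay @ K_eff @ ay + alpha.sum()

mu = rng.dirichlet(np.ones(M))
ay = alpha * y
grad = np.array([-0.5 * ay @ (Ks[k] + (R2[k] / C) * np.eye(l)) @ ay
                 for k in range(M)])

h = 1e-6
fd = np.zeros(M)
for k in range(M):
    e = np.zeros(M); e[k] = h
    fd[k] = (Ws(mu + e) - Ws(mu - e)) / (2 * h)
print(np.allclose(grad, fd))  # the analytic gradient matches the finite difference
```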
To compute the optimal step in the gradient descent we used line search. The
complete two-step procedure is given in Algorithm 1.
Algorithm 1. R-MKL
  Initialize µ_k^1 = 1/M for k = 1, ..., M
  repeat
    Set R_µ² = Σ_{k=1}^{M} µ_k^t R_k²
    Compute J(µ^t) as the solution of a quadratic optimization problem with K := Σ_{k=1}^{M} µ_k^t K_k
    Compute ∂J/∂µ_k for k = 1, ..., M
    Compute the optimal step γ_t
    µ^{t+1} ← µ^t + γ_t ∂J(µ)/∂µ
  until stopCriteria is true
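A compact, purely illustrative version of the two-step procedure follows. It is not the authors' implementation: it uses a generic scipy solver in place of a dedicated SVM QP, a fixed gradient step with a simplex projection instead of the line search, and a cheap centroid-based surrogate for the radii R_k² instead of solving (6) exactly:

```python
import numpy as np
from scipy.optimize import minimize

def centroid_radius2(K):
    # Squared distance of the farthest point from the feature-space centroid;
    # a cheap stand-in for the exact minimal-enclosing-ball radius of (6).
    return np.max(np.diag(K) - 2 * K.mean(axis=1) + K.mean())

def svm_dual(K_eff, y):
    # l2-soft-margin SVM dual: max sum(a) - 1/2 (a*y)^T K_eff (a*y),
    # s.t. a >= 0, sum_i a_i y_i = 0 (solved generically with SLSQP).
    l = len(y)
    Q = np.outer(y, y) * K_eff
    res = minimize(lambda a: 0.5 * a @ Q @ a - a.sum(),
                   np.ones(l) / l, jac=lambda a: Q @ a - 1.0,
                   bounds=[(0.0, None)] * l,
                   constraints=[{"type": "eq", "fun": lambda a: a @ y}],
                   method="SLSQP")
    return res.x, -res.fun

def project_simplex(v):
    # Euclidean projection onto {mu : mu >= 0, sum mu = 1}.
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(v - theta, 0.0)

def r_mkl(Ks, y, C=1.0, steps=5, lr=0.1):
    M, l = len(Ks), len(y)
    R2 = np.array([centroid_radius2(K) for K in Ks])
    mu = np.ones(M) / M
    for _ in range(steps):
        s = mu @ R2
        K_eff = np.tensordot(mu, Ks, axes=1) + (s / C) * np.eye(l)
        a, J = svm_dual(K_eff, y)
        ay = a * y
        grad = np.array([-0.5 * ay @ (Ks[k] + (R2[k] / C) * np.eye(l)) @ ay
                         for k in range(M)])
        mu = project_simplex(mu - lr * grad)  # descend, stay on the simplex
    return mu, J

rng = np.random.default_rng(5)
X = np.r_[rng.normal(-1, 0.5, (6, 2)), rng.normal(1, 0.5, (6, 2))]
y = np.r_[-np.ones(6), np.ones(6)]
G = X @ X.T + 1.0
Ks = np.stack([K / np.outer(np.sqrt(np.diag(K)), np.sqrt(np.diag(K)))
               for K in (G, G ** 2)])
mu, J = r_mkl(Ks, y)
print(mu, J)
```

The projection step replaces the paper's line search only for brevity; the gradient itself is the one given above.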
4.3 Computational Complexity
At each iteration we have to compute the solution of a standard SVM with kernel K = Σ_{k=1}^{M} µ_k K_k and C equal to C / Σ_{k=1}^{M} µ_k R_k², which is a quadratic programming problem with a complexity of O(n³), where n is the number of instances. Moreover, when µ is updated we have to recompute the approximation of R_µ²; the complexity of this procedure is linear in the number of kernels, O(M).
5 Experiments
We experimented with ten different datasets. Six of them were taken from the
UCI repository (Ionosphere, Liver, Sonar, Wdbc, Wpbc, Musk1), while four come
from the domain of genomics and proteomics (ColonCancer, CentralNervousSystem, FemaleVsMale, Leukemia) [16]; these four are characterized by small sample sizes and high dimensionality. A short description of the datasets is given in Table 1. We experimented with two different types of basic kernels, i.e. polynomial and Gaussian, and performed two sets of experiments. In the first set of experiments we used both types of kernels and in the second one we focused only on Gaussian kernels. For each set of experiments the total number of basic kernels was 20; for the first set we used polynomial kernels of degree one, two, and three and 17 Gaussians with bandwidth δ that ranged from 1 to 17 with a step of one; for the second set of experiments we only used Gaussian kernels with bandwidth δ that ranged from 1 to 20 with a step of one.
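The bank of 20 basic kernels for the first set of experiments can be generated along the following lines (a sketch of ours; the Gaussian is parametrized here as exp(−‖x − x′‖²/(2δ²)), which is one plausible reading of "bandwidth δ"):

```python
import numpy as np

def kernel_bank(X):
    """3 polynomial kernels (degree 1..3) + 17 Gaussians (delta = 1..17),
    each normalized as in equation (3)."""
    G = X @ X.T
    sq = np.diag(G)[:, None] + np.diag(G)[None, :] - 2 * G  # squared distances
    Ks = [(G + 1.0) ** d for d in (1, 2, 3)]
    Ks += [np.exp(-sq / (2.0 * delta ** 2)) for delta in range(1, 18)]
    out = []
    for K in Ks:
        d = np.sqrt(np.diag(K))
        out.append(K / np.outer(d, d))
    return out

X = np.random.default_rng(6).normal(size=(10, 4))
bank = kernel_bank(X)
print(len(bank))  # 20 basic kernels
print(all(np.allclose(np.diag(K), 1.0) for K in bank))  # all normalized
```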
We compared our MKL algorithm (denoted R-MKL) with two state-of-the-art MKL algorithms: the Support Kernel Machine (SKM) [7] and SimpleMKL [6]. We estimated the classification error using 10-fold cross-validation. For comparison purposes we also provide the performances of the best single kernel (BK) and the majority classifier (MC); the latter always predicts the majority class. The performance of BK is that of the best single kernel, also estimated by 10-fold cross-validation; since it is the best result after seeing the performance of all individual kernels on the available data, it is optimistically biased. We tuned the parameter C with an inner-loop 10-fold cross-validation, choosing its value from the set {0.1, 1, 10, 100}. All algorithms terminate when the duality gap is smaller than 0.01. All input kernel matrices are normalized by equation 3.
Table 1. Short description of the classification datasets used
Dataset         #Inst  #Attr  #Class1  #Class2
Ionosphere        351     34      126      225
Liver             345      6      145      200
Sonar             208     60       97      111
Wdbc              569     32      357      212
Wpbc              198     34      151       47
Musk1             476    166      269      207
ColonCancer        62   2000       40       22
CentralNervous     60   7129       21       39
FemaleVSMale      134   1524       67       67
Leukemia           72   7128       25       47
We assessed the significance of the performance differences of the algorithms with McNemar's test [17], with the level of significance set to 0.05. We also established a ranking schema for the examined MKL algorithms based on the results of the pairwise comparisons [18]. More precisely, if an algorithm is significantly better than another it is credited with one point; if there is no significant difference between two algorithms they are each credited with 0.5 points; finally, if an algorithm is significantly worse than another it is credited with zero points. Thus, if an algorithm is significantly better than all the others for a given dataset it has a score of two. We give the full results in Tables 2 and 3. For each algorithm we report a triplet in which the first element is the estimated classification error, the second is the number of selected kernels, and the last is the rank described above.
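The ranking schema can be implemented as a small function; `outcome` below is a hypothetical helper that reports, for an ordered pair of algorithms on one dataset, +1 (first significantly better), −1 (first significantly worse) or 0 (no significant difference):

```python
def significance_points(names, outcome):
    """Credit 1 / 0.5 / 0 points per pairwise comparison, as described above."""
    points = {n: 0.0 for n in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            o = outcome(a, b)
            if o > 0:
                points[a] += 1.0
            elif o < 0:
                points[b] += 1.0
            else:
                points[a] += 0.5
                points[b] += 0.5
    return points

# Toy example: R-MKL significantly better than both others, SKM and Simple tied.
results = {("R-MKL", "SKM"): 1, ("R-MKL", "Simple"): 1, ("SKM", "Simple"): 0}
pts = significance_points(["R-MKL", "SKM", "Simple"],
                          lambda a, b: results[(a, b)])
print(pts)  # R-MKL scores 2, the tied pair 0.5 each
```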
Our kernel combination algorithm does remarkably well in the first set of
experiments, Table 2, in which it is significantly better than both other algorithms in four datasets and significantly worse in two; for the four remaining
datasets there are no significant differences. Note that in the cases of Wpbc and
Table 2. Results for the first experiments, where both polynomial and Gaussian kernels
are used. Each triplet x,y,z gives respectively the classification error, the number of
selected kernels, and the number of significance points that the algorithm scores for the
given experiment set and dataset. Columns BK and MC give the errors of the best
single kernel and the majority classifier, respectively.
D. Set    SKM           Simple        R-MKL        BK     MC
Ionos.    04.00,02,1.5  03.71,02,1.5  04.86,02,0   05.71  36.00
Liver     33.82,05,1.5  33.53,13,1.5  36.18,03,0   30.29  42.06
Sonar     15.50,01,1.0  15.50,01,1.0  15.50,01,1   17.50  46.00
Wdbc      11.25,03,1.0  13.04,18,0.0  03.75,18,2   08.57  37.32
Wpbc      23.68,17,1.0  23.68,01,1.0  23.68,01,1   23.68  23.68
Musk1     11.70,07,1.0  13.40,18,0.0  06.60,01,2   04.47  43.83
Colon.    18.33,18,1.0  18.33,18,1.0  16.67,18,1   11.67  35.00
CentNe.   35.00,17,1.0  35.00,17,1.0  35.00,17,1   31.67  35.00
Female.   33.85,20,1.0  38.92,18,0.0  20.00,18,2   22.31  60.00
Leuke.    07.14,18,0.5  07.14,18,0.5  02.86,18,2   02.86  34.29
Table 3. Results for the second set of experiments, where only Gaussian kernels are
used. The table contains the same information as the previous one.
D. Set    SKM           Simple        R-MKL         BK     MC
Ionos.    04.86,02,1.0  05.43,04,1.0  05.14,03,1.0  05.14  36.00
Liver     33.53,03,1.0  33.53,20,1.0  33.53,16,1.0  34.71  42.06
Sonar     15.50,01,1.0  15.50,01,1.0  15.50,01,1.0  17.00  46.00
Wdbc      37.32,03,1.0  37.32,19,1.0  37.32,20,1.0  37.32  37.32
Wpbc      23.68,20,1.0  23.68,19,1.0  23.68,20,1.0  23.68  23.68
Musk1     43.83,14,1.0  43.83,19,1.0  43.83,20,1.0  43.83  43.83
Colon.    NA            35.00,20,1.5  35.00,20,1.5  35.00  35.00
CentNe.   35.00,20,1.0  35.00,20,1.0  35.00,20,1.0  35.00  35.00
Female.   60.00,20,1.0  60.00,20,1.0  60.00,20,1.0  60.00  60.00
Leuke.    34.29,20,1.0  34.29,20,1.0  34.29,20,1.0  34.29  34.29
CentralNervousSystem all algorithms have a performance that is similar to that of the majority classifier, i.e. the learned models do not have any discriminatory power. By examining the classification performances of the individual kernels on these datasets we see that none of them had a performance better than that of the majority classifier; this could explain the bad behavior of the different kernel combination schemata. Overall, for this set of experiments R-MKL gets 12 significance points over the different datasets, SKM 10.5, and SimpleMKL 7.5. The performance improvements of R-MKL over the two other methods are quite impressive on those datasets on which R-MKL performs well; more precisely, its classification error is around 30%, 50%, and 40% of that of the other algorithms for the Wdbc, Musk1, and Leukemia datasets, respectively.
In the second set of experiments, Table 3, all methods perform very poorly on seven out of ten datasets; their classification performance is similar to that of the majority classifier. On the remaining datasets, with the exception of ColonCancer for which SKM failed (we used the implementation provided by the authors of the algorithm and it returned with errors), there is no significant difference between the three algorithms. The collectively bad performance on the last seven datasets is explained by the fact that none of the basic kernels had a classification error better than that of the majority classifier. Overall, for this set of experiments SKM scores 9 points, and SimpleMKL and R-MKL score 10.5 points each.
Comparing the number of selected kernels by the different kernel combination
methods using a paired t-test (significance level of 0.05) revealed no statistically
significant differences between the three algorithms on both sets of experiments.
In an effort to get an empirical estimate of the quality of the approximation of the radius that we used to make the optimization problems convex, we computed the approximation error defined as (Σ_{k=1}^{M} µ_k R_k² − R_µ²) / R_µ². We computed this error over the different folds of the ten-fold cross-validation for each dataset. The average approximation error over the different datasets was 0.0056. We also computed this error over 1000 random values of µ for each dataset and the average error was 0.0104. Thus, the empirical evidence seems to indicate that the R_µ² ≤ Σ_{k=1}^{M} µ_k R_k² bound is relatively tight, at least for the datasets we examined.
6 Conclusion and Future Work
In this paper we presented a new kernel combination method that incorporates
in its cost function not only the margin but also the radius of the smallest sphere
that encloses the data. This idea is a direct implementation of well known error
bounds from statistical learning theory. To the best of our knowledge this is
the first work in which the radius is used together with the margin in an effort
to minimize the generalization error. Even though the resulting optimization
problems were non-convex and we had to use an upper bound on the radius to
get convex forms, the empirical results were quite encouraging. In particular, our
method competed with other state-of-the-art methods for kernel combination,
thus demonstrating the benefit and the potential of the proposed technique.
Finally, we mention that it is still a challenging research direction to fully exploit
the examined generalization bound.
In future work we would like to examine optimization techniques for directly solving the non-convex optimization problem presented in Formula 11. In particular, we will examine whether it is possible to decompose the cost function as a sum of convex and concave functions, or to represent it as d.m. functions (difference of two monotonic functions) [19,20]. Additionally, we plan to analyze the bound R_µ² ≤ Σ_{k=1}^{M} µ_k R_k² and see how it relates to the real optimal value.
Acknowledgments. The work reported in this paper was partially funded by
the European Commission through EU projects DropTop (FP6-037739), DebugIT (FP7-217139) and e-LICO (FP7-231519). The support of the Swiss NSF
(Grant 200021-122283/1) is also gratefully acknowledged.
References
1. Shawe-Taylor, J., Cristianini, N.: Kernel methods for pattern analysis. Cambridge
University Press, Cambridge (2004)
2. Schölkopf, B., Smola, A.J.: Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge (2001)
3. Lanckriet, G., Cristianini, N., Bartlett, P., Ghaoui, L.E.: Learning the kernel matrix
with semidefinite programming. Journal of Machine Learning Research 5, 27–72
(2004)
4. Ong, C.S., Smola, A.J., Williamson, R.C.: Learning the kernel with hyperkernels.
Journal of Machine Learning Research 6, 1043–1071 (2005)
5. Sonnenburg, S., Rätsch, G., Schäfer, C.: A general and efficient multiple kernel
learning algorithm. Journal of Machine Learning Research 7, 1531–1565 (2006)
6. Bach, F., Rakotomamonjy, A., Canu, S., Grandvalet, Y.: SimpleMKL. Journal of
Machine Learning Research (2008)
342
H. Do et al.
7. Bach, F.R., Lanckriet, G.R.G., Jordan, M.I.: Multiple kernel learning, conic duality, and the SMO algorithm. In: ICML 2004: Proceedings of the Twenty-First International Conference on Machine Learning, p. 6. ACM, New York (2004)
8. Lanckriet, G., Bie, T.D., Cristianini, N.: A statistical framework for genomic data
fusion. Bioinformatics 20 (2004)
9. Cristianini, N., Shawe-Taylor, J., Elisseeff, A.: On kernel-target alignment. Journal
of Machine Learning Research (2002)
10. Chapelle, O., Vapnik, V., Bousquet, O., Mukherjee, S.: Choosing multiple parameters for support vector machines. Machine Learning 46(1-3), 131–159 (2002)
11. Crammer, K., Keshet, J., Singer, Y.: Kernel design using boosting. In: Advances
in Neural Information Processing Systems, vol. 14. MIT Press, Cambridge (2002)
12. Bousquet, O., Herrmann, D.: On the complexity of learning the kernel matrix. In:
Advances in Neural Information Processing Systems, vol. 14. MIT Press, Cambridge (2003)
13. Cristianini, N., Shawe-Taylor, J.: An introduction to Support Vector Machines.
Cambridge University Press, Cambridge (2000)
14. Vapnik, V.: Statistical learning theory. Wiley Interscience, Hoboken (1998)
15. Bonnans, J., Shapiro, A.: Optimization problems with perturbation: A guided tour.
SIAM Review 40(2), 202–227 (1998)
16. Kalousis, A., Prados, J., Hilario, M.: Stability of feature selection algorithms: a
study on high dimensional spaces. Knowledge and Information Systems 12(1), 95–
116 (2007)
17. McNemar, Q.: Note on the sampling error of the difference between correlated
proportions or percentages. Psychometrika 12, 153–157 (1947)
18. Kalousis, A., Theoharis, T.: Noemon: Design, implementation and performance
results for an intelligent assistant for classifier selection. Intelligent Data Analysis
Journal 3, 319–337 (1999)
19. Liberti, L., Maculan, N. (eds.): Global Optimization: From Theory to Implementation. Springer, Heidelberg (2006)
20. Collobert, R., Weston, J., Bottou, L.: Trading convexity for scalability. In: Proceedings of the 23rd International Conference on Machine Learning (2006)
21. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)
Appendix

Proof of Inequality 7. If K(x, x') is the kernel function associated with the Φ(x) mapping, then the computation of the radius in the dual form is given in [1]:

R^2 = \max_{\beta} \sum_{i=1}^{l}\beta_i K(x_i, x_i) - \sum_{ij}^{l}\beta_i\beta_j K(x_i, x_j)    (18)

s.t.\ \sum_{i=1}^{l}\beta_i = 1,\ \beta_i \ge 0

If β* is the optimal solution of (18) when K = K_µ = Σ_{k=1}^{M} µ_k K_k, and β̂^k is the optimal solution of (18) when K = K_k, i.e.:

R_\mu^2 = \sum_{k=1}^{M}\mu_k\Big(\sum_{i=1}^{l}\beta_i^* K_k(x_i, x_i) - \sum_{i,j=1}^{l}\beta_i^*\beta_j^* K_k(x_i, x_j)\Big)

R_k^2 = \sum_{i=1}^{l}\hat\beta^k_i K_k(x_i, x_i) - \sum_{i,j=1}^{l}\hat\beta^k_i\hat\beta^k_j K_k(x_i, x_j)

then, since β̂^k maximizes (18) for K_k,

\sum_{i=1}^{l}\beta_i^* K_k(x_i, x_i) - \sum_{i,j=1}^{l}\beta_i^*\beta_j^* K_k(x_i, x_j) \le \sum_{i=1}^{l}\hat\beta^k_i K_k(x_i, x_i) - \sum_{i,j=1}^{l}\hat\beta^k_i\hat\beta^k_j K_k(x_i, x_j)

Therefore R_µ² ≤ Σ_{k=1}^{M} µ_k R_k².

Proof of convexity of R-MKL (Eq. 13). To prove that 13 is convex, it is enough to show that the functions x²/µ, where x ∈ R, µ ∈ R⁺, and ξ²/Σ_{k=1}^{M} α_k µ_k, where ξ ∈ R, µ_k, α_k ∈ R⁺, are convex. The first is a quadratic-over-linear function, which is convex. The second is convex because its epigraph is a convex set [21].