Margin and Radius Based Multiple Kernel
Learning
Huyen Do, Alexandros Kalousis, Adam Woznica, and Melanie Hilario
University of Geneva, Computer Science Department
7, route de Drize, Battelle batiment A, 1227 Carouge, Switzerland
{Huyen.Do,Alexandros.Kalousis,Adam.Woznica,Melanie.Hilario}@unige.ch
Abstract. A serious drawback of kernel methods, and Support Vector
Machines (SVM) in particular, is the difficulty in choosing a suitable
kernel function for a given dataset. One of the approaches proposed to
address this problem is Multiple Kernel Learning (MKL) in which several kernels are combined adaptively for a given dataset. Many of the
existing MKL methods use the SVM objective function and try to find a
linear combination of basic kernels such that the separating margin between the classes is maximized. However, these methods ignore the fact
that the theoretical error bound depends not only on the margin, but
also on the radius of the smallest sphere that contains all the training
instances. We present a novel MKL algorithm that optimizes the error
bound taking account of both the margin and the radius. The empirical
results show that the proposed method compares favorably with other
state-of-the-art MKL methods.
Keywords: Learning Kernel Combination, Support Vector Machines,
convex optimization.
1 Introduction
Over the last few years kernel methods [1,2], such as Support Vector Machines
(SVM), have proved to be efficient machine learning tools. They work in a feature
space implicitly defined by a positive semi-definite kernel function, which allows
the computation of inner products in feature spaces using only the objects in
the input space.
The main limitation of kernel methods stems from the fact that in general
it is difficult to select a kernel function, and hence a feature mapping, that
is suitable for a given problem. To address this problem several attempts have recently been made to learn kernel operators directly from the
data [3,4,5,6,7,8,9,10,11,12]. The proposed methods differ in the objective functions (e.g. CV risk, margin based, alignment, etc.) as well as in the classes of
kernels that they consider (e.g. combination of finite or infinite set of basic
kernels).
The most popular approach in the context of kernel learning considers a finite set of predefined basic kernels which are combined so that the margin-based
W. Buntine et al. (Eds.): ECML PKDD 2009, Part I, LNAI 5781, pp. 330–343, 2009.
© Springer-Verlag Berlin Heidelberg 2009
objective function of SVM is optimized. The learned kernel K is a linear combination of basic kernels K_i, i.e. K(x, x') = \sum_{i=1}^{M} \mu_i K_i(x, x'), \mu_i \ge 0, where M is the number of basic kernels, and x and x' are input objects. The weights µ_i of the kernels are included in the margin-based objective function. This setting is commonly referred to as Multiple Kernel Learning (MKL).
The MKL formulation was introduced in [3] as a semi-definite programming problem, which scaled well only to small problems. [7] extended that work and proposed a faster method based on the conic duality of MKL, solving the problem with Sequential Minimal Optimization (SMO). [5] reformulated the MKL problem as a semi-infinite linear program. In [6] the authors proposed an adjustment to the cost function of [5] to improve predictive performance. Although
the MKL approach to kernel learning has some limitations (e.g. one has to choose
the basic kernels), it is widely used because of its simplicity, interpretability and
good performance.
The MKL methods that use the SVM objective function do not exploit the
fact that the error bound of SVM depends not only on the separating margin,
but also on the radius of the smallest sphere that encloses the data. In fact
even the standard SVM algorithms do not exploit the latter, because for a given
feature space the radius is fixed. However in the context of MKL the radius is
not fixed but is a function of the weights of the basic kernels.
In this paper we propose a novel MKL method that takes account of both the radius and the margin to optimize the error bound. Following a number of transformations, the problem is cast in a form that can be solved by the two-step optimization algorithm given in [6].
The paper is organized as follows. In Section 2 we introduce the general MKL
framework. Next, in Section 3 we discuss the various error bounds that motivate
the use of the radius. The main contribution of the work is presented in Section 4 where we propose a new method for multiple kernel learning that aims to
optimize the margin- and radius-dependent error bound. In Section 5 we present
the empirical results on several benchmark datasets. Finally, we conclude with
Section 6 where we also present pointers to future work.
2 Multiple Kernel Learning Problem
Consider a mapping of instances x ∈ X to a new feature space H_i:

x → Φ_i(x) ∈ H_i    (1)

This mapping can be performed by a kernel function K_i(x, x'), which is defined as the inner product of the images of two instances x and x' in H_i, i.e. K_i(x, x') = ⟨Φ_i(x), Φ_i(x')⟩; H_i may even have infinite dimensionality. Typically, the computation of the inner product in H_i is done implicitly, i.e. without having to compute explicitly the images Φ_i(x) and Φ_i(x').
2.1 Original Problem Formulation of MKL
Given a set of training examples S = {(x_1, y_1), ..., (x_l, y_l)} and a set of basic kernel functions Z = {K_i(x, x') | i = 1, ..., M}, the goal of MKL is to optimize a cost function Q(f(Z, µ)(x, x'), S), where f(Z, µ)(x, x') is some positive semi-definite function of the set of basic kernels, parametrized by µ; most often a linear combination of the form:

f(Z, \mu)(x, x') = \sum_{i=1}^{M} \mu_i K_i(x, x'), \quad \mu_i \ge 0, \quad \sum_{i=1}^{M} \mu_i = 1    (2)
To simplify notation we will denote f (Z, µ) by Kµ . In the remaining part of
this work we will only focus on the normalized versions of Ki , defined as:
Ki (x, x′ )
.
Ki (x, x′ ) :=
Ki (x, x) · Ki (x′ , x′ )
(3)
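As an illustration (our own sketch, not the authors' code), the normalization in equation 3 can be applied to a Gram matrix as follows; after it, every kernel has a unit diagonal:

```python
import numpy as np

def normalize_gram(K):
    """Normalize a Gram matrix as in equation (3):
    K_ij / sqrt(K_ii * K_jj), so the diagonal becomes 1."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

# A toy inhomogeneous polynomial kernel of degree 2 on three 2-D points.
X = np.array([[1.0, 2.0], [3.0, 0.5], [0.2, 1.1]])
K = (X @ X.T + 1.0) ** 2
Kn = normalize_gram(K)
print(np.allclose(np.diag(Kn), 1.0))  # the normalized kernel has unit diagonal
```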
If Kµ is a linear combination of kernels then its feature space Hµ is given by
the mapping:
x → \Phi_\mu(x) = (\sqrt{\mu_1}\,\Phi_1(x), ..., \sqrt{\mu_M}\,\Phi_M(x))^T \in H_\mu    (4)
where Φi (x) is the mapping to the Hi feature space associated with the Ki
kernel, as this was given in Formula 1.
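For kernels with explicit feature maps, the correspondence between Formula 2 and Formula 4 can be verified directly. A small numpy sketch of ours, using a linear and a homogeneous quadratic kernel as the two basic kernels:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))  # five 2-D input points

def phi1(x):   # explicit map of the linear kernel: Phi_1(x) = x
    return x

def phi2(x):   # explicit map of the homogeneous quadratic kernel (x.x')^2
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

mu = np.array([0.3, 0.7])

# Combined kernel as in Formula 2.
K1 = X @ X.T
K2 = (X @ X.T) ** 2
K_mu = mu[0] * K1 + mu[1] * K2

# Combined feature map as in Formula 4: concatenate sqrt(mu_i)-scaled maps.
F = np.array([np.concatenate([np.sqrt(mu[0]) * phi1(x),
                              np.sqrt(mu[1]) * phi2(x)]) for x in X])
print(np.allclose(F @ F.T, K_mu))  # True: the two constructions agree
```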
In previous work within the MKL context the cost function, Q, has taken
different forms, such as the Kernel Target Alignment, which measures the "goodness" of a kernel for a given learning task [9], or the typical SVM cost function
combining classification error and the margin [3,5,6], or as in [4] any of the above
with an added regularization term for the complexity of the combined kernel.
3 Margin and Radius Based Error Bounds
There are a number of theorems in statistical learning that bound the expected
classification error of the thresholded linear classifier, that corresponds to the
maximum margin hyperplane, by quantities that are related to the margin and
the radius of the smallest sphere that encloses the data. Below we give two of
them that are applicable to linearly separable and non-separable training sets,
respectively.
Theorem 1 ([10]). Given a training set S = {(x_1, y_1), ..., (x_l, y_l)} of size l, a feature space H and a hyperplane (w, b), the margin γ(w, b, S) and the radius R(S) are defined by

\gamma(w, b, S) = \min_{(x_i, y_i) \in S} \frac{y_i(\langle w, \Phi(x_i)\rangle + b)}{\|w\|}

R(S) = \min_{a} \max_{i} \|\Phi(x_i) - a\|

The maximum margin algorithm L_l : (X × Y)^l → H × R takes as input a training set of size l and returns a hyperplane in feature space such that the margin γ(w, b, S) is maximized. Note that assuming the training set is separable means that γ > 0. Under this assumption, for all probability measures P underlying the data S, the expectation of the misclassification probability

p_{err}(w, b) = P(\mathrm{sign}(\langle w, \Phi(X)\rangle + b) \ne Y)

has the bound

E\{p_{err}(L_{l-1}(Z))\} \le \frac{1}{l} E\left\{\frac{R^2(Z)}{\gamma^2(L_l(Z), Z)}\right\}

The expectation is taken over the random draw of a training set Z of size l − 1 for the left hand side and of size l for the right hand side.
The following theorem gives a similar result for the error bound in the linearly non-separable case.

Theorem 2 ([13]). Consider thresholding real-valued linear functions L with unit weight vectors on an inner product space H and fix γ ∈ R⁺. There is a constant c such that, for any probability distribution D on H × {−1, 1} with support in a ball of radius R around the origin, with probability 1 − δ over l random examples S, any hypothesis f ∈ L has error no more than:

err(f)_D \le \frac{c}{l}\left(\frac{R^2 + \|\xi\|_2^2}{\gamma^2}\log^2 l + \log\frac{1}{\delta}\right)    (5)

where ξ = ξ(f, S, γ) is the margin slack vector with respect to f and γ.
It is clear from both theorems that the bound on the expected error depends not
only on the margin but also on the radius of the data, being a function of the
R2 /γ 2 ratio. Nevertheless standard SVM algorithms can ignore the dependency
of the error bound on the radius because for a fixed feature space the radius
is constant and can be simply ignored in the optimization procedure. However
in the MKL scenario where the Hµ feature space is not fixed but depends on
the parameter vector µ the radius is no longer fixed but it is a function of µ
and thus should not be ignored in the optimization procedure. The radius of the
smallest sphere that contains all instances in the H feature space defined by the
Φ(x) mapping is computed by the following formula [14]:
\min_{R, \Phi(x_0)} R^2    (6)

s.t.\ \|\Phi(x_i) - \Phi(x_0)\|^2 \le R^2,\ \forall i
It can be shown that if Kµ is a linear combination of kernels, of the form given
in Formula 2, then for the Rµ radius of its Hµ feature space the following
inequalities hold:
\max_i(\mu_i R_i^2) \le R_\mu^2 \le \sum_{i=1}^{M} \mu_i R_i^2 \le \max_i(R_i^2), \quad s.t.\ \sum_{i=1}^{M} \mu_i = 1    (7)
where Ri is the radius of the component feature space Hi associated with the
Ki kernel. The proof of the above statement is given in the appendix.
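Both the radius (via the dual enclosing-ball problem, stated in the appendix) and the inequalities (7) can be checked numerically. The following sketch is ours and uses a generic scipy solver rather than a dedicated QP code:

```python
import numpy as np
from scipy.optimize import minimize

def radius2(K):
    """Squared radius of the smallest enclosing ball in feature space:
    max over the simplex of  sum_i beta_i K_ii - beta^T K beta."""
    l = K.shape[0]
    obj = lambda b: -(b @ np.diag(K) - b @ K @ b)  # minimize the negative
    cons = [{"type": "eq", "fun": lambda b: b.sum() - 1.0}]
    res = minimize(obj, np.ones(l) / l, bounds=[(0.0, 1.0)] * l,
                   constraints=cons, method="SLSQP")
    return -res.fun

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))
G = X @ X.T
# Two normalized basic kernels (linear and quadratic), as in equation (3).
Ks = []
for K in (G, G ** 2):
    d = np.sqrt(np.diag(K))
    Ks.append(K / np.outer(d, d))
mu = np.array([0.4, 0.6])
K_mu = mu[0] * Ks[0] + mu[1] * Ks[1]

R2_mu = radius2(K_mu)
R2_k = np.array([radius2(K) for K in Ks])
lower, upper = np.max(mu * R2_k), mu @ R2_k
print(f"max(mu_i R_i^2)={lower:.4f}  R_mu^2={R2_mu:.4f}  sum mu_i R_i^2={upper:.4f}")
```

On this example the sandwich max_i(µ_i R_i²) ≤ R_µ² ≤ Σ_i µ_i R_i² holds, as inequality (7) predicts.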
4 MKL with Margin and Radius Optimization
In the next sections we will show how we can make direct use of the dependency
of the error bound both on the margin and the radius in the context of the MKL
problem, in an effort to decrease the error bound further than is possible by optimizing over the margin alone.
4.1 Soft Margin MKL
The standard l2-soft margin SVM is based on Theorem 2: it learns maximal margin hyperplanes while controlling the l2 norm of the slack vector, in an effort to optimize the error bound given in equation 5; as already mentioned, although the radius appears in the error bound it is not considered in the optimization problem, because for a given feature space it is fixed. The exact optimization problem solved by the l2-soft margin SVM is [13]:

\min_{w,b,\xi} \frac{1}{2}\langle w, w\rangle + \frac{C}{2}\sum_{i=1}^{l}\xi_i^2    (8)

s.t.\ y_i(\langle w, \Phi(x_i)\rangle + b) \ge 1 - \xi_i,\ \forall i

The solution hyperplane (w*, b*) of this problem realizes the maximum margin classifier with geometric margin γ = 1/‖w*‖.
When instead of a single kernel we learn with a combination of kernels Kµ
then the radius of the resulting feature space Hµ depends on the parameters
µ which are also learned. We can profit from this additional dependency and
optimize not only for the margin but also for the radius, as Theorems 1 and 2
suggest, in the hope of reducing even more the error bounds than what would
be possible by just focusing on the margin.
A straightforward way to do so is to alter the cost function of the above
optimization problem so that it also includes the radius. Thus we define the primal form of the soft margin MKL optimization problem as follows:

\min_{w,b,\xi,\mu} \frac{1}{2}\langle w, w\rangle R_\mu^2 + \frac{C}{2}\sum_{i=1}^{l}\xi_i^2    (9)

s.t.\ y_i(\langle w, \Phi(x_i)\rangle + b) \ge 1 - \xi_i,\ \forall i
Accounting for the form Φµ of the feature space Hµ , as it is given in equation 4,
this optimization problem can be rewritten as:
\min_{w,b,\xi,\mu} \frac{1}{2}\sum_{k=1}^{M}\langle w_k, w_k\rangle R_\mu^2 + \frac{C}{2}\sum_{i=1}^{l}\xi_i^2    (10)

s.t.\ y_i\Big(\sum_{k=1}^{M}\langle w_k, \sqrt{\mu_k}\,\Phi_k(x_i)\rangle + b\Big) \ge 1 - \xi_i,\ \forall i
where w is the same as that of Formula 9, i.e. equal to (w_1, ..., w_M), and R_µ² can be computed by equation 6. By letting w := √µ.w, i.e. w_k := √µ_k w_k, we can rewrite equation 10 as¹:

\min_{w,b,\xi,\mu} \frac{1}{2}\sum_{k=1}^{M}\frac{\langle w_k, w_k\rangle}{\mu_k} R_\mu^2 + \frac{C}{2}\sum_{i=1}^{l}\xi_i^2    (11)

s.t.\ y_i\Big(\sum_{k=1}^{M}\langle w_k, \Phi_k(x_i)\rangle + b\Big) \ge 1 - \xi_i, \quad \sum_{k=1}^{M}\mu_k = 1,\ \mu_k \ge 0,\ \forall k

The non-negativity of µ is required to guarantee that the kernel combination is a valid kernel function; the constraint Σ_{k=1}^{M} µ_k = 1 is added to make the solution interpretable (a kernel with a bigger weight can be interpreted as a more important one) and to obtain a specific solution (note that if µ is a solution of 11 without the constraint Σ_{k=1}^{M} µ_k = 1, then λµ, λ ∈ R⁺, is also a solution).
We will denote the cost function of equation 11 by F(w, b, ξ, µ). F is not a convex function; this is probably the main reason why in current MKL algorithms the radius is simply removed from the original cost function, and therefore they do not really optimize the generalization error bound.
From the set of inequalities given in equation 7 we have R_\mu^2 \le \sum_{k=1}^{M}\mu_k R_k^2, and from this we can get:

F(w, b, \xi, \mu) = \frac{1}{2}\sum_{k=1}^{M}\frac{\langle w_k, w_k\rangle}{\mu_k} R_\mu^2 + \frac{C}{2}\sum_{i=1}^{l}\xi_i^2
\le \frac{1}{2}\sum_{k=1}^{M}\frac{\langle w_k, w_k\rangle}{\mu_k}\sum_{k=1}^{M}\mu_k R_k^2 + \frac{C}{2}\sum_{i=1}^{l}\xi_i^2    (12)
= \Big(\frac{1}{2}\sum_{k=1}^{M}\frac{\langle w_k, w_k\rangle}{\mu_k} + \frac{C}{2\sum_{k=1}^{M}\mu_k R_k^2}\sum_{i=1}^{l}\xi_i^2\Big)\sum_{k=1}^{M}\mu_k R_k^2
\le \frac{1}{2}\sum_{k=1}^{M}\frac{\langle w_k, w_k\rangle}{\mu_k} + \frac{C}{2\sum_{k=1}^{M}\mu_k R_k^2}\sum_{i=1}^{l}\xi_i^2
= \hat{F}(w, b, \xi, \mu)

The last inequality holds because, in the context we examine, Σ_{k=1}^{M} µ_k R_k² ≤ 1. This is a result of the fact that we work with normalized feature spaces, using the normalized kernels as defined in equation 3; thus we have R_k² ≤ 1 and, since Σ_{k=1}^{M} µ_k = 1, it holds that Σ_{k=1}^{M} µ_k R_k² ≤ 1. Since F̂ is an upper bound of F and moreover it is convex², we are going to use it as our objective function. As a result, we propose to solve, instead of the original soft margin optimization problem given in equation 11, the following upper bounding convex optimization problem:
¹ Note that if µ_k = 0 then from the dual form we have w_k = 0. In this case, we use the convention that 0/0 = 0.
² The convexity of this new function can be easily proved by showing that the Hessian matrix is positive semi-definite.
\min_{w,b,\xi,\mu} \frac{1}{2}\sum_{k=1}^{M}\frac{\langle w_k, w_k\rangle}{\mu_k} + \frac{C}{2\sum_{k=1}^{M}\mu_k R_k^2}\sum_{i=1}^{l}\xi_i^2    (13)

s.t.\ y_i\Big(\sum_{k=1}^{M}\langle w_k, \Phi_k(x_i)\rangle + b\Big) \ge 1 - \xi_i, \quad \sum_{k=1}^{M}\mu_k = 1,\ \mu_k \ge 0,\ \forall k
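The convexification rests on the chain of inequalities (12): a rewriting identity followed by dropping the factor Σ_k µ_k R_k² ≤ 1. Both steps can be sanity-checked numerically with randomly chosen quantities (a purely illustrative sketch of ours):

```python
import numpy as np

rng = np.random.default_rng(2)
M, l, C = 4, 6, 10.0
mu = rng.dirichlet(np.ones(M))            # simplex weights
w_norm2 = rng.uniform(0.1, 2.0, size=M)   # stands in for <w_k, w_k>
xi2 = rng.uniform(0.0, 1.0, size=l)       # stands in for xi_i^2
R2 = rng.uniform(0.1, 1.0, size=M)        # R_k^2 <= 1, as for normalized kernels

A = 0.5 * np.sum(w_norm2 / mu)            # the margin part of the objective
S = mu @ R2                               # sum_k mu_k R_k^2 (here <= 1)
middle = (A + C / (2 * S) * xi2.sum()) * S
F_hat = A + C / (2 * S) * xi2.sum()

# The rewriting in (12) is an exact identity...
print(np.isclose(A * S + 0.5 * C * xi2.sum(), middle))  # True
# ...and dropping the factor S <= 1 can only increase the value.
print(middle <= F_hat)  # True
```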
The dual function of this optimization problem is:
W_s(\alpha, \mu) = -\frac{1}{2}\sum_{ij}^{l}\alpha_i\alpha_j y_i y_j \sum_{k=1}^{M}\mu_k K_k(x_i, x_j) + \sum_{i=1}^{l}\alpha_i - \frac{\sum_{k=1}^{M}\mu_k R_k^2}{2C}\langle\alpha, \alpha\rangle    (14)
= -\frac{1}{2}\sum_{ij}^{l}\alpha_i\alpha_j y_i y_j \Big(\sum_{k=1}^{M}\mu_k K_k(x_i, x_j) + \frac{\sum_{k=1}^{M}\mu_k R_k^2}{C}\delta_{ij}\Big) + \sum_{i=1}^{l}\alpha_i
where δij is the Kronecker δ defined to be 1 if i = j and 0 otherwise. The dual
optimization problem is given as:
\max_{\alpha,\mu} W_s(\alpha, \mu)    (15)

s.t.\ \sum_{i}\alpha_i y_i = 0, \quad \alpha_i \ge 0,\ \forall i

\sum_{ij}\alpha_i\alpha_j y_i y_j K_k(x_i, x_j) = \frac{R_k^2\, C \sum_{i}^{l}\xi_i^2}{(\sum_k \mu_k R_k^2)^2},\ \forall k
In the next section we will show how we can solve this new optimization problem.
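The equality of the two expressions for the dual function (14), the explicit ⟨α, α⟩ term versus the Kronecker-δ form folded into the quadratic, can be checked numerically (an illustrative sketch of ours):

```python
import numpy as np

rng = np.random.default_rng(3)
l, M, C = 7, 3, 1.0
y = rng.choice([-1.0, 1.0], size=l)
alpha = rng.uniform(0.0, 1.0, size=l)
mu = rng.dirichlet(np.ones(M))
R2 = rng.uniform(0.1, 1.0, size=M)
X = rng.normal(size=(l, 2))
Ks = np.stack([(X @ X.T + 1.0) ** d for d in (1, 2, 3)])  # three basic kernels
K_mu = np.tensordot(mu, Ks, axes=1)
s = mu @ R2

ay = alpha * y
form1 = -0.5 * ay @ K_mu @ ay + alpha.sum() - s / (2 * C) * (alpha @ alpha)
form2 = -0.5 * ay @ (K_mu + (s / C) * np.eye(l)) @ ay + alpha.sum()
print(np.isclose(form1, form2))  # True: y_i^2 = 1 makes the delta term <alpha, alpha>
```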
4.2 Algorithm
The dual function 14 is quadratic with respect to α and linear with respect to
µ. One way to solve the optimization problem 13 is to use a two step iterative
algorithm such as the ones described in [6], [10]. Following such a two step
approach, in the first step we will solve a quadratic problem that optimizes over
(w, b), while keeping µ fixed; as a consequence the resulting dual function is a
simple quadratic function of α which can be optimized easily. In the second step
we will solve a linear problem that optimizes over µ.
More precisely, the formulation of the optimization problem with the two-step
approach takes the following form:
\min_{\mu} J(\mu)    (16)

s.t.\ \sum_{k=1}^{M}\mu_k = 1,\ \mu_k \ge 0,\ \forall k

where

J(\mu) = \begin{cases} \min_{w,b} \frac{1}{2}\sum_{k=1}^{M}\frac{\langle w_k, w_k\rangle}{\mu_k} + \frac{C}{2\sum_{k=1}^{M}\mu_k R_k^2}\sum_{i=1}^{l}\xi_i^2 \\ s.t.\ y_i\big(\sum_{k=1}^{M}\langle w_k, \Phi_k(x_i)\rangle + b\big) \ge 1 - \xi_i \end{cases}    (17)
To solve the outer optimization problem, i.e. min_µ J(µ), we use a gradient descent method. At each iteration, we fix µ, compute the value of J(µ) and then compute
the gradient of J(µ) with respect to µ. The dual function of Formula 17 is the
Ws (α, µ) function already given in Formula 14. Since µ is fixed we now optimize
only over α (the resulting dual optimization problem is much simpler compared
to the original soft margin dual optimization problem given in Formula 15):
\max_{\alpha} W_s(\alpha, \mu)

s.t.\ \sum_{i}\alpha_i y_i = 0, \quad \alpha_i \ge 0,\ \forall i

which has the same form as the SVM quadratic optimization problem; the only difference is that the C parameter here is equal to C / \sum_{k=1}^{M}\mu_k R_k^2.
By strong duality, at the optimal solution α*, the values of the dual cost function and the primal cost function are equal. Thus the value of W_s(α, µ), and hence the J(µ) value, is given by:

W_s(\alpha^*, \mu) = -\frac{1}{2}\sum_{ij}^{l}\alpha_i^*\alpha_j^* y_i y_j \Big(\sum_{k=1}^{M}\mu_k K_k(x_i, x_j) + \frac{\sum_{k=1}^{M}\mu_k R_k^2}{C}\delta_{ij}\Big) + \sum_{i=1}^{l}\alpha_i^*
The last step of the algorithm is to compute the gradient of the J(µ) function, Formula 17, with respect to µ. As [6] have pointed out, we can use the theorem of Bonnans and Shapiro [15] to compute gradients of such functions. Hence, the gradient has the following form:

\frac{\partial J(\mu)}{\partial \mu_k} = -\frac{1}{2}\sum_{ij}^{l}\alpha_i^*\alpha_j^* y_i y_j \Big(K_k(x_i, x_j) + \frac{R_k^2}{C}\delta_{ij}\Big)
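Since W_s is linear in µ for fixed α, the gradient formula can be verified exactly against a central finite difference; a small sketch of ours:

```python
import numpy as np

rng = np.random.default_rng(4)
l, M, C = 6, 3, 1.0
y = rng.choice([-1.0, 1.0], size=l)
alpha = rng.uniform(0.0, 1.0, size=l)
X = rng.normal(size=(l, 2))
Ks = np.stack([(X @ X.T + 1.0) ** d for d in (1, 2, 3)])
R2 = rng.uniform(0.1, 1.0, size=M)

def Ws(mu):
    # Dual function (14) at fixed alpha, in the Kronecker-delta form.
    K_eff = np.tensordot(mu, Ks, axes=1) + (mu @ R2 / C) * np.eye(l)
    ay = alpha * y
    return -0.5 * ay @ K_eff @ ay + alpha.sum()

mu = rng.dirichlet(np.ones(M))
ay = alpha * y
grad = np.array([-0.5 * ay @ (Ks[k] + (R2[k] / C) * np.eye(l)) @ ay
                 for k in range(M)])

h = 1e-6
fd = np.zeros(M)
for k in range(M):
    e = np.zeros(M); e[k] = h
    fd[k] = (Ws(mu + e) - Ws(mu - e)) / (2 * h)
print(np.allclose(grad, fd))  # the analytic gradient matches the finite difference
```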
To compute the optimal step in the gradient descent we used line search. The
complete two-step procedure is given in Algorithm 1.
Algorithm 1. R-MKL
  Initialize µ_k^1 = 1/M for k = 1, ..., M
  repeat
    Set R_µ² = Σ_{k=1}^{M} µ_k^t R_k²
    Compute J(µ^t) as the solution of a quadratic optimization problem with K := Σ_{k=1}^{M} µ_k^t K_k
    Compute ∂J/∂µ_k for k = 1, ..., M
    Compute the optimal step γ_t
    µ^{t+1} ← µ^t + γ_t ∂J(µ)/∂µ
  until stopCriteria is true
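A compact, purely illustrative version of the two-step procedure follows. It is not the authors' implementation: it uses a generic scipy solver in place of a dedicated SVM QP, a fixed gradient step with a simplex projection instead of the line search, and a cheap centroid-based surrogate for the radii R_k² instead of solving (6) exactly:

```python
import numpy as np
from scipy.optimize import minimize

def centroid_radius2(K):
    # Squared distance of the farthest point from the feature-space centroid;
    # a cheap stand-in for the exact minimal-enclosing-ball radius of (6).
    return np.max(np.diag(K) - 2 * K.mean(axis=1) + K.mean())

def svm_dual(K_eff, y):
    # l2-soft-margin SVM dual: max sum(a) - 1/2 (a*y)^T K_eff (a*y),
    # s.t. a >= 0, sum_i a_i y_i = 0 (solved generically with SLSQP).
    l = len(y)
    Q = np.outer(y, y) * K_eff
    res = minimize(lambda a: 0.5 * a @ Q @ a - a.sum(),
                   np.ones(l) / l, jac=lambda a: Q @ a - 1.0,
                   bounds=[(0.0, None)] * l,
                   constraints=[{"type": "eq", "fun": lambda a: a @ y}],
                   method="SLSQP")
    return res.x, -res.fun

def project_simplex(v):
    # Euclidean projection onto {mu : mu >= 0, sum mu = 1}.
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(v - theta, 0.0)

def r_mkl(Ks, y, C=1.0, steps=5, lr=0.1):
    M, l = len(Ks), len(y)
    R2 = np.array([centroid_radius2(K) for K in Ks])
    mu = np.ones(M) / M
    for _ in range(steps):
        s = mu @ R2
        K_eff = np.tensordot(mu, Ks, axes=1) + (s / C) * np.eye(l)
        a, J = svm_dual(K_eff, y)
        ay = a * y
        grad = np.array([-0.5 * ay @ (Ks[k] + (R2[k] / C) * np.eye(l)) @ ay
                         for k in range(M)])
        mu = project_simplex(mu - lr * grad)  # descend, stay on the simplex
    return mu, J

rng = np.random.default_rng(5)
X = np.r_[rng.normal(-1, 0.5, (6, 2)), rng.normal(1, 0.5, (6, 2))]
y = np.r_[-np.ones(6), np.ones(6)]
G = X @ X.T + 1.0
Ks = np.stack([K / np.outer(np.sqrt(np.diag(K)), np.sqrt(np.diag(K)))
               for K in (G, G ** 2)])
mu, J = r_mkl(Ks, y)
print(mu, J)
```

The projection step replaces the paper's line search only for brevity; the gradient itself is the one given above.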
4.3 Computational Complexity
At each iteration we have to compute the solution of a standard SVM with kernel K = Σ_{k=1}^{M} µ_k K_k and C equal to C / Σ_{k=1}^{M} µ_k R_k², which is a quadratic programming problem with a complexity of O(n³), where n is the number of instances. Moreover, when µ is updated we have to recompute the approximation of R_µ²; the complexity of this procedure is linear in the number of kernels, O(M).
5 Experiments
We experimented with ten different datasets. Six of them were taken from the
UCI repository (Ionosphere, Liver, Sonar, Wdbc, Wpbc, Musk1), while four come
from the domain of genomics and proteomics (ColonCancer, CentralNervousSystem, FemaleVsMale, Leukemia) [16]; these four are characterized by small sample sizes and high dimensionality. A short description of the datasets is given in Table 1. We experimented with two different types of basic kernels, i.e. polynomial and Gaussian, and performed two sets of experiments. In the first set of experiments we used both types of kernels and in the second one we focused only on Gaussian kernels. For each set of experiments the total number of basic kernels was 20; for the first set we used polynomial kernels of degree one, two, and three and 17 Gaussians with bandwidth δ that ranged from 1 to 17 with a step of one; for the second set of experiments we only used Gaussian kernels with bandwidth δ that ranged from 1 to 20 with a step of one.
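The bank of 20 basic kernels for the first set of experiments can be generated along the following lines (a sketch of ours; the Gaussian is parametrized here as exp(−‖x − x′‖²/(2δ²)), which is one plausible reading of "bandwidth δ"):

```python
import numpy as np

def kernel_bank(X):
    """3 polynomial kernels (degree 1..3) + 17 Gaussians (delta = 1..17),
    each normalized as in equation (3)."""
    G = X @ X.T
    sq = np.diag(G)[:, None] + np.diag(G)[None, :] - 2 * G  # squared distances
    Ks = [(G + 1.0) ** d for d in (1, 2, 3)]
    Ks += [np.exp(-sq / (2.0 * delta ** 2)) for delta in range(1, 18)]
    out = []
    for K in Ks:
        d = np.sqrt(np.diag(K))
        out.append(K / np.outer(d, d))
    return out

X = np.random.default_rng(6).normal(size=(10, 4))
bank = kernel_bank(X)
print(len(bank))  # 20 basic kernels
print(all(np.allclose(np.diag(K), 1.0) for K in bank))  # all normalized
```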
We compared our MKL algorithm (denoted R-MKL) with two state-of-the-art MKL algorithms: the Support Kernel Machine (SKM) [7] and SimpleMKL [6]. We estimated the classification error using 10-fold cross-validation. For comparison purposes we also provide the performances of the best single kernel (BK) and the majority classifier (MC); the latter always predicts the majority class. The performance of BK is that of the best single kernel, also estimated by 10-fold cross-validation; since it is the best result after seeing the performance of all individual kernels on the available data, it is optimistically biased. We tuned the parameter C with an inner-loop 10-fold cross-validation, choosing its value from the set {0.1, 1, 10, 100}. All algorithms terminate when the duality gap is smaller than 0.01. All input kernel matrices are normalized by equation 3.
Table 1. Short description of the classification datasets used
Dataset         #Inst  #Attr  #Class1  #Class2
Ionosphere        351     34      126      225
Liver             345      6      145      200
Sonar             208     60       97      111
Wdbc              569     32      357      212
Wpbc              198     34      151       47
Musk1             476    166      269      207
ColonCancer        62   2000       40       22
CentralNervous     60   7129       21       39
FemaleVSMale      134   1524       67       67
Leukemia           72   7128       25       47
We assessed the significance of the performance differences of the algorithms with McNemar's test [17], with the level of significance set to 0.05. We also established a ranking schema for the examined MKL algorithms based on the results of the pairwise comparisons [18]. More precisely, if an algorithm is significantly better than another it is credited with one point; if there is no significant difference between two algorithms they are each credited with 0.5 points; finally, if an algorithm is significantly worse than another it is credited with zero points. Thus, if an algorithm is significantly better than all the others for a given dataset it has a score of two. We give the full results in Tables 2 and 3. For each algorithm we report a triplet in which the first element is the estimated classification error, the second is the number of selected kernels, and the last is the rank described above.
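The ranking schema can be implemented as a small function; `outcome` below is a hypothetical helper that reports, for an ordered pair of algorithms on one dataset, +1 (first significantly better), −1 (first significantly worse) or 0 (no significant difference):

```python
def significance_points(names, outcome):
    """Credit 1 / 0.5 / 0 points per pairwise comparison, as described above."""
    points = {n: 0.0 for n in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            o = outcome(a, b)
            if o > 0:
                points[a] += 1.0
            elif o < 0:
                points[b] += 1.0
            else:
                points[a] += 0.5
                points[b] += 0.5
    return points

# Toy example: R-MKL significantly better than both others, SKM and Simple tied.
results = {("R-MKL", "SKM"): 1, ("R-MKL", "Simple"): 1, ("SKM", "Simple"): 0}
pts = significance_points(["R-MKL", "SKM", "Simple"],
                          lambda a, b: results[(a, b)])
print(pts)  # R-MKL scores 2, the tied pair 0.5 each
```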
Our kernel combination algorithm does remarkably well in the first set of
experiments, Table 2, in which it is significantly better than both other algorithms in four datasets and significantly worse in two; for the four remaining
datasets there are no significant differences. Note that in the cases of Wpbc and
Table 2. Results for the first experiments, where both polynomial and Gaussian kernels
are used. Each triplet x,y,z gives respectively the classification error, the number of
selected kernels, and the number of significance points that the algorithm scores for the
given experiment set and dataset. Columns BK and MC give the errors of the best
single kernel and the majority classifier, respectively.
D. Set    SKM           Simple        R-MKL        BK     MC
Ionos.    04.00,02,1.5  03.71,02,1.5  04.86,02,0   05.71  36.00
Liver     33.82,05,1.5  33.53,13,1.5  36.18,03,0   30.29  42.06
Sonar     15.50,01,1.0  15.50,01,1.0  15.50,01,1   17.50  46.00
Wdbc      11.25,03,1.0  13.04,18,0.0  03.75,18,2   08.57  37.32
Wpbc      23.68,17,1.0  23.68,01,1.0  23.68,01,1   23.68  23.68
Musk1     11.70,07,1.0  13.40,18,0.0  06.60,01,2   04.47  43.83
Colon.    18.33,18,1.0  18.33,18,1.0  16.67,18,1   11.67  35.00
CentNe.   35.00,17,1.0  35.00,17,1.0  35.00,17,1   31.67  35.00
Female.   33.85,20,1.0  38.92,18,0.0  20.00,18,2   22.31  60.00
Leuke.    07.14,18,0.5  07.14,18,0.5  02.86,18,2   02.86  34.29
Table 3. Results for the second set of experiments, where only Gaussian kernels are
used. The table contains the same information as the previous one.
D. Set    SKM           Simple        R-MKL         BK     MC
Ionos.    04.86,02,1.0  05.43,04,1.0  05.14,03,1.0  05.14  36.00
Liver     33.53,03,1.0  33.53,20,1.0  33.53,16,1.0  34.71  42.06
Sonar     15.50,01,1.0  15.50,01,1.0  15.50,01,1.0  17.00  46.00
Wdbc      37.32,03,1.0  37.32,19,1.0  37.32,20,1.0  37.32  37.32
Wpbc      23.68,20,1.0  23.68,19,1.0  23.68,20,1.0  23.68  23.68
Musk1     43.83,14,1.0  43.83,19,1.0  43.83,20,1.0  43.83  43.83
Colon.    NA            35.00,20,1.5  35.00,20,1.5  35.00  35.00
CentNe.   35.00,20,1.0  35.00,20,1.0  35.00,20,1.0  35.00  35.00
Female.   60.00,20,1.0  60.00,20,1.0  60.00,20,1.0  60.00  60.00
Leuke.    34.29,20,1.0  34.29,20,1.0  34.29,20,1.0  34.29  34.29
CentralNervousSystem all algorithms have a performance that is similar to that of the majority classifier, i.e. the learned models do not have any discriminatory power. By examining the classification performances of the individual kernels on these datasets we see that none of them had a performance better than that of the majority classifier; this could explain the bad behavior of the different kernel combination schemata. Overall, for this set of experiments R-MKL gets 12 significance points over the different datasets, SKM 10.5, and SimpleMKL 7.5. The performance improvements of R-MKL over the two other methods are quite impressive on those datasets on which R-MKL performs well; more precisely, its classification error is around 30%, 50%, and 40% of that of the other algorithms for the Wdbc, Musk1, and Leukemia datasets, respectively.
In the second set of experiments, Table 3, all methods perform very poorly on seven out of ten datasets; their classification performance is similar to that of the majority classifier. On the remaining datasets, with the exception of ColonCancer for which SKM failed (we used the implementation provided by the authors of the algorithm and it returned with errors), there is no significant difference between the three algorithms. The collectively bad performance on the last seven datasets is explained by the fact that none of the basic kernels had a classification error better than that of the majority classifier. Overall, for this set of experiments SKM scores 9 points, and SimpleMKL and R-MKL score 10.5 points each.
Comparing the number of selected kernels by the different kernel combination
methods using a paired t-test (significance level of 0.05) revealed no statistically
significant differences between the three algorithms on both sets of experiments.
In an effort to get an empirical estimate of the quality of the approximation of the radius that we used to make the optimization problems convex, we computed the approximation error defined as (Σ_{k=1}^{M} µ_k R_k² − R_µ²) / R_µ². We computed this error over the different folds of the ten-fold cross-validation for each dataset. The average approximation error over the different datasets was 0.0056. We also computed this error over 1000 random values of µ for each dataset and the average error was 0.0104. Thus, the empirical evidence seems to indicate that the R_µ² ≤ Σ_{k=1}^{M} µ_k R_k² bound is relatively tight, at least for the datasets we examined.
6 Conclusion and Future Work
In this paper we presented a new kernel combination method that incorporates
in its cost function not only the margin but also the radius of the smallest sphere
that encloses the data. This idea is a direct implementation of well known error
bounds from statistical learning theory. To the best of our knowledge this is
the first work in which the radius is used together with the margin in an effort
to minimize the generalization error. Even though the resulting optimization
problems were non-convex and we had to use an upper bound on the radius to
get convex forms, the empirical results were quite encouraging. In particular, our
method competed with other state-of-the-art methods for kernel combination,
thus demonstrating the benefit and the potential of the proposed technique.
Finally, we mention that it is still a challenging research direction to fully exploit
the examined generalization bound.
In future work we would like to examine optimization techniques for directly solving the non-convex optimization problem presented in Formula 11. In particular, we will examine whether it is possible to decompose the cost function as a sum of convex and concave functions, or to represent it as d.m. functions (difference of two monotonic functions) [19,20]. Additionally, we plan to analyze the bound R_µ² ≤ Σ_{k=1}^{M} µ_k R_k² and see how it relates to the real optimal value.
Acknowledgments. The work reported in this paper was partially funded by
the European Commission through EU projects DropTop (FP6-037739), DebugIT (FP7-217139) and e-LICO (FP7-231519). The support of the Swiss NSF
(Grant 200021-122283/1) is also gratefully acknowledged.
References
1. Shawe-Taylor, J., Cristianini, N.: Kernel methods for pattern analysis. Cambridge
University Press, Cambridge (2004)
2. Schölkopf, B., Smola, A.J.: Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge (2001)
3. Lanckriet, G., Cristianini, N., Bartlett, P., Ghaoui, L.E.: Learning the kernel matrix
with semidefinite programming. Journal of Machine Learning Research 5, 27–72
(2004)
4. Ong, C.S., Smola, A.J., Williamson, R.C.: Learning the kernel with hyperkernels.
Journal of Machine Learning Research 6, 1043–1071 (2005)
5. Sonnenburg, S., Rätsch, G., Schäfer, C.: A general and efficient multiple kernel
learning algorithm. Journal of Machine Learning Research 7, 1531–1565 (2006)
6. Bach, F., Rakotomamonjy, A., Canu, S., Grandvalet, Y.: SimpleMKL. Journal of
Machine Learning Research (2008)
342
H. Do et al.
7. Bach, F.R., Lanckriet, G.R.G., Jordan, M.I.: Multiple kernel learning, conic duality, and the SMO algorithm. In: ICML 2004: Proceedings of the Twenty-First International Conference on Machine Learning, p. 6. ACM, New York (2004)
8. Lanckriet, G., Bie, T.D., Cristianini, N.: A statistical framework for genomic data
fusion. Bioinformatics 20 (2004)
9. Cristianini, N., Shawe-Taylor, J., Elisseeff, A.: On kernel-target alignment. Journal
of Machine Learning Research (2002)
10. Chapelle, O., Vapnik, V., Bousquet, O., Mukherjee, S.: Choosing multiple parameters for support vector machines. Machine Learning 46(1-3), 131–159 (2002)
11. Crammer, K., Keshet, J., Singer, Y.: Kernel design using boosting. In: Advances
in Neural Information Processing Systems, vol. 14. MIT Press, Cambridge (2002)
12. Bousquet, O., Herrmann, D.: On the complexity of learning the kernel matrix. In:
Advances in Neural Information Processing Systems, vol. 14. MIT Press, Cambridge (2003)
13. Cristianini, N., Shawe-Taylor, J.: An introduction to Support Vector Machines.
Cambridge University Press, Cambridge (2000)
14. Vapnik, V.: Statistical learning theory. Wiley Interscience, Hoboken (1998)
15. Bonnans, J., Shapiro, A.: Optimization problems with perturbation: A guided tour.
SIAM Review 40(2), 202–227 (1998)
16. Kalousis, A., Prados, J., Hilario, M.: Stability of feature selection algorithms: a
study on high dimensional spaces. Knowledge and Information Systems 12(1), 95–
116 (2007)
17. McNemar, Q.: Note on the sampling error of the difference between correlated
proportions or percentages. Psychometrika 12, 153–157 (1947)
18. Kalousis, A., Theoharis, T.: Noemon: Design, implementation and performance
results for an intelligent assistant for classifier selection. Intelligent Data Analysis
Journal 3, 319–337 (1999)
19. Liberti, L., Maculan, N. (eds.): Global Optimization: From Theory to Implementation. Springer, Heidelberg (2006)
20. Collobert, R., Weston, J., Bottou, L.: Trading convexity for scalability. In: Proceedings of the 23rd International Conference on Machine Learning (2006)
21. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)
Appendix

Proof of Inequality 7. If K(x, x') is the kernel function associated with the Φ(x) mapping, then the computation of the radius in the dual form is given in [1]:

R^2 = \max_{\beta} \sum_{i=1}^{l}\beta_i K(x_i, x_i) - \sum_{ij}^{l}\beta_i\beta_j K(x_i, x_j)    (18)

s.t.\ \sum_{i=1}^{l}\beta_i = 1,\ \beta_i \ge 0

If β* is the optimal solution of (18) when K = K_µ = Σ_{k=1}^{M} µ_k K_k, and β̂^k is the optimal solution of (18) when K = K_k, i.e.:

R_\mu^2 = \sum_{k=1}^{M}\mu_k\Big(\sum_{i=1}^{l}\beta_i^* K_k(x_i, x_i) - \sum_{i,j=1}^{l}\beta_i^*\beta_j^* K_k(x_i, x_j)\Big)

R_k^2 = \sum_{i=1}^{l}\hat\beta^k_i K_k(x_i, x_i) - \sum_{i,j=1}^{l}\hat\beta^k_i\hat\beta^k_j K_k(x_i, x_j)

then, since β̂^k maximizes (18) for K_k,

\sum_{i=1}^{l}\beta_i^* K_k(x_i, x_i) - \sum_{i,j=1}^{l}\beta_i^*\beta_j^* K_k(x_i, x_j) \le \sum_{i=1}^{l}\hat\beta^k_i K_k(x_i, x_i) - \sum_{i,j=1}^{l}\hat\beta^k_i\hat\beta^k_j K_k(x_i, x_j)

Therefore R_µ² ≤ Σ_{k=1}^{M} µ_k R_k².

Proof of convexity of R-MKL (Eq. 13). To prove that 13 is convex, it is enough to show that the functions x²/µ, where x ∈ R, µ ∈ R⁺, and ξ²/Σ_{k=1}^{M} α_k µ_k, where ξ ∈ R, µ_k, α_k ∈ R⁺, are convex. The first is a quadratic-over-linear function, which is convex. The second is convex because its epigraph is a convex set [21].