A Multi-Objective Genetic Algorithm for Pruning Support Vector Machines

Mohamed Abdel Hady, Wessam Herbawi,
Friedhelm Schwenker
Institute of Neural Information Processing
University of Ulm, Germany
{mohamed.abdel-hady}@uni-ulm.de
November 4, 2011

Support Vector Machine
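[Figure: two classes of examples (y = +1 and y = -1) separated by the maximum-margin hyperplane {x | ⟨w, ϕ(x)⟩ + b = 0} with normal vector w. The margin boundaries are {x | ⟨w, ϕ(x)⟩ + b = -1} and {x | ⟨w, ϕ(x)⟩ + b = +1}. Four examples that violate the margin are annotated with their slack variables ξ1, ..., ξ4 (drawn as є1, ..., є4).]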

To obtain the optimal hyperplane, one solves the following convex quadratic
optimization problem with respect to the weight vector w and bias b:
\[ \min_{w,b} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \tag{1} \]
subject to the constraints
\[ y_i(\langle w, \phi(x_i)\rangle + b) \ge 1 - \xi_i, \quad \xi_i \ge 0 \quad \text{for } i = 1, \dots, n. \tag{2} \]
The regularization parameter C controls the trade-off between maximizing the margin
1/\|w\| and minimizing the sum of the slack variables of the training examples,
\[ \xi_i = \max(0,\; 1 - y_i(\langle w, \phi(x_i)\rangle + b)) \quad \text{for } i = 1, \dots, n. \tag{3} \]
The training example x_i is correctly classified if 0 ≤ ξ_i < 1 and is misclassified when
ξ_i ≥ 1.
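The slack variables in Eq. (3) and the objective in Eq. (1) can be illustrated with a minimal sketch. It assumes a linear feature map (ϕ is the identity), and the weight vector, bias and four toy examples are hypothetical values chosen only for illustration; they do not come from the experiments.

```python
def slack(w, b, x, y):
    """Slack xi_i = max(0, 1 - y_i * (<w, x_i> + b)), Eq. (3), with phi = identity."""
    margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
    return max(0.0, 1.0 - margin)


def primal_objective(w, b, X, Y, C):
    """0.5 * ||w||^2 + C * sum of slacks, Eq. (1)."""
    regularizer = 0.5 * sum(wj * wj for wj in w)
    return regularizer + C * sum(slack(w, b, x, y) for x, y in zip(X, Y))


# Hypothetical toy problem (not from the slides).
w, b, C = [1.0, 0.0], 0.0, 1.0
X = [[2.0, 0.0], [0.5, 1.0], [-0.5, 0.0], [-3.0, 2.0]]
Y = [1, 1, 1, -1]

slacks = [slack(w, b, x, y) for x, y in zip(X, Y)]
objective = primal_objective(w, b, X, Y, C)
```

Here the second example lies inside the margin but on the correct side (slack between 0 and 1), while the third is misclassified (slack above 1).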

The problem is converted into its equivalent dual problem, using standard Lagrangian
techniques, whose number of variables is the number of training examples:
\[ \max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j k(x_i, x_j) \tag{4} \]
subject to the constraints
\[ \sum_{i=1}^{n} \alpha_i y_i = 0 \quad \text{and} \quad 0 \le \alpha_i \le C \quad \text{for } i = 1, \dots, n, \tag{5} \]
where the coefficients α_i^* are the optimal solution of the dual problem and k is the
kernel function. Hence, the decision function to classify an unseen example x can be
written as:
\[ f(x) = \sum_{i=1}^{n_{sv}} \alpha_i^* y_i k(x, x_i) + b^*. \tag{6} \]
The training examples x_i with α_i^* > 0 are called support vectors, and the number of
support vectors is denoted by n_sv ≤ n.
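As a rough illustration of Eq. (6), the sketch below evaluates the decision function from a set of support vectors and their dual coefficients. The function names, the two support vectors, the coefficients and the kernel width are all hypothetical; the Gaussian kernel is used because it is the kernel of the experiments (Eq. (10)), and predicting +1 when f(x) = 0 is an assumed convention.

```python
import math


def gaussian_kernel(x, z, gamma):
    """k(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - c) ** 2 for a, c in zip(x, z)))


def decision_function(x, support_vectors, sv_labels, alphas, b, gamma):
    """f(x) = sum_i alpha_i* y_i k(x, x_i) + b*, Eq. (6)."""
    return sum(
        a * y * gaussian_kernel(x, sv, gamma)
        for a, y, sv in zip(alphas, sv_labels, support_vectors)
    ) + b


def predict(x, support_vectors, sv_labels, alphas, b, gamma):
    """Predicted label sgn(f(x)); f(x) = 0 is mapped to +1 by assumption."""
    f = decision_function(x, support_vectors, sv_labels, alphas, b, gamma)
    return 1 if f >= 0 else -1


# Hypothetical model with two support vectors; sum_i alpha_i y_i = 0 holds.
support_vectors = [[1.0, 0.0], [-1.0, 0.0]]
sv_labels = [1, -1]
alphas = [1.0, 1.0]
b, gamma = 0.0, 0.5

f_pos = decision_function([1.0, 0.0], support_vectors, sv_labels, alphas, b, gamma)
label_pos = predict([1.0, 0.0], support_vectors, sv_labels, alphas, b, gamma)
label_neg = predict([-1.0, 0.0], support_vectors, sv_labels, alphas, b, gamma)
```

Each prediction costs one kernel evaluation per support vector, which is why classification time grows with n_sv.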

SVM Pruning
The classification time of an SVM classifier scales linearly with the number of
support vectors, O(n_sv).
To reduce the complexity of an SVM, the number of support vectors should be
reduced.
To reduce the overfitting (over-training) of an SVM, the number of support vectors
should also be reduced.
Indirect methods: reduce the number of training examples
{(x_i, y_i) : i = 1, ..., n} [Pedrajas, IEEE TNN 2009].
Direct methods: the multi-objective evolutionary SVM proposed in this paper is
the first evolutionary algorithm that reformulates SVM pruning as a combinatorial
multi-objective optimization problem.

Genetic Algorithm for Support Vector Selection
[Figure: flow diagram. The genetic algorithm proposes support vector indices; the simplified SVM decision function is evaluated; the fitness of the individuals in the population is evaluated on two objectives (number of support vectors and training error); the GA operators (selection, crossover and mutation) produce the next population.]

Representation (Encoding)
For support vector selection a binary encoding is appropriate. Here, the t-th
candidate solution in a population is an n_sv-dimensional bit vector s_t ∈ {0, 1}^{n_sv}.
The j-th support vector is included in the decision function if s_tj = 1 and
excluded if s_tj = 0. For instance, for a problem with 7 support
vectors, the t-th individual of the population can be represented as
s_t = (1, 0, 0, 1, 1, 1, 0) or s_t = (0, 1, 0, 1, 1, 0, 1).
Then, for each solution with bit vector s_t, the sum runs only over the selected
support vectors to define the reduced decision function f_reduced,
which is used in Eq. (9) to evaluate the fitness of solution s_t:
\[ f_{\mathrm{reduced}}(x_i, s_t) = \sum_{j=1}^{n_{sv}} s_{tj}\, \alpha_j^* y_j K_{ij} + b^*, \tag{7} \]
where K_ij = k(x_i, x_j) is the kernel value between training example x_i and the j-th support vector.
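A minimal sketch of the reduced decision function in Eq. (7), using the 7-bit example s_t = (1, 0, 0, 1, 1, 1, 0). The dual coefficients, support vector labels and the kernel row K_i are hypothetical numbers picked so the arithmetic is easy to follow; only the bit vector comes from the slide.

```python
def f_reduced(kernel_row, s, alphas, sv_labels, b):
    """Eq. (7): sum over j of s_j * alpha_j* * y_j * K_ij, plus b*.

    kernel_row holds K_ij for one training example x_i and every support vector j.
    Support vectors whose bit is 0 are skipped entirely.
    """
    return sum(
        a * y * k
        for bit, a, y, k in zip(s, alphas, sv_labels, kernel_row)
        if bit
    ) + b


# Hypothetical trained SVM with 7 support vectors (sum_j alpha_j y_j = 0).
alphas = [0.5, 1.0, 0.25, 1.0, 0.75, 0.5, 1.0]
sv_labels = [1, -1, 1, -1, 1, -1, 1]
kernel_row = [1.0, 0.5, 0.5, 0.25, 1.0, 0.5, 0.25]  # K_ij for one example x_i
b = 0.0

s_full = (1, 1, 1, 1, 1, 1, 1)   # unpruned SVM
s_t = (1, 0, 0, 1, 1, 1, 0)      # individual from the slide

f_full = f_reduced(kernel_row, s_full, alphas, sv_labels, b)
f_pruned = f_reduced(kernel_row, s_t, alphas, sv_labels, b)
```

Because the kernel matrix between training examples and support vectors does not depend on s_t, it can be computed once and reused for every individual; the sketch assumes this precomputation.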

Selection Criteria (Objectives)
The selection criteria determine the quality of each candidate solution in the population. We want to
design classifiers with high generalization ability.
There is a trade-off between SVM complexity and its training error (the number
of misclassified examples in the set of n training examples).
The following two objective functions are used to measure the fitness of a solution
s_t:
\[ f_1(s_t) = \sum_{j=1}^{n_{sv}} s_{tj} \tag{8} \]
and
\[ f_2(s_t) = \sum_{i=1}^{n} \mathbf{1}\big(y_i \neq \operatorname{sgn}(f_{\mathrm{reduced}}(x_i, s_t))\big), \tag{9} \]
where f_1 counts the support vectors retained by s_t, f_reduced is the reduced decision function defined in Eq. (7), sgn is the
sign function with values -1 and +1, and 1(·) is the indicator function, so that f_2 counts the misclassified training examples. It is easy to achieve zero training error
when all training examples are support vectors, but this solution is not likely to
generalize well (prone to overfitting).
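The two objectives in Eq. (8) and Eq. (9) can be sketched as a single fitness function. The kernel matrix, coefficients and labels below are hypothetical toy values, and mapping sgn(0) to +1 is an assumed convention.

```python
def sgn(value):
    """Sign function with values -1 and +1 (0 is mapped to +1 by assumption)."""
    return 1 if value >= 0 else -1


def objectives(s, K, alphas, sv_labels, b, y_train):
    """Return (f1, f2): retained support vectors and misclassified training examples."""
    f1 = sum(s)  # Eq. (8)
    f2 = 0       # Eq. (9)
    for kernel_row, y_i in zip(K, y_train):
        f = sum(
            a * y * k
            for bit, a, y, k in zip(s, alphas, sv_labels, kernel_row)
            if bit
        ) + b
        if sgn(f) != y_i:
            f2 += 1
    return f1, f2


# Hypothetical SVM with 3 support vectors and 4 training examples.
alphas = [1.0, 2.0, 1.0]
sv_labels = [1, -1, 1]
b = 0.0
K = [
    [1.0, 0.25, 0.5],
    [0.25, 1.0, 0.25],
    [0.5, 0.125, 1.0],
    [0.25, 0.5, 0.5],
]
y_train = [1, -1, 1, -1]

fit_full = objectives((1, 1, 1), K, alphas, sv_labels, b, y_train)
fit_a = objectives((1, 1, 0), K, alphas, sv_labels, b, y_train)
fit_b = objectives((1, 0, 1), K, alphas, sv_labels, b, y_train)
fit_empty = objectives((0, 0, 0), K, alphas, sv_labels, b, y_train)
```

In this toy case dropping the third support vector keeps the training error at zero, whereas dropping the second one misclassifies both negative examples, which is the kind of trade-off the two objectives expose.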

Experimental Setup
Soft-margin L1-SVMs with Gaussian kernel function
\[ k(x, x_i) = \exp(-\gamma \|x - x_i\|^2) \tag{10} \]
with γ = 1/d and regularization parameter C = 1.
Four benchmark datasets from the UCI Benchmark Repository: ionosphere, diabetes,
sick, and german credit, where the number of features (d) is 34, 8, 29, and 20,
respectively.
All features are normalized to have zero mean and unit variance.
Each dataset is divided randomly into two subsets: 10% is used as the test set
D_test, while the remaining 90% is used as the training set D_train. Thus, the
size of the training set (n) is 315, 691, 3394 and 900, and the size of the test set (m) is
36, 77, 378 and 100, respectively.
At the beginning of the experiment, a soft-margin L1-norm SVM is trained
on D_train using the SMO algorithm.
The training error f_2(s_t) of each individual solution s_t (support vector subset) is
evaluated on D_train, where CE(train) = f_2(s_t)/n. After each run of MOGA,
we evaluate the average test set error CE(test) of each solution in the final set of
Pareto-optimal solutions on D_test.
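The preprocessing described above can be sketched as follows. The helper names are hypothetical, the total dataset sizes are obtained by adding the reported n and m, and rounding the 10% test share up is an assumption that happens to reproduce the reported split sizes; the slides do not state how the split was rounded or seeded.

```python
import random
import statistics


def standardize(X):
    """Scale every feature (column) to zero mean and unit variance."""
    columns = list(zip(*X))
    means = [statistics.mean(c) for c in columns]
    # A constant feature has zero deviation; divide by 1.0 instead to avoid 0/0.
    stds = [statistics.pstdev(c) or 1.0 for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]


def split_indices(n_total, seed=0):
    """Random 90/10 split; the test share is rounded up (assumption)."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)
    m = (n_total + 9) // 10  # ceil(n_total / 10)
    return indices[m:], indices[:m]


# Total sizes are n + m from the slide; d is the reported number of features.
dataset_sizes = {"ionosphere": 351, "diabetes": 768, "sick": 3772, "german credit": 1000}
num_features = {"ionosphere": 34, "diabetes": 8, "sick": 29, "german credit": 20}

sizes = {}
gammas = {}
for name, total in dataset_sizes.items():
    train_idx, test_idx = split_indices(total)
    sizes[name] = (len(train_idx), len(test_idx))
    gammas[name] = 1.0 / num_features[name]  # gamma = 1/d

toy_scaled = standardize([[1.0, 5.0], [3.0, 5.0]])
```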

Experimental Results
We apply NSGA-II with a population size of 100; the other parameters are the
crossover probability pc = 0.9, the mutation probability pmut = 1/n_sv, and the
distribution indices ηc = 20 and ηmut = 20. The two objectives given in Eq. (8)
and Eq. (9) are optimized.
For each dataset, ten optimization runs of MOGA are carried out, each of them
lasting for 10000 generations.
Pareto-optimal solutions after pruning compared to the unpruned SVM. Each solution
is written as a triple [n_sv, n·CE(train), m·CE(test)]; the "after" columns give the two
ends of the range of Pareto-optimal solutions.

dataset         before           after (from)    after (to)
ionosphere      [101, 4, 10]     [0, 202, 23]    [15, 3, 5]
diabetes        [399, 126, 14]   [0, 450, 50]    [101, 125, 18]
sick            [503, 88, 12]    [0, 208, 23]    [92, 83, 13]
german credit   [820, 20, 27]    [8, 259, 26]    [283, 57, 22]
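Pareto optimality with respect to the two minimized objectives can be illustrated with a short dominance check. This is a generic sketch of the dominance relation, not the NSGA-II sorting procedure itself; the function names are hypothetical, and the points are the (n_sv, n·CE(train)) pairs reported for ionosphere.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def non_dominated(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# (number of support vectors, number of training errors) for ionosphere.
unpruned = (101, 4)
pruned_low = (0, 202)
pruned_high = (15, 3)

pruned_beats_unpruned = dominates(pruned_high, unpruned)
front = non_dominated([unpruned, pruned_low, pruned_high])
```

The pruned solution with 15 support vectors and 3 training errors dominates the unpruned SVM, while the two pruned endpoints do not dominate each other.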

Pareto Fronts
[Figure: Pareto fronts in four panels (ionosphere, diabetes, sick, german credit). Each panel shows the classification error plotted against the number of support vectors, with four series: CE(train) and CE(test) after pruning, and CE(train) and CE(test) before pruning.]

For many solutions for ionosphere and german credit, we can see the effect of
overfitting: the generalization ability of the SVM classifier improved after
pruning while the training error got worse.
A typical MOO heuristic is to select a solution (support vector subset) that
corresponds to an interesting part of the Pareto front.

Attainment Surfaces
[Figure: attainment surfaces in four panels (ionosphere, diabetes, sick, german credit). Each panel shows the 1st, 5th and 10th attainment surfaces together with the unpruned SVM (before pruning).]

The attainment curves have a maximum complexity of 22, 132, 171, and 300 support vectors for
ionosphere, diabetes, sick and german credit, respectively. That is, the
evolutionary pruning approach achieved complexity reductions of
78.2%, 66.9%, 66.0% and 63.4% for the four datasets, respectively,
without sacrificing the training error.
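The reported percentages follow from the number of support vectors before pruning (101, 399, 503 and 820 in the results table) and the maximum complexity of the attainment curves. A short check of the arithmetic, with hypothetical variable names:

```python
# Support vectors of the unpruned SVM and maximum complexity after pruning.
nsv_before = {"ionosphere": 101, "diabetes": 399, "sick": 503, "german credit": 820}
nsv_after = {"ionosphere": 22, "diabetes": 132, "sick": 171, "german credit": 300}

# Complexity reduction in percent: 100 * (1 - after / before).
reduction = {
    name: round(100.0 * (1.0 - nsv_after[name] / nsv_before[name]), 1)
    for name in nsv_before
}
```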

Conclusion
Support vector selection is a multi-objective optimization problem. We have
described a genetic algorithm that reduces the computational complexity of support
vector machines by reducing the number of support vectors in their
decision functions.
The resulting Pareto fronts visualize the trade-off between SVM complexity and
its training error, which guides the support vector selection.
For some datasets, the experimental results show that the test set classification
accuracy is improved after pruning without sacrificing the training set accuracy.
Thus, post-pruning of SVMs has the same effect as post-pruning of
decision trees: it reduces overfitting.

Future Work
We plan to extend the proposed approach to regression tasks, which
suffer from the same problem: a large number of support vectors in the
decision functions of support vector regression machines.
In addition, we will conduct further experiments using other types of kernel
functions, as we used only Gaussian kernels in the presented experiments. We
expect the percentage of complexity reduction to be kernel-dependent.

Thanks for your attention
Questions?
