Support Vector Machines
Support Vector Machines
Support Vector Machines
A Flexible
Implementation for
Support Vector
Machines
Roland Nilsson
Johan Björkegren
Jesper Tegnér
Support vector machines (SVMs) are learning algorithms that have many
applications in pattern recognition and nonlinear regression. Being very
popular, SVM software is available in many versions. Still, existing imple-
mentations, usually in low-level languages such as C, are often difficult to
understand and adapt to specific research tasks. In this article, we present
a compact and yet flexible implementation of SVMs in Mathematica,
traditionally named MathSVM. This software is designed to be easy to
extend and modify, drawing on the powerful high-level language of
Mathematica.
‡ Background
A pattern recognition problem amounts to learning how to discriminate between
data points xi belonging to two classes, defined by class labels yi œ 8+1, -1<, when
given only a set of examples Hxi , yi L from each class. These problems are found
in various applications, from automated handwriting recognition to medical
expert systems, and pattern recognition or machine learning algorithms are
routinely applied to solve them.
It may be helpful for newcomers to relate this to a more familiar problem:
standard statistical hypothesis testing for one-dimensional xi , such as Student’s
t-test [1, ch. 8], can be viewed as a very simple kind of pattern recognition
problem. Here, the hypotheses H0 and H1 correspond to classes +1, -1, and the
familiar
l êê
o H0, if x < x yzz
gHxL = mo H , if x > êê ,
n 1 x{
where êê
x is the mean of within-class means,
êê 1 i1 1 y
x = ÅÅÅÅÅ jj ÅÅÅÅÅÅÅ ‚ xi + ÅÅÅÅÅÅÅÅÅÅ ‚ xi zz,
2 k l1 i: y1 =+1 l-1 i: y1 =-1 {
lc = » 8i : y1 = c< » and H0 < H1 , is called the decision rule or sometimes simply the
classifier. We say that the decision rule g is induced from data xi , in this case
determined by computing êê x.
However, real pattern recognition problems usually involve high-dimensional
data (such as image data) and unknown underlying distributions. In this situation,
it is nearly impossible to develop statistical tests like the preceding one. These
problems are typically attacked with algorithms, such as artificial neural networks
[2], decisions trees [3, ch. 18], Bayesian models [4], and recently SVMs [5], to
which we will devote the rest of this article. Here we will only consider data that
can be represented as vectors x œ Rn ; other kinds of information can usually be
changed to this form in some appropriate manner.
-4 -2 2
-1
-2
Figure 1. Two-class data (black and grey dots), their optimal separating hyperplane
(continuous line), and support vectors (circled in blue). This is an example output of the
SVMPlot function in MathSVM. The width of the “corridor” defined by the two dotted
lines connecting the support vectors is the margin of the optimal separating hyperplane.
In[1]:= MathSVM‘
Statistics‘NormalDistribution‘
In[3]:= ? QPSolve
The variable a has dim a = l, the number of data points, so the matrix Q has l 2
elements, which may be quite large for large problems. Therefore, QPSolve
employs a divide-and-conquer approach [7] that allows for solving (3) efficiently
without storing the full matrix Q in memory.
Having solved the dual problem for a using QPSolve, we obtain the optimal
weight vector w and bias term b, that is, the solution to the primal problem (2),
using the identities
w = ‚ ai yi xi (4)
i
1
b = - ÅÅÅÅÅ HwT x+ + wT x- L, (5)
2
where in (5) x+ , x- are any two support vectors belonging to class +1 and -1,
respectively (there always exist at least two such support vectors) [5].
The returned vector a is the solution found by QPSolve for the dual formulation
(3) of the SVM problem we just constructed. For this specific problem, the dual
formulation used is exactly that described by (3), that is
Q = Hqi j L = H yi y j x j xi L
p = H-1, … , -1L
a = H0, … , 0L, b = H0, … , 0L
c=0
The support vectors are immediately identifiable as the nonzero ai . (4) and (5)
are implemented as
In[9]:= WeightVectorΑ, X, y
Out[9]= 0.73531, 0.692296
In[10]:= BiasΑ, X, y
Out[10]= 0.973496
-3 -2 -1 1 2 3
-1
-2
-3
-4
· Nonseparable Data
Often the assumption of separable training data is not reasonable (it may fail
even for the preceding simple example). In such cases, NonseparableSVM should
be used. This SVM variant takes a parameter C that determines how hard points
violating the constraint in (2) should be penalized. This parameter appears in the
objective function of the primal problem, which now is formulated as [5]
1
minw,b,x ÅÅÅÅÅ »» w.w »» +C ‚ xi
2 i
subject to yi Hxi .w + bL ¥ 1 - xi , xi ¥ 0.
Large C means high penalty, and in the limit C Ø ¶ we obtain the separable case.
In[12]:= Τ 0.01;
Α NonseparableSVMX, y, 0.5, Τ
Out[13]= 0.3991, 0, 0, 0, 0, 0, 0, 0, 0, 0.1009, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.5
In[14]:= SVMPlotΑ, X, y
-3 -2 -1 1 2 3
-1
-2
-3
-4
· Getting QP Formulations
Sometimes it is interesting to examine what the QP problem looks like for a
given SVM formulation. Using the option FormulationOnly, we can inspect the
various parameters instead of actually solving the QP. This can, for example, be
used to study expressions analytically.
In[15]:= ClearX, y, Α, len
In[16]:= Τ 0.01;
NonseparableSVMArrayx, 2, 2,
Arrayy, 2, C, Τ, FormulationOnly True
,
1, 1, 0, 0, C, C, 0, y1, y2, 0.01
Kd Hx, yL = H1 + x. yLd
we can obtain any polynomial separating surfaces.
1 x.y
2
Out[18]=
,
i, len 2
;
y JoinTable1, len 2, Table1, len 2;
SVMDataPlotX, y, PlotRange All
0.05
-0.05
-0.1
-0.15
Let us solve this problem using the polynomial kernel. This is done as before, by
supplying the desired kernel (which can be any function accepting two argu-
ments) using the KernelFunction option.
In[23]:= Τ 0.01;
pk PolynomialKernel#1, #2, 2 &;
Α SeparableSVMX, y, Τ, KernelFunction pk
Out[25]= 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 3826.52, 0, 0, 0, 0, 0, 0, 0, 1146.88, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1644.97, 0, 0, 0, 0, 1034.67, 0
When visualizing the results, SVMPlot can use the kernel functions to draw any
nonlinear decision curves.
In[26]:= SVMPlotΑ, X, y, KernelFunction pk
0.05
-0.05
-0.1
-0.15
The SVM algorithm is still fast, although the problem is much harder due to
extremely low sample density, which is reflected by more support vectors (the
nonzero ai ).
In[32]:= Τ 0.01;
Α SeparableSVMX, y, Τ Timing
Out[33]= 0.79 Second, 0.0000185576, 9.32288 106 ,
0, 0, 0, 0.000031162, 0.0000215259, 0, 0.0000267382,
0.0000171359, 1.00351 106 , 0.0000297867, 0.0000293384, 0,
0.0000131243, 0, 0.000024641, 0.000020647, 0, 5.90165 106
The solution in this case may be viewed using the projection of data onto the
weight vector w. The separation looks almost perfect, although there will be
problems with overfitting (the solution may not work well when applied to new,
unseen examples x). However, this problem is outside the scope of the present
article.
In[34]:= ListPlotX.WeightVectorΑ, X, y
0.5
5 10 15 20
-0.5
-1
20
15
10
-4 -2 2 4
We can adapt the SVM method to the regression setting by using a e-insensitive
loss function
l
o 0, »» f HxL - gHxL »» < e
Le H f HxL, gHxLL = m
o »» f HxL - gHxL »» -e, otherwise
n
where gHxL is the SVM approximation to the regression function f HxL. This loss
function determines how much a deviation from the true f HxL is penalized; for
deviations less than e, no penalty is incurred. Here is what the loss function looks
like.
In[38]:= PlotIfAbsx 1, 0, Absx 1, x, 3, 3, AxesLabel "fg", "L"
L
2
1.5
0.5
f-g
-3 -2 -1 1 2 3
20
15
10
-4 -2 2 4
Out[44]= 4.83561
We can, of course, also obtain the analytical expression of the estimated regres-
sion function.
In[45]:= RegressionFunctionΑ, X, y, Ε, x, KernelFunction pk
· Two-Dimensional Example
We can use SVM regression with domains of any dimension (that is the main
advantage). Here is a simple two-dimensional example.
0
0 2 4 6 8
3
4
2.5
2
2
-4
4 0
-2
0 -2
2
-4
4
‡ Conclusion
In this article, we have demonstrated the utility of the MathSVM package for
solving pattern recognition and regression problems. This is an area of very
active research and these algorithms are evolving quickly. In a rapidly moving
field such as this, it is important to have a clear, well documented, high-level
approach to implementation to minimize confusion. Mathematica provides an
excellent solution here, due to its high-level programming language and sym-
bolic capabilities.
MathSVM is currently 100% native Mathematica code, written with the emphasis
on clarity. This does incur penalties in terms of computational speed. Some parts
of the QP algorithm are therefore being ported to Java at this time to improve
performance. This should not impair the clarity of the software in any way, since
the QPSolve function is easy separable from the other parts of MathSVM in a
“black box” fashion.
The MathSVM software is still in its infancy and will no doubt expand rapidly, as
our group is currently involved in many projects in pattern recognition and
high-dimensional data analysis in general, as well as in a biomedical context. We
hope that this contribution will initiate other efforts to bring understandable
implementations of machine learning algorithms to the Mathematica community.
‡ References
[1] G. Casella and R. L. Berger, Statistical Inference, 2nd ed., Belmont, CA: Duxbury Press,
2002.
[2] S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd ed., Englewood Cliffs,
NJ: Prentice Hall, 1999.
[3] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, Englewood Cliffs,
NJ: Prentice Hall, 1995.
‡ Additional Material
MathSVM.nb
MathSVM.m
Available at www.mathematica-journal.com/issue/v10i1/download.
Roland Nilsson
Computational Biology
Linköping University
SE-58183 Linköping, Sweden
rolle@ifm.liu.se
Johan Björkegren
Center for Genomics and Bioinformatics
Karolinska Institutet
SE-17177 Stockholm, Sweden
johan.bjorkegren@ks.se
Jesper Tegnér
Computational Biology
Linköping University
SE-58183 Linköping, Sweden
jespert@ifm.liu.se