Kernel Methods in Machine Learning
1. Introduction. Over the last ten years estimation and learning methods utilizing positive definite kernels have become rather popular, particularly in machine learning. Since these methods have a stronger mathematical
slant than earlier machine learning methods (e.g., neural networks), there
is also significant interest in the statistics and mathematics community for
these methods. The present review aims to summarize the state of the art on
a conceptual level. In doing so, we build on various sources, including Burges
[25], Cristianini and Shawe-Taylor [37], Herbrich [64] and Vapnik [141] and,
in particular, Schölkopf and Smola [118], but we also add a fair amount of
more recent material which helps unify the exposition. We have not had
space to include proofs; they can be found either in the long version of the
present paper (see Hofmann et al. [69]), in the references given or in the
above books.
The main idea of all the described methods can be summarized in one
paragraph. Traditionally, theory and algorithms of machine learning and
statistics have been very well developed for the linear case. Real world data
analysis problems, on the other hand, often require nonlinear methods to detect the kind of dependencies that allow successful prediction of properties
of interest. By using a positive definite kernel, one can sometimes have the
best of both worlds. The kernel corresponds to a dot product in a (usually
high-dimensional) feature space. In this space, our estimation methods are
linear, but as long as we can formulate everything in terms of kernel evaluations, we never explicitly have to compute in the high-dimensional feature
space.
The paper has three main sections: Section 2 deals with fundamental
properties of kernels, with special emphasis on (conditionally) positive definite kernels and their characterization. We give concrete examples for such
kernels and discuss kernels and reproducing kernel Hilbert spaces in the context of regularization. Section 3 presents various approaches for estimating
dependencies and analyzing data that make use of kernels. We provide an
overview of the problem formulations as well as their solution using convex
programming techniques. Finally, Section 4 examines the use of reproducing kernel Hilbert spaces as a means to define statistical models, the focus
being on structured, multidimensional responses. We also show how such
techniques can be combined with Markov networks as a suitable framework
to model dependencies between response variables.
2. Kernels.
2.1. An introductory example. Suppose we are given empirical data
(1)   $(x_1, y_1), \ldots, (x_n, y_n) \in \mathcal{X} \times \mathcal{Y}$.
Here, the domain $\mathcal{X}$ is some nonempty set that the inputs (the predictor variables) $x_i$ are taken from; the $y_i \in \mathcal{Y}$ are called targets (the response variable). Here and below, $i, j \in [n]$, where we use the notation $[n] := \{1, \ldots, n\}$.
Note that we have not made any assumptions on the domain X other
than it being a set. In order to study the problem of learning, we need
additional structure. In learning, we want to be able to generalize to unseen
data points. In the case of binary pattern recognition, given some new input
$x \in \mathcal{X}$, we want to predict the corresponding $y \in \{\pm 1\}$ (more complex output
domains $\mathcal{Y}$ will be treated below). Loosely speaking, we want to choose $y$
such that $(x, y)$ is in some sense similar to the training examples. To this
end, we need similarity measures in $\mathcal{X}$ and in $\{\pm 1\}$. The latter is easier,
as two target values can only be identical or different. For the former, we
require a function

(2)   $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}, \qquad (x, x') \mapsto k(x, x')$
Fig. 1. A simple geometric classification algorithm: given two classes of points (depicted by "o" and "+"), compute their means $c_+, c_-$ and assign a test input $x$ to the one whose mean is closer. This can be done by looking at the dot product between $x - c$ [where $c = (c_+ + c_-)/2$] and $w := c_+ - c_-$, which changes sign as the enclosed angle passes through $\pi/2$. Note that the corresponding decision boundary is a hyperplane (the dotted line) orthogonal to $w$ (from Schölkopf and Smola [118]).
satisfying, for all $x, x' \in \mathcal{X}$,

(3)   $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$,
where $\Phi$ maps into some dot product space $\mathcal{H}$, sometimes called the feature space. The similarity measure $k$ is usually called a kernel, and $\Phi$ is called its feature map.
The advantage of using such a kernel as a similarity measure is that
it allows us to construct algorithms in dot product spaces. For instance,
consider the following simple classification algorithm, described in Figure 1,
where $\mathcal{Y} = \{\pm 1\}$. The idea is to compute the means of the two classes in the feature space, $c_+ = \frac{1}{n_+}\sum_{\{i: y_i = +1\}} \Phi(x_i)$ and $c_- = \frac{1}{n_-}\sum_{\{i: y_i = -1\}} \Phi(x_i)$, where $n_+$ and $n_-$ are the number of examples with positive and negative target values, respectively. We then assign a new point $\Phi(x)$ to the class
whose mean is closer to it. This leads to the prediction rule
(4)   $y = \operatorname{sgn}\bigl(\langle \Phi(x), c_+ \rangle - \langle \Phi(x), c_- \rangle + b\bigr)$

(5)   $y = \operatorname{sgn}\Bigl(\frac{1}{n_+}\sum_{\{i: y_i = +1\}} \underbrace{\langle \Phi(x), \Phi(x_i) \rangle}_{k(x, x_i)} \; - \; \frac{1}{n_-}\sum_{\{i: y_i = -1\}} \underbrace{\langle \Phi(x), \Phi(x_i) \rangle}_{k(x, x_i)} + b\Bigr),$

where $b = \frac{1}{2}\Bigl(\frac{1}{n_-^2}\sum_{\{(i,j): y_i = y_j = -1\}} k(x_i, x_j) - \frac{1}{n_+^2}\sum_{\{(i,j): y_i = y_j = +1\}} k(x_i, x_j)\Bigr)$.
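To make the prediction rule (5) concrete, the following sketch implements the mean-based classifier with a Gaussian kernel in NumPy. The kernel choice, the bandwidth value and all variable names are illustrative assumptions rather than part of the original derivation.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    """k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)) for all pairs of rows."""
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def mean_classifier_predict(X_train, y_train, X_test, sigma=1.0):
    """Prediction rule (5): compare average kernel similarity to each class."""
    pos, neg = X_train[y_train == 1], X_train[y_train == -1]
    K_pos = gaussian_kernel(X_test, pos, sigma)   # similarities to positive class
    K_neg = gaussian_kernel(X_test, neg, sigma)   # similarities to negative class
    # offset b: half the difference of the mean within-class similarities
    b = 0.5 * (gaussian_kernel(neg, neg, sigma).mean()
               - gaussian_kernel(pos, pos, sigma).mean())
    return np.sign(K_pos.mean(axis=1) - K_neg.mean(axis=1) + b)

# toy usage
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (20, 2)), rng.normal(+1, 1, (20, 2))])
y = np.array([-1] * 20 + [+1] * 20)
print(mean_classifier_predict(X, y, X)[:5])
```

With a kernel that is a density, this is exactly the Parzen-windows style rule discussed next.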
Let us consider one well-known special case of this type of classifier. Assume that the class means have the same distance to the origin (hence, $b = 0$), and that $k(\cdot, x)$ is a density for all $x \in \mathcal{X}$. If the two classes are equally likely and were generated from two probability distributions that are estimated as

(6)   $p_+(x) := \frac{1}{n_+}\sum_{\{i: y_i = +1\}} k(x, x_i), \qquad p_-(x) := \frac{1}{n_-}\sum_{\{i: y_i = -1\}} k(x, x_i)$,

then (5) is the estimated Bayes decision rule, plugging in the estimates $p_+$ and $p_-$ for the true densities.
The classifier (5) is closely related to the Support Vector Machine (SVM)
that we will discuss below. It is linear in the feature space (4), while in the
input domain, it is represented by a kernel expansion (5). In both cases, the
decision boundary is a hyperplane in the feature space; however, the normal
vectors [for (4), $w = c_+ - c_-$] are usually rather different.
The normal vector not only characterizes the alignment of the hyperplane,
its length can also be used to construct tests for the equality of the two class-generating distributions (Borgwardt et al. [22]).
As an aside, note that if we normalize the targets such that $\hat{y}_i = y_i / |\{j : y_j = y_i\}|$, in which case the $\hat{y}_i$ sum to zero, then $\|w\|^2 = \langle K, \hat{y}\hat{y}^\top \rangle_F$, where $\langle \cdot, \cdot \rangle_F$ is the Frobenius dot product. If the two classes have equal size, then up to a scaling factor involving $\|K\|_2$ and $n$, this equals the kernel–target alignment defined by Cristianini et al. [38].
2.2. Positive definite kernels. We have required that a kernel satisfy (3),
that is, correspond to a dot product in some dot product space. In the
present section we show that the class of kernels that can be written in the
form (3) coincides with the class of positive definite kernels. This has far-reaching consequences. There are examples of positive definite kernels which can be evaluated efficiently even though they correspond to dot products in infinite-dimensional dot product spaces. In such cases, substituting $k(x, x')$ for $\langle \Phi(x), \Phi(x') \rangle$, as we have done in (5), is crucial. In the machine learning
community, this substitution is called the kernel trick.
Definition 1 (Gram matrix). Given a kernel $k$ and inputs $x_1, \ldots, x_n \in \mathcal{X}$, the $n \times n$ matrix

(7)   $K := (k(x_i, x_j))_{ij}$

is called the Gram matrix (or kernel matrix) of $k$ with respect to $x_1, \ldots, x_n$.

Definition 2 (Positive definite matrix). A real $n \times n$ symmetric matrix $K_{ij}$ satisfying

(8)   $\sum_{i,j} c_i c_j K_{ij} \geq 0$

for all $c_i \in \mathbb{R}$ is called positive definite. If equality in (8) only occurs for $c_1 = \cdots = c_n = 0$, then we shall call the matrix strictly positive definite.
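As a quick numerical illustration of the positive definiteness condition above (purely a sketch; the kernel and data points below are arbitrary choices), one can verify that a Gram matrix is positive definite by checking that its eigenvalues are nonnegative:

```python
import numpy as np

def gram_matrix(xs, k):
    """K_ij = k(x_i, x_j) as in (7)."""
    n = len(xs)
    return np.array([[k(xs[i], xs[j]) for j in range(n)] for i in range(n)])

def is_positive_definite(K, tol=1e-10):
    """All eigenvalues of the symmetric matrix K are >= 0 (up to tolerance)."""
    return np.linalg.eigvalsh(K).min() >= -tol

xs = [np.array([0.0, 1.0]), np.array([1.0, 0.5]), np.array([-2.0, 3.0])]
k_gauss = lambda x, z: np.exp(-np.sum((x - z) ** 2) / 2.0)   # Gaussian kernel
print(is_positive_definite(gram_matrix(xs, k_gauss)))        # True
```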
2.2.1. Construction of the reproducing kernel Hilbert space. We now define a map from $\mathcal{X}$ into the space of functions mapping $\mathcal{X}$ into $\mathbb{R}$, denoted as $\mathbb{R}^{\mathcal{X}}$, via

(10)   $\Phi: \mathcal{X} \to \mathbb{R}^{\mathcal{X}}, \qquad x \mapsto k(\cdot, x)$.

Here, $\Phi(x) = k(\cdot, x)$ denotes the function that assigns the value $k(x', x)$ to $x' \in \mathcal{X}$.
We next construct a dot product space containing the images of the inputs under $\Phi$. To this end, we first turn it into a vector space by forming linear combinations

(11)   $f(\cdot) = \sum_{i=1}^{n} \alpha_i k(\cdot, x_i)$.

Next, for a second function $g(\cdot) = \sum_{j=1}^{n'} \beta_j k(\cdot, x'_j)$, we define a dot product

(12)   $\langle f, g \rangle := \sum_{i=1}^{n} \sum_{j=1}^{n'} \alpha_i \beta_j k(x_i, x'_j)$.
In particular,

(13)   $\langle f, f \rangle = \sum_{i,j=1}^{n} \alpha_i \alpha_j k(x_i, x_j) \geq 0$.
Moreover, for any functions $f_1, \ldots, f_p$ and coefficients $\gamma_1, \ldots, \gamma_p \in \mathbb{R}$,

(14)   $\sum_{i,j=1}^{p} \gamma_i \gamma_j \langle f_i, f_j \rangle = \Bigl\langle \sum_{i=1}^{p} \gamma_i f_i, \sum_{j=1}^{p} \gamma_j f_j \Bigr\rangle \geq 0$.
Here, the equality follows from the bilinearity of $\langle \cdot, \cdot \rangle$, and the right-hand inequality from (13).
By (14), $\langle \cdot, \cdot \rangle$ is a positive definite kernel, defined on our vector space of functions. For the last step in proving that it even is a dot product, we note that, by (12), for all functions (11),

(15)   $\langle k(\cdot, x), f \rangle = f(x)$ and, in particular, $\langle k(\cdot, x), k(\cdot, x') \rangle = k(x, x')$.
By virtue of these properties, k is called a reproducing kernel (Aronszajn
[7]).
Due to (15) and (9), we have

(16)   $|f(x)|^2 = |\langle k(\cdot, x), f \rangle|^2 \leq k(x, x) \cdot \langle f, f \rangle$.

A larger class of kernels than the positive definite ones is obtained by requiring

(17)   $\sum_{i,j} c_i c_j k(x_i, x_j) \geq 0$

only for coefficients satisfying $\sum_{i=1}^{n} c_i = 0$; such kernels are called conditionally positive definite.
Interestingly, it turns out that many kernel algorithms, including SVMs and
kernel PCA (see Section 3), can be applied also with this larger class of
kernels, due to their being translation invariant in feature space (Hein et al. [63] and Schölkopf and Smola [118]).
We conclude this section with a note on terminology. In the early years of
kernel machine learning research, it was not the notion of positive definite
kernels that was being used. Instead, researchers considered kernels satisfying the conditions of Mercer's theorem (Mercer [99]; see, e.g., Cristianini
and Shawe-Taylor [37] and Vapnik [141]). However, while all such kernels do
satisfy (3), the converse is not true. Since (3) is what we are interested in,
positive definite kernels are thus the right class of kernels to consider.
2.2.2. Properties of positive definite kernels. We begin with some closure
properties of the set of positive definite kernels.
Proposition 4. Below, $k_1, k_2, \ldots$ are arbitrary positive definite kernels on $\mathcal{X} \times \mathcal{X}$, where $\mathcal{X}$ is a nonempty set:

(i) The set of positive definite kernels is a closed convex cone, that is, (a) if $\lambda_1, \lambda_2 \geq 0$, then $\lambda_1 k_1 + \lambda_2 k_2$ is positive definite; and (b) if $k(x, x') := \lim_{n \to \infty} k_n(x, x')$ exists for all $x, x'$, then $k$ is positive definite.
(ii) The pointwise product $k_1 k_2$ is positive definite.
(iii) Assume that for $i = 1, 2$, $k_i$ is a positive definite kernel on $\mathcal{X}_i \times \mathcal{X}_i$, where $\mathcal{X}_i$ is a nonempty set. Then the tensor product $k_1 \otimes k_2$ and the direct sum $k_1 \oplus k_2$ are positive definite kernels on $(\mathcal{X}_1 \times \mathcal{X}_2) \times (\mathcal{X}_1 \times \mathcal{X}_2)$.
The proofs can be found in Berg et al. [18].
It is reassuring that sums and products of positive definite kernels are
positive definite. We will now explain that, loosely speaking, there are no
other operations that preserve positive definiteness. To this end, let $C$ denote the set of all functions $\psi: \mathbb{R} \to \mathbb{R}$ that map positive definite kernels to (conditionally) positive definite kernels (readers who are not interested in the case of conditionally positive definite kernels may ignore the term in parentheses). We define

$C := \{\psi \mid k \text{ is a p.d. kernel} \Rightarrow \psi(k) \text{ is a (conditionally) p.d. kernel}\}$,
$C' := \{\psi \mid \text{for any Hilbert space } F,\ \psi(\langle x, x' \rangle_F) \text{ is (conditionally) positive definite}\}$,
$C'' := \{\psi \mid \text{for all } n \in \mathbb{N}: K \text{ is a p.d. } n \times n \text{ matrix} \Rightarrow \psi(K) \text{ is (conditionally) p.d.}\}$,

where $\psi(K)$ is the $n \times n$ matrix with elements $\psi(K_{ij})$.

Proposition 5.   $C = C' = C''$.
The following proposition follows from a result of FitzGerald et al. [50] for
(conditionally) positive definite matrices; by Proposition 5, it also applies for
(conditionally) positive definite kernels, and for functions of dot products.
We state the latter case.
Proposition 6. Let $\psi: \mathbb{R} \to \mathbb{R}$. Then $\psi(\langle x, x' \rangle_F)$ is positive definite for any Hilbert space $F$ if and only if $\psi$ is real entire of the form

(18)   $\psi(t) = \sum_{n=0}^{\infty} a_n t^n$

with $a_n \geq 0$ for $n \geq 0$.
Moreover, $\psi(\langle x, x' \rangle_F)$ is conditionally positive definite for any Hilbert space $F$ if and only if $\psi$ is real entire of the form (18) with $a_n \geq 0$ for $n \geq 1$.
There are further properties of $k$ that can be read off the coefficients $a_n$: Steinwart [128] showed that if all $a_n$ are strictly positive, then the kernel of Proposition 6 is universal on every compact subset $S$ of $\mathbb{R}^d$ in the sense that its RKHS is dense in the space of continuous functions on $S$ in the $\|\cdot\|_\infty$ norm. For support vector machines using universal kernels, he then shows (universal) consistency (Steinwart [129]). Examples of universal kernels are (19) and (20) below.
In Lemma 11 we will show that the $a_0$ term does not affect an SVM. Hence, we infer that it is actually sufficient for consistency to have $a_n > 0$ for $n \geq 1$.
We conclude the section with an example of a kernel which is positive definite by Proposition 6. To this end, let $\mathcal{X}$ be a dot product space. The power series expansion of $\psi(x) = e^x$ then tells us that

(19)   $k(x, x') = e^{\langle x, x' \rangle / \sigma^2}$

is positive definite. Multiplying $k$ with the positive definite kernel $f(x)f(x')$, where $f(x) = e^{-\|x\|^2/(2\sigma^2)}$ and $\sigma > 0$, yields the positive definiteness of the Gaussian kernel

(20)   $k'(x, x') = k(x, x') f(x) f(x') = e^{-\|x - x'\|^2/(2\sigma^2)}$.

Properties of positive definite functions. We now let $\mathcal{X} = \mathbb{R}^d$ and consider kernels of the form

(21)   $k(x, x') = h(x - x')$,
in which case h is called a positive definite function. The following characterization is due to Bochner [21]. We state it in the form given by Wendland
[152].
Theorem 7. A continuous function $h$ on $\mathbb{R}^d$ is positive definite if and only if there exists a finite nonnegative Borel measure $\nu$ on $\mathbb{R}^d$ such that

(22)   $h(x) = \int_{\mathbb{R}^d} e^{-i\langle x, \omega \rangle} \, d\nu(\omega)$.
Polynomial kernels. For $x, x' \in \mathbb{R}^d$ and $p \in \mathbb{N}$, the homogeneous polynomial kernel $k(x, x') = \langle x, x' \rangle^p$ satisfies

(23)   $\langle x, x' \rangle^p = \Bigl(\sum_{j=1}^{d} [x]_j [x']_j\Bigr)^p = \sum_{j \in [d]^p} [x]_{j_1} \cdots [x]_{j_p} \cdot [x']_{j_1} \cdots [x']_{j_p} = \langle C_p(x), C_p(x') \rangle$,

where $C_p$ maps $x \in \mathbb{R}^d$ to the vector $C_p(x)$ whose entries are all possible
shorthand for {1, . . . , d}). The polynomial kernel of degree p thus computes
a dot product in the space spanned by all monomials of degree p in the input
coordinates. Other useful kernels include the inhomogeneous polynomial,

(24)   $k(x, x') = (\langle x, x' \rangle + c)^p$ where $p \in \mathbb{N}$ and $c \geq 0$,

and the B-spline kernel

(25)   $k(x, x') = B_{2p+1}(x - x')$,

where $B_{2p+1}$ denotes a B-spline of odd order.
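The following sketch checks numerically that the homogeneous polynomial kernel (23) agrees with an explicit dot product between the vectors of ordered degree-$p$ monomials; the small dimensions and the helper names are illustrative assumptions.

```python
import itertools
import numpy as np

def poly_kernel(x, z, p):
    """Homogeneous polynomial kernel (23): <x, z>^p."""
    return float(np.dot(x, z)) ** p

def monomial_features(x, p):
    """C_p(x): all ordered products x[j1]*...*x[jp] with (j1,...,jp) in [d]^p."""
    d = len(x)
    return np.array([np.prod([x[j] for j in idx])
                     for idx in itertools.product(range(d), repeat=p)])

x = np.array([1.0, 2.0, -1.0])
z = np.array([0.5, -1.0, 3.0])
p = 3
lhs = poly_kernel(x, z, p)
rhs = float(np.dot(monomial_features(x, p), monomial_features(z, p)))
print(lhs, rhs)   # identical up to floating point error
```

The explicit feature space has $d^p$ dimensions, while the kernel evaluation costs only $O(d)$, which is the point of the kernel trick.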
Convolutions and structures. Let us now move to kernels defined on structured objects (Haussler [62] and Watkins [151]). Suppose the object $x \in \mathcal{X}$ is composed of $x_p \in \mathcal{X}_p$, where $p \in [P]$ (note that the sets $\mathcal{X}_p$ need not be equal). For instance, consider the string $x = \text{ATG}$ and $P = 2$. It is composed of the parts $x_1 = \text{AT}$ and $x_2 = \text{G}$, or alternatively, of $x_1 = \text{A}$ and $x_2 = \text{TG}$. Mathematically speaking, the set of allowed decompositions can be thought of as a relation $R(x_1, \ldots, x_P, x)$, to be read as "$x_1, \ldots, x_P$ constitute the composite object $x$."
Haussler [62] investigated how to define a kernel between composite objects by building on similarity measures that assess their respective parts;
in other words, kernels $k_p$ defined on $\mathcal{X}_p \times \mathcal{X}_p$. Define the $R$-convolution of $k_1, \ldots, k_P$ as

(26)   $[k_1 \star \cdots \star k_P](x, x') := \sum_{\bar{x} \in R(x),\ \bar{x}' \in R(x')} \ \prod_{p=1}^{P} k_p(\bar{x}_p, \bar{x}'_p)$,

where the sum runs over all possible ways $R(x)$ and $R(x')$ in which we can decompose $x$ into $\bar{x}_1, \ldots, \bar{x}_P$ and $x'$ analogously [here we used the convention that an empty sum equals zero, hence, if either $x$ or $x'$ cannot be decomposed, then $(k_1 \star \cdots \star k_P)(x, x') = 0$]. If there is only a finite number of ways, the relation $R$ is called finite. In this case, it can be shown that the $R$-convolution is a valid kernel (Haussler [62]).
ANOVA kernels. Specific examples of convolution kernels are Gaussians
and ANOVA kernels (Vapnik [141] and Wahba [148]). To construct an ANOVA
kernel, we consider $\mathcal{X} = S^N$ for some set $S$, and kernels $k^{(i)}$ on $S \times S$, where $i = 1, \ldots, N$. For $P = 1, \ldots, N$, the ANOVA kernel of order $P$ is defined as

(27)   $k_P(x, x') := \sum_{1 \leq i_1 < \cdots < i_P \leq N} \ \prod_{p=1}^{P} k^{(i_p)}(x_{i_p}, x'_{i_p})$.

Note that if $P = N$, the sum consists only of the term for which $(i_1, \ldots, i_P) = (1, \ldots, N)$, and $k$ equals the tensor product $k^{(1)} \otimes \cdots \otimes k^{(N)}$. At the other extreme, if $P = 1$, then the products collapse to one factor each, and $k$ equals the direct sum $k^{(1)} \oplus \cdots \oplus k^{(N)}$. For intermediate values of $P$, we get kernels that lie in between tensor products and direct sums.
ANOVA kernels typically use some moderate value of $P$, which specifies the order of the interactions between attributes $x_{i_p}$ that we are interested in. The sum then runs over the numerous terms that take into account interactions of order $P$; fortunately, the computational cost can be reduced to $O(Pd)$ by utilizing recurrent procedures for the kernel evaluation.
ANOVA kernels have been shown to work rather well in multi-dimensional
SV regression problems (Stitson et al. [131]).
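As a sketch of how such a recursion for (27) can look in practice, the code below evaluates the ANOVA kernel as the $P$-th elementary symmetric polynomial of the per-coordinate kernel values via the Newton–Girard identities, using a single base kernel applied to every coordinate for simplicity; the function names and the toy data are our assumptions, not notation from the text.

```python
import numpy as np

def anova_kernel(x, xp, base_kernel, P):
    """ANOVA kernel (27): sum over 1 <= i_1 < ... < i_P <= N of products of
    per-coordinate kernel values, computed via the Newton-Girard recursion."""
    kvals = np.array([base_kernel(x[i], xp[i]) for i in range(len(x))])
    power_sums = [np.sum(kvals ** m) for m in range(P + 1)]   # s_0, ..., s_P
    e = [1.0] + [0.0] * P                                     # e_0 = 1
    for p in range(1, P + 1):
        e[p] = sum((-1) ** (m - 1) * e[p - m] * power_sums[m]
                   for m in range(1, p + 1)) / p
    return e[P]

# toy usage with a scalar Gaussian base kernel on each coordinate
k1d = lambda a, b: np.exp(-(a - b) ** 2)
x  = np.array([0.1, 0.4, -0.2, 1.0])
xp = np.array([0.0, 0.5, -0.1, 0.7])
print(anova_kernel(x, xp, k1d, P=2))
```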
Bag of words. One way in which SVMs have been used for text categorization (Joachims [77]) is the bag-of-words representation. This maps a given
text to a sparse vector, where each component corresponds to a word, and
a component is set to one (or some other number) whenever the related
word occurs in the text. Using an efficient sparse representation, the dot
product between two such vectors can be computed quickly. Furthermore,
this dot product is by construction a valid kernel, referred to as a sparse
vector kernel. One of its shortcomings, however, is that it does not take into
account the word ordering of a document. Other sparse vector kernels are
also conceivable, such as one that maps a text to the set of pairs of words
that are in the same sentence (Joachims [77] and Watkins [151]).
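A minimal sketch of the sparse-vector (bag-of-words) kernel, storing term counts in dictionaries; the whitespace tokenization is deliberately naive and all names are illustrative.

```python
from collections import Counter

def bow(text):
    """Map a text to a sparse count vector (word -> frequency)."""
    return Counter(text.lower().split())

def bow_kernel(a, b):
    """Dot product of two sparse bag-of-words vectors."""
    small, large = (a, b) if len(a) < len(b) else (b, a)
    return sum(v * large[w] for w, v in small.items() if w in large)

doc1 = bow("the kernel trick maps data into feature space")
doc2 = bow("feature space methods use the kernel trick")
print(bow_kernel(doc1, doc2))
```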
n-grams and suffix trees. A more sophisticated way of dealing with string
data was proposed by Haussler [62] and Watkins [151]. The basic idea is
as described above for general structured objects (26): Compare the strings
by means of the substrings they contain. The more substrings two strings
have in common, the more similar they are. The substrings need not always
be contiguous; that said, the further apart the first and last element of a
substring are, the less weight should be given to the similarity. Depending
on the specific choice of a similarity measure, it is possible to define more
or less efficient kernels which compute the dot product in the feature space
spanned by all substrings of documents.
Consider a finite alphabet $\Sigma$, the set of all strings of length $n$, $\Sigma^n$, and the set of all finite strings, $\Sigma^* := \bigcup_{n=0}^{\infty} \Sigma^n$. Then

$k(x, x') = \sum_{s \in \Sigma^*} c_s\, \#(x, s)\, \#(x', s)$

is a string kernel computed from exact matches. Here $\#(x, s)$ is the number of occurrences of $s$ in $x$ and $c_s \geq 0$.
Vishwanathan and Smola [146] provide an algorithm using suffix trees, which allows one to compute for arbitrary $c_s$ the value of the kernel $k(x, x')$ in $O(|x| + |x'|)$ time and memory. Moreover, also $f(x) = \langle w, \Phi(x) \rangle$ can be computed in $O(|x|)$ time if preprocessing linear in the size of the support vectors is carried out. These kernels are then applied to function prediction (according to the gene ontology) of proteins using only their sequence information. Another prominent application of string kernels is in the field of splice form prediction and gene finding (Rätsch et al. [112]).
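For intuition, here is a brute-force sketch of the exact-match string kernel above with uniform weights $c_s = 1$ on substrings of a fixed length. Suffix-tree implementations such as the one in Vishwanathan and Smola [146] reach the linear-time complexity quoted in the text; this naive version does not, and all names below are our own.

```python
from collections import Counter

def substring_counts(x, n):
    """#(x, s) for every contiguous substring s of length n occurring in x."""
    return Counter(x[i:i + n] for i in range(len(x) - n + 1))

def spectrum_kernel(x, xprime, n):
    """k(x, x') = sum_s #(x, s) * #(x', s) over length-n substrings (c_s = 1)."""
    cx, cxp = substring_counts(x, n), substring_counts(xprime, n)
    return sum(cnt * cxp[s] for s, cnt in cx.items() if s in cxp)

print(spectrum_kernel("GATTACA", "TACATTA", n=3))
```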
For inexact matches of a limited degree, typically up to $\epsilon = 3$ mismatches, and strings of bounded length, a similar data structure can be built by explicitly generating a dictionary of strings and their neighborhood in terms of a Hamming distance (Leslie et al. [92]). These kernels are defined by replacing $\#(x, s)$ by a mismatch-tolerant count of approximate occurrences of $s$ in $x$.
The gap-weighted subsequence kernel maps a string $s$ to a feature vector $\Phi_n(s)$ indexed by all strings $u$ of length $n$, with

(28)   $[\Phi_n(s)]_u := \sum_{i: s(i) = u} \lambda^{l(i)}$.

Here, $0 < \lambda \leq 1$ is a decay parameter: the larger the length $l(i)$ of the subsequence in $s$, the smaller the respective contribution to $[\Phi_n(s)]_u$. The sum runs over all subsequences of $s$ which equal $u$.
For instance, consider a dimension of $H^3$ spanned (i.e., labeled) by the string asd. In this case we have $[\Phi_3(\text{Nasdaq})]_{asd} = \lambda^3$, while $[\Phi_3(\text{lass das})]_{asd} = 2\lambda^5$. In the first string, asd is a contiguous substring. In the second string, it appears twice as a noncontiguous substring of length 5 in lass das; the two occurrences differ in which of the two s characters of lass is used.
The kernel induced by the map $\Phi_n$ takes the form

(29)   $k_n(s, t) = \sum_{u \in \Sigma^n} [\Phi_n(s)]_u [\Phi_n(t)]_u = \sum_{u \in \Sigma^n} \ \sum_{(i,j): s(i) = t(j) = u} \lambda^{l(i)} \lambda^{l(j)}$.
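The brute-force sketch below computes $[\Phi_n(s)]_u$ directly from (28) by enumerating index tuples and reproduces the Nasdaq / lass das example; it is exponential in the string length and only meant to illustrate the definition (efficient dynamic-programming evaluations of (29) exist).

```python
from itertools import combinations

def phi_u(s, u, lam):
    """[Phi_n(s)]_u from (28): sum of lam^l(i) over index tuples i with s(i) = u."""
    total = 0.0
    for idx in combinations(range(len(s)), len(u)):
        if all(s[i] == c for i, c in zip(idx, u)):
            length = idx[-1] - idx[0] + 1          # l(i): span of the subsequence
            total += lam ** length
    return total

lam = 0.5
print(phi_u("Nasdaq", "asd", lam), lam ** 3)            # lambda^3
print(phi_u("lass das", "asd", lam), 2 * lam ** 5)      # 2 * lambda^5
```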
Kernels on graphs. Given a graph with graph Laplacian $L$, one can define a family of kernels by applying a spectral transform $r$ to $L$, for instance,

(30)   $r(\lambda) = \exp(-\beta\lambda)$   (diffusion kernel),
(31)   $r(\lambda) = (\gamma + \lambda)^{-1}$   (regularized Laplacian),
(32)   $r(\lambda) = (\gamma - \lambda)^{p}$   ($p$-step random walk),

where $\beta, \gamma > 0$ are chosen such as to reflect the amount of diffusion in (30), the degree of regularization in (31) or the weighting of steps within a random walk (32), respectively. Equation (30) was proposed by Kondor and Lafferty
[87]. In Section 2.3.2 we will discuss the connection between regularization
operators and kernels in $\mathbb{R}^n$. Without going into details, the function $r(\lambda)$ describes the smoothness properties on the graph and $L$ plays the role of the Laplace operator.
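As a sketch of the diffusion kernel (30) on a small graph, using the (unnormalized) graph Laplacian and the matrix exponential from SciPy; the example graph, the value of $\beta$ and the variable names are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import expm

# adjacency matrix of a small path graph 0 - 1 - 2 - 3
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(W.sum(axis=1)) - W        # graph Laplacian L = D - W

beta = 0.8
K_diffusion = expm(-beta * L)         # r(lambda) = exp(-beta * lambda) applied to L
print(np.round(K_diffusion, 3))       # symmetric, positive definite kernel matrix
```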
Kernels on sets and subspaces. Whenever each observation xi consists of
a set of instances, we may use a range of methods to capture the specific
properties of these sets (for an overview, see Vishwanathan et al. [147]):
Take the average of the elements of the set in feature space, that is, $\Phi(x_i) = \frac{1}{n}\sum_j \Phi(x_{ij})$. This yields good performance in the area of multi-instance learning.
Jebara and Kondor [75] extend the idea by dealing with distributions $p_i(x)$ such that $\Phi(x_i) = \mathrm{E}[\Phi(x)]$, where $x \sim p_i(x)$. They apply it to image classification with missing pixels.
Alternatively, one can study angles enclosed by subspaces spanned by the observations. In a nutshell, if $U, U'$ denote the orthogonal matrices spanning the subspaces of $x$ and $x'$ respectively, then $k(x, x') = \det U^\top U'$.
Fisher kernels. Jaakkola and Haussler [74] have designed kernels building on probability density models $p(x|\theta)$. Denote by

(33)   $U_\theta(x) := \partial_\theta \log p(x|\theta)$,

(34)   $I := \mathrm{E}_x\bigl[U_\theta(x) U_\theta(x)^\top\bigr]$,

the Fisher scores and the Fisher information matrix, respectively. Note that for maximum likelihood estimators $\mathrm{E}_x[U_\theta(x)] = 0$ and, therefore, $I$ is the covariance of $U_\theta(x)$. The Fisher kernel is defined as

(35)   $k(x, x') := U_\theta(x)^\top I^{-1} U_\theta(x')$.
$f(x) = \sum_{i=1}^{n} \alpha_i k(x_i, x)$.
$\langle f, g \rangle = \int f(x) g(x)\, dx$.
Rather than (39), we consider the equivalent condition (cf. Section 2.2.1)
(41)
$k(x, x') = \int e^{i\langle x - x', \omega \rangle} \kappa(\omega)\, d\omega = \int e^{i\langle x, \omega \rangle} e^{-i\langle x', \omega \rangle} \kappa(\omega)\, d\omega$.
We would like to rewrite this as $\langle \Upsilon k(x, \cdot), \Upsilon k(x', \cdot) \rangle$ for some linear operator $\Upsilon$. It turns out that a multiplication operator in the Fourier domain will do the job. To this end, recall the $d$-dimensional Fourier transform, given by

(44)   $F[f](\omega) := (2\pi)^{-d/2} \int f(x)\, e^{-i\langle x, \omega \rangle}\, dx$,

with its inverse

(45)   $F^{-1}[f](x) = (2\pi)^{-d/2} \int f(\omega)\, e^{i\langle x, \omega \rangle}\, d\omega$.

Applying $F$ to $k(x, \cdot)$ yields a multiple of $\kappa(\omega) e^{-i\langle x, \omega \rangle}$. Choosing $\Upsilon$ to act as multiplication by $\kappa(\omega)^{-1/2}$ in the Fourier domain (up to constant factors), we thus have

(49)   $k(x, x') = \langle \Upsilon k(x, \cdot), \Upsilon k(x', \cdot) \rangle$.
been unaware of the fact that it considered a special case of positive definite
kernels. The latter was initiated by Hilbert [67] and Mercer [99], and was
pursued, for instance, by Schoenberg [115]. Hilbert calls a kernel $k$ definit if

(50)   $\int_a^b \int_a^b k(x, x') f(x) f(x')\, dx\, dx' > 0$

for all nonzero continuous functions $f$, and shows that all eigenvalues of the corresponding integral operator $f \mapsto \int_a^b k(x, \cdot) f(x)\, dx$ are then positive. If $k$ satisfies the condition (50) subject to the constraint that $\int_a^b f(x) g(x)\, dx = 0$,
for some fixed function g, Hilbert calls it relativ definit. For that case, he
shows that k has at most one negative eigenvalue. Note that if f is chosen
to be constant, then this notion is closely related to the one of conditionally
positive definite kernels; see (17). For further historical details, see the review
of Stewart [130] or Berg et al. [18].
3. Convex programming methods for estimation. As we saw, kernels
can be used both for the purpose of describing nonlinear functions subject
to smoothness constraints and for the purpose of computing inner products
in some feature space efficiently. In this section we focus on the latter and
how it allows us to design methods of estimation based on the geometry of
the problems at hand.
Unless stated otherwise, $\mathrm{E}[\cdot]$ denotes the expectation with respect to all random variables of the argument. Subscripts, such as $\mathrm{E}_X[\cdot]$, indicate that the expectation is taken over $X$. We will omit them wherever obvious. Finally, we will refer to $\mathrm{E}_{\mathrm{emp}}[\cdot]$ as the empirical average with respect to an $n$-sample. Given a sample $S := \{(x_1, y_1), \ldots, (x_n, y_n)\} \subseteq \mathcal{X} \times \mathcal{Y}$, we now aim at finding an affine function $f(x) = \langle w, \Phi(x) \rangle + b$ or in some cases a function $f(x, y) = \langle \Phi(x, y), w \rangle$ such that the empirical risk on $S$ is minimized.
In the binary classification case this means that we want to maximize the
agreement between sgn f (x) and y.
Minimization of the empirical risk with respect to (w, b) is NP-hard (Minsky and Papert [101]). In fact, Ben-David et al. [15] show that even approximately minimizing the empirical risk is NP-hard, not only for linear
function classes but also for spheres and other simple geometrical objects.
This means that even if the statistical challenges could be solved, we still
would be confronted with a formidable algorithmic problem.
The indicator function {yf (x) < 0} is discontinuous and even small changes
in f may lead to large changes in both empirical and expected risk. Properties of such functions can be captured by the VC-dimension (Vapnik
and Chervonenkis [142]), that is, the maximum number of observations
which can be labeled in an arbitrary fashion by functions of the class.
Necessary and sufficient conditions for estimation can be stated in these terms (Alon et al. [3]). For linear classifiers, a large margin of separation suggests good generalization, which motivates the (hard margin) support vector machine

(51)   $\min_{w,b} \ \tfrac{1}{2}\|w\|^2$ subject to $y_i(\langle w, x_i \rangle + b) \geq 1$.
Note that $\|w\|^{-1} f(x_i)$ is the distance of the point $x_i$ to the hyperplane $H(w, b) := \{x \mid \langle w, x \rangle + b = 0\}$. The condition $y_i f(x_i) \geq 1$ implies that the margin of separation is at least $2\|w\|^{-1}$. The bound becomes exact if equality is attained for some $y_i = +1$ and $y_j = -1$. Consequently, minimizing $\|w\|$ subject to the constraints maximizes the margin of separation. Equation (51)
is a quadratic program which can be solved efficiently (Fletcher [51]).
Mangasarian [95] devised a similar optimization scheme using $\|w\|_1$ instead of $\|w\|_2$ in the objective function of (51). The result is a linear program. In general, one can show (Smola et al. [124]) that minimizing the $\ell_p$ norm of $w$ leads to maximizing the margin of separation in the $\ell_q$ norm, where $\frac{1}{p} + \frac{1}{q} = 1$. The $\ell_1$ norm leads to sparse approximation schemes (see also Chen et al. [29]), whereas the $\ell_2$ norm can be extended to Hilbert spaces and kernels.
To deal with nonseparable problems, that is, cases when (51) is infeasible,
we need to relax the constraints of the optimization problem. Bennett and
Mangasarian [17] and Cortes and Vapnik [34] impose a linear penalty on the
violation of the large-margin constraints to obtain
(52)   $\min_{w,b,\xi} \ \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i$ subject to $y_i(\langle w, x_i \rangle + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$.
The primal problem (52) is stated in terms of $w$, which may live in a high- or infinite-dimensional feature space when $\Phi$ maps into an RKHS. To address these problems, one may solve the problem in dual space as follows. The Lagrange function of (52) is given by
dual space as follows. The Lagrange function of (52) is given by
L(w, b, , , ) =
2
1
2 kwk
+C
n
X
i=1
(53)
+
n
X
i (1 i yi (hw, xi i + b))
i=1
n
X
i i ,
i=1
n
X
i yi xi = 0 and
i=1
b L =
(54)
n
X
i yi = 0
and
i=1
i L = C i + i = 0.
P
(55)
Q Rnn is the matrix of inner products Qij := yi yj hxi , xj i. Clearly, this can
be extended to feature maps and kernels easily via Kij := yi yj h(xi ), (xj )i =
yi yj k(xi , xj ). Note that w lies in the span of the xi . This is an instance of
the representer theorem (Theorem 9). The KKT conditions (Boser et al.
[23], Cortes and Vapnik [34], Karush [81] and Kuhn and Tucker [88]) require
that at optimality $\alpha_i(y_i f(x_i) - 1) = 0$. This means that only those $x_i$ may appear in the expansion (54) for which $y_i f(x_i) \leq 1$, as otherwise $\alpha_i = 0$. The $x_i$ with $\alpha_i > 0$ are commonly referred to as support vectors.
Note that $\sum_{i=1}^{n} \xi_i$ is an upper bound on the empirical risk, as $y_i f(x_i) \leq 0$ implies $\xi_i \geq 1$ (see also Lemma 10). The number of misclassified points $x_i$
itself depends on the configuration of the data and the value of $C$. Ben-David et al. [15] show that finding even an approximate minimum classification error solution is difficult. That said, it is possible to modify (52) such that a desired target number of observations violates $y_i f(x_i) \geq \rho$ for some $\rho \in \mathbb{R}$ by making the threshold $\rho$ itself a variable of the optimization problem (Schölkopf et al. [120]). This leads to the following optimization problem ($\nu$-SV classification):

(56)   $\min_{w,b,\rho,\xi} \ \tfrac{1}{2}\|w\|^2 - \nu\rho + \frac{1}{n}\sum_{i=1}^{n} \xi_i$ subject to $y_i(\langle w, x_i \rangle + b) \geq \rho - \xi_i$ and $\xi_i \geq 0$.

The corresponding dual is

(57)   $\min_{\alpha} \ \tfrac{1}{2} \alpha^\top Q \alpha$ subject to $\sum_{i=1}^{n} \alpha_i y_i = 0$, $\alpha_i \in [0, 1/n]$ and $\sum_{i=1}^{n} \alpha_i \geq \nu$.
One can show that for every $C$ there exists a $\nu$ such that the solution of (57) is a multiple of the solution of (55). Schölkopf et al. [120] prove that solving (57) for which $\rho > 0$ satisfies the following:
1. $\nu$ is an upper bound on the fraction of margin errors.
2. $\nu$ is a lower bound on the fraction of SVs.
Moreover, under mild conditions, with probability 1, asymptotically, $\nu$ equals both the fraction of SVs and the fraction of errors.
This statement implies that whenever the data are sufficiently well separable (i.e., $\rho > 0$), $\nu$-SV classification finds a solution with a fraction of at most $\nu$ margin errors. Also note that, for $\nu = 1$, all $\alpha_i = 1$, that is, $f$ becomes an affine copy of the Parzen windows classifier (5).
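To illustrate the dual (55), the sketch below trains a kernel SVM without a bias term (dropping $b$ removes the equality constraint $\sum_i \alpha_i y_i = 0$, so only the box constraint remains) using simple projected gradient descent. This simplification, the step size, the kernel and all names are our assumptions, not the algorithms referenced in the text.

```python
import numpy as np

def rbf(X, Z, gamma=0.5):
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def svm_dual_fit(K, y, C=1.0, lr=0.01, iters=2000):
    """Minimize 1/2 a'Qa - 1'a over 0 <= a <= C (bias dropped) by projected gradient."""
    Q = (y[:, None] * y[None, :]) * K
    a = np.zeros(len(y))
    for _ in range(iters):
        grad = Q @ a - 1.0
        a = np.clip(a - lr * grad, 0.0, C)   # project back onto the box [0, C]
    return a

def svm_predict(alpha, y, X_train, X_test):
    return np.sign(rbf(X_test, X_train) @ (alpha * y))

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.5, 1, (30, 2)), rng.normal(1.5, 1, (30, 2))])
y = np.array([-1.0] * 30 + [1.0] * 30)
alpha = svm_dual_fit(rbf(X, X), y, C=1.0)
print("support vectors:", int((alpha > 1e-6).sum()))
print("training accuracy:", (svm_predict(alpha, y, X, X) == y).mean())
```

The points with $\alpha_i > 0$ after training are the support vectors in the sense discussed above.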
3.2. Estimating the support of a density. We now extend the notion of linear separation to that of estimating the support of a density (Schölkopf et al. [117] and Tax and Duin [134]). Denote by $X = \{x_1, \ldots, x_n\} \subseteq \mathcal{X}$ the sample drawn from $\mathrm{P}(x)$. Let $\mathfrak{C}$ be a class of measurable subsets of $\mathcal{X}$ and let $\lambda$ be a real-valued function defined on $\mathfrak{C}$. The quantile function (Einmahl and Mason [47]) with respect to $(\mathrm{P}, \lambda, \mathfrak{C})$ is defined as

(58)   $U(\mu) = \inf\{\lambda(C) \mid \mathrm{P}(C) \geq \mu,\ C \in \mathfrak{C}\}$.
This leads to the following optimization problem:

(59)   $\min_{w,\rho,\xi} \ \tfrac{1}{2}\|w\|^2 - \nu\rho + \frac{1}{n}\sum_{i=1}^{n} \xi_i$ subject to $\langle w, x_i \rangle \geq \rho - \xi_i$ and $\xi_i \geq 0$.
Here, $\nu \in (0, 1]$ plays the same role as in (56), controlling the number of observations $x_i$ for which $f(x_i) \leq \rho$. Since nonzero slack variables $\xi_i$ are penalized in the objective function, if $w$ and $\rho$ solve this problem, then the decision function $f(x)$ will attain or exceed $\rho$ for at least a fraction $1 - \nu$ of the $x_i$ contained in $X$, while the regularization term $\|w\|$ will still be small.
The dual of (59) yields:

(60)   $\min_{\alpha} \ \tfrac{1}{2}\alpha^\top K \alpha$ subject to $\alpha^\top \mathbf{1} = \nu n$ and $\alpha_i \in [0, 1]$.

3.3. Regression estimation. For regression, one minimizes a regularized empirical loss of the form

(61)   $\min_{w,b} \ \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \bar{\psi}(y_i - f(x_i))$,

where, for $\varepsilon$-insensitive SV regression, $\bar{\psi}(\xi) = \max(0, |\xi| - \varepsilon)$; equivalently, one may introduce slack variables and impose the constraints $y_i - f(x_i) \leq \varepsilon + \xi_i$ and $f(x_i) - y_i \leq \varepsilon + \xi_i^*$. Other choices of the loss $\bar{\psi}$ lead to related estimators:
$\bar{\psi}(\xi) = \tfrac{1}{2}\xi^2$ yields penalized least squares (LS) regression (Hoerl and Kennard [68], Morozov [102], Tikhonov [136] and Wahba [148]). The corresponding optimization problem can be minimized by solving a linear system.
For $\bar{\psi}(\xi) = |\xi|$, we obtain the penalized least absolute deviations (LAD) estimator (Bloomfield and Steiger [20]). That is, we obtain a quadratic program to estimate the conditional median.
A combination of LS and LAD loss yields a penalized version of Huber's robust regression (Huber [71] and Smola and Schölkopf [126]). In this case we have $\bar{\psi}(\xi) = \frac{1}{2\sigma}\xi^2$ for $|\xi| \leq \sigma$ and $\bar{\psi}(\xi) = |\xi| - \frac{\sigma}{2}$ for $|\xi| \geq \sigma$.
Note that also quantile regression can be modified to work with kernels (Schölkopf et al. [120]) by using as loss function the pinball loss, that is, $\bar{\psi}(\xi) = (\tau - 1)\xi$ if $\xi < 0$ and $\bar{\psi}(\xi) = \tau\xi$ if $\xi \geq 0$.
All the optimization problems arising from the above five cases are convex
quadratic programs. Their dual resembles that of (61), namely,
(63a)   $\min_{\alpha, \alpha^*} \ \tfrac{1}{2}(\alpha - \alpha^*)^\top K (\alpha - \alpha^*) + \varepsilon\, \mathbf{1}^\top (\alpha + \alpha^*) - y^\top (\alpha - \alpha^*)$

(63b)   subject to $(\alpha - \alpha^*)^\top \mathbf{1} = 0$ and $\alpha_i, \alpha_i^* \in [0, C]$.

Here $K_{ij} = \langle x_i, x_j \rangle$ for linear models and $K_{ij} = k(x_i, x_j)$ if we map $x \to \Phi(x)$. The $\nu$-trick, as described in (56) (Schölkopf et al. [120]), can be extended to regression, allowing one to choose the margin of approximation automatically. In this case (63a) drops the terms in $\varepsilon$. In its place, we add a linear constraint $(\alpha - \alpha^*)^\top \mathbf{1} = \nu n$. Likewise, LAD is obtained from (63) by dropping the terms in $\varepsilon$ without additional constraints. Robust regression leaves (63) unchanged; however, in the definition of $K$ we have an additional term of $\sigma^{-1}$ on the main diagonal. Further details can be found in Schölkopf and Smola [118]. For quantile regression we drop $\varepsilon$ and we obtain different constants $C(1-\tau)$ and $C\tau$ for the constraints on $\alpha$ and $\alpha^*$. We will discuss uniform convergence properties of the empirical risk estimates with respect to various $\bar{\psi}(\xi)$ in Section 3.6.
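For the penalized least squares case above, the solution is indeed obtained from a single linear system: the standard kernel ridge regression estimator $f(x) = \sum_i \alpha_i k(x_i, x)$ with $\alpha = (K + \lambda I)^{-1} y$. The sketch below uses an RBF kernel; the regularization constant and data are illustrative choices.

```python
import numpy as np

def rbf(X, Z, gamma=1.0):
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def kernel_ridge_fit(X, y, lam=0.1, gamma=1.0):
    """Solve (K + lam * I) alpha = y: squared loss plus RKHS-norm penalty."""
    K = rbf(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def kernel_ridge_predict(alpha, X_train, X_test, gamma=1.0):
    return rbf(X_test, X_train, gamma) @ alpha

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, (40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)
alpha = kernel_ridge_fit(X, y)
X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
print(np.round(kernel_ridge_predict(alpha, X, X_test), 3))
```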
3.4. Multicategory classification, ranking and ordinal regression. Many
estimation problems cannot be described by assuming that $\mathcal{Y} = \{\pm 1\}$. In this case it is advantageous to go beyond simple functions $f(x)$ depending on $x$ only. Instead, we can encode a larger degree of information by estimating a function $f(x, y)$ and subsequently obtaining a prediction via $\hat{y}(x) := \arg\max_{y \in \mathcal{Y}} f(x, y)$. In other words, we study problems where $y$ is
obtained as the solution of an optimization problem over f (x, y) and we
wish to find f such that y matches yi as well as possible for relevant inputs
x.
Note that the loss may be more than just a simple 0–1 loss. In the following we denote by $\Delta(y, y')$ the loss incurred by estimating $y'$ instead of $y$. Without loss of generality, we require that $\Delta(y, y) = 0$ and that $\Delta(y, y') \geq 0$ for all $y, y' \in \mathcal{Y}$. Key in our reasoning is the following:

Lemma 10. Let $f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ and assume that $\Delta(y, y') \geq 0$ with $\Delta(y, y) = 0$. Moreover, let $\xi \geq 0$ be such that $f(x, y) - f(x, y') \geq \Delta(y, y') - \xi$ for all $y' \in \mathcal{Y}$. In this case $\xi \geq \Delta(y, \arg\max_{y' \in \mathcal{Y}} f(x, y'))$.
The construction of the estimator was suggested by Taskar et al. [132] and
Tsochantaridis et al. [137], and a special instance of the above lemma is given
by Joachims [78]. While the bound appears quite innocuous, it allows us to
describe a much richer class of estimation problems as a convex program.
To deal with the added complexity, we assume that $f$ is given by $f(x, y) = \langle \Phi(x, y), w \rangle$. Given the possibly nontrivial connection between $x$ and $y$, the use of $\Phi(x, y)$ cannot be avoided. Corresponding kernel functions are given by $k(x, y, x', y') = \langle \Phi(x, y), \Phi(x', y') \rangle$. We have the following optimization problem (Tsochantaridis et al. [137]):

(64)   $\min_{w,\xi} \ \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i$ subject to $\langle w, \Phi(x_i, y_i) - \Phi(x_i, y) \rangle \geq \Delta(y_i, y) - \xi_i$ and $\xi_i \geq 0$ for all $i \in [n]$, $y \in \mathcal{Y}$.
In many applications the loss $\Delta$ is defined not over a single label but over an entire set of labels. Joachims [78] shows that the so-called F1 score
(van Rijsbergen [138]) used in document retrieval and the area under the
ROC curve (Bamber [10]) fall into this category of problems. Moreover,
Joachims [78] derives an $O(n^2)$ method for evaluating the inequality constraint over $\mathcal{Y}$.
Multilabel estimation problems deal with the situation where we want to find the best subset of labels $y \in 2^{[N]}$ which corresponds to some observation $x$. Elisseeff and Weston [48] devise a ranking scheme where $f(x, i) > f(x, j)$ if label $i \in y$ and $j \notin y$. It is a special case of an approach
described next.
Note that (64) is invariant under translations $\Phi(x, y) \mapsto \Phi(x, y) + \Phi_0$ where $\Phi_0$ is constant, as $\Phi(x_i, y_i) - \Phi(x_i, y)$ remains unchanged. In practice, this means that transformations $k(x, y, x', y') \mapsto k(x, y, x', y') + \langle \Phi_0, \Phi(x, y) \rangle + \langle \Phi_0, \Phi(x', y') \rangle + \|\Phi_0\|^2$ do not affect the outcome of the estimation process. Since $\Phi_0$ was arbitrary, we have the following lemma:
Lemma 11. Let $H$ be an RKHS on $\mathcal{X} \times \mathcal{Y}$ with kernel $k$. Moreover, let $g \in H$. Then the function $k(x, y, x', y') + g(x, y) + g(x', y') + \|g\|_H^2$ is a kernel and it yields the same estimates as $k$.
We need a slight extension to deal with general ranking problems. Denote by $\mathcal{Y} = \mathrm{Graph}[N]$ the set of all directed graphs on $N$ vertices which do not contain loops of less than three nodes. Here an edge $(i, j) \in y$ indicates that $i$ is preferred to $j$ with respect to the observation $x$. It is the goal to find some function $f: \mathcal{X} \times [N] \to \mathbb{R}$ which imposes a total order on $[N]$ (for a given $x$) by virtue of the function values $f(x, i)$ such that the total order
and y are in good agreement.
More specifically, Crammer and Singer [36] and Dekel et al. [45] propose a
decomposition algorithm A for the graphs y such that the estimation error
is given by the number of subgraphs of y which are in disagreement with
the total order imposed by f . As an example, multiclass classification can
be viewed as a graph y where the correct label i is at the root of a directed
graph and all incorrect labels are its children. Multilabel classification is
then a bipartite graph where the correct labels only contain outgoing arcs
and the incorrect labels only incoming ones.
This setting leads to a form similar to (64), except for the fact that we now have constraints over each subgraph $G \in A(y)$. We solve

(65)   $\min_{w,\xi} \ \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} |A(y_i)|^{-1} \sum_{G \in A(y_i)} \xi_{iG}$ subject to $\langle w, \Phi(x_i, u) - \Phi(x_i, v) \rangle \geq 1 - \xi_{iG}$ for all $(u, v) \in G$, $G \in A(y_i)$.
That is, we test for all $(u, v) \in G$ whether the ranking imposed by $G \subseteq y_i$ is satisfied.
Finally, ordinal regression problems which perform ranking not over labels y but rather over observations x were studied by Herbrich et al. [65] and
Chapelle and Harchaoui [27] in the context of ordinal regression and conjoint
analysis respectively. In ordinal regression $x$ is preferred to $x'$ if $f(x) > f(x')$ and, hence, one minimizes an optimization problem akin to (64), with constraint $\langle w, \Phi(x_i) - \Phi(x_j) \rangle \geq 1 - \xi_{ij}$. In conjoint analysis the same operation
is carried out for (x, u), where u is the user under consideration. Similar
models were also studied by Basilico and Hofmann [13]. Further models will
be discussed in Section 4, in particular situations where Y is of exponential
size. These models allow one to deal with sequences and more sophisticated
structures.
3.5. Applications of SVM algorithms. When SVMs were first presented,
they initially met with skepticism in the statistical community. Part of the
reason was that, as described, SVMs construct their decision rules in potentially very high-dimensional feature spaces associated with kernels. Although
there was a fair amount of theoretical work addressing this issue (see Section 3.6 below), it was probably to a larger extent the empirical success of
SVMs that paved their way to becoming a standard method of the statistical
toolbox. The first successes of SVMs on practical problems were in handwritten digit recognition, which was the main benchmark task considered in the
Adaptive Systems Department at AT&T Bell Labs where SVMs were developed. Using methods to incorporate transformation invariances, SVMs were
shown to beat the world record on the MNIST benchmark set, at the time
the gold standard in the field (DeCoste and Schölkopf [44]). There has been
a significant number of further computer vision applications of SVMs since
then, including tasks such as object recognition and detection. Nevertheless,
it is probably fair to say that two other fields have been more influential in
spreading the use of SVMs: bioinformatics and natural language processing.
Both of them have generated a spectrum of challenging high-dimensional
problems on which SVMs excel, such as microarray processing tasks and
text categorization. For references, see Joachims [77] and Schölkopf et al. [121].
Many successful applications have been implemented using SV classifiers;
however, the other variants of SVMs have also led to very good results,
including SV regression, SV novelty detection, SVMs for ranking and, more
recently, problems with interdependent labels (McCallum et al. [96] and
Tsochantaridis et al. [137]).
At present there exists a large number of readily available software packages for SVM optimization. For instance, SVMStruct, based on Tsochantaridis et al. [137], solves structured estimation problems. LibSVM is an open source solver which excels on binary problems. The Torch package
contains a number of estimation methods, including SVM solvers. Several
SVM implementations are also available via statistical packages, such as R.
3.6. Margins and uniform convergence bounds. While the algorithms
were motivated by means of their practicality and the fact that 0–1 loss
functions yield hard-to-control estimators, there exists a large body of work
on statistical analysis. We refer to the works of Bartlett and Mendelson [12],
Jordan et al. [80], Koltchinskii [86], Mendelson [98] and Vapnik [141] for
details. In particular, the review of Bousquet et al. [24] provides an excellent summary of the current state of the art. Specifically for the structured
case, recent work by Collins [30] and Taskar et al. [132] deals with explicit
constructions to obtain better scaling behavior in terms of the number of
class labels.
The general strategy of the analysis can be described by the following three steps: first, the discrete loss is upper bounded by some function, such as $\psi(yf(x))$, which can be efficiently minimized [e.g., the soft margin function $\max(0, 1 - yf(x))$ of the previous section satisfies this property]. Second, one proves that the empirical average of the $\psi$-loss is concentrated close to its expectation. This will be achieved by means of Rademacher averages. Third, one shows that under rather general conditions the minimization of the $\psi$-loss is consistent with the minimization of the expected risk. Finally, these bounds are combined to obtain rates of convergence which only depend on the Rademacher average and the approximation properties of the function class under consideration.
4. Statistical models and RKHS. As we have argued so far, the reproducing kernel Hilbert space approach offers many advantages in machine
learning: (i) powerful and flexible models can be defined, (ii) many results
and algorithms for linear models in Euclidean spaces can be generalized
to RKHS, (iii) learning theory assures that effective learning in RKHS is
possible, for instance, by means of regularization.
In this chapter we will show how kernel methods can be utilized in the
context of statistical models. There are several reasons to pursue such an avenue. First of all, in conditional modeling, it is often insufficient to compute a
prediction without assessing confidence and reliability. Second, when dealing
with multiple or structured responses, it is important to model dependencies between responses in addition to the dependence on a set of covariates.
Third, incomplete data, be it due to missing variables, incomplete training
targets or a model structure involving latent variables, needs to be dealt
with in a principled manner. All of these issues can be addressed by using
the RKHS approach to define statistical models and by combining kernels
with statistical approaches such as exponential models, generalized linear
models and Markov networks.
4.1. Exponential models.

4.1.1. Exponential families. An exponential family over a domain $\mathcal{X}$ with sufficient statistics $\phi(x)$ takes the form

(66)   $p(x; \theta) = \exp\bigl(\langle \theta, \phi(x) \rangle - g(\theta)\bigr)$, where $g(\theta) = \log \int_{\mathcal{X}} e^{\langle \theta, \phi(x) \rangle} \, d\nu(x)$

is the log-partition function, finite on the natural parameter space $\Theta := \{\theta \in \mathbb{R}^m : g(\theta) < \infty\}$. The log-partition function generates the cumulants of $\phi(X)$,

$\partial_\theta g(\theta) = \mu(\theta) := \mathrm{E}_\theta[\phi(X)], \qquad \partial_\theta^2 g(\theta) = \mathrm{V}_\theta[\phi(X)]$,

and the maximum likelihood estimate $\hat{\theta}$ solves the moment matching condition

$\mathrm{E}_{\hat{\theta}}[\phi(X)] = \mu(\hat{\theta}) = \frac{1}{n}\sum_{i=1}^{n} \phi(x_i) =: \mathrm{E}_S[\phi(X)]$.
4.1.2. Exponential RKHS models. One can extend the parametric exponential model in (66) by defining a statistical model via an RKHS $H$ with generating kernel $k$. Linear functions $\langle \theta, \phi(\cdot) \rangle$ over $\mathcal{X}$ are replaced with functions $f \in H$, which yields an exponential RKHS model

(69)   $p(x; f) = \exp\bigl(f(x) - g(f)\bigr)$, $\quad f \in H := \overline{\Bigl\{ f : f(\cdot) = \sum_{x \in S} \alpha_x k(\cdot, x),\ S \subseteq \mathcal{X},\ |S| < \infty \Bigr\}}$.

A conditional variant is obtained by replacing $g(f)$ with the conditional log-partition function, which yields

(70)   $p(y|x; f) = \exp\bigl(f(x, y) - g(x, f)\bigr)$, $\quad g(x, f) := \log \int_{\mathcal{Y}} e^{f(x, y)} \, d\nu(y)$.
32
T. HOFMANN, B. SCHOLKOPF
AND A. J. SMOLA
n
1X
ln p(yi |xi ; f ),
n i=1
For the parametric case, Lafferty et al. [90] have employed variants of
improved iterative scaling (Darroch and Ratcliff [40] and Della Pietra
[46]) to optimize equation (72), whereas Sha and Pereira [122] have investigated preconditioned conjugate gradient descent and limited memory
quasi-Newton methods.
In order to optimize equation (72) one usually needs to compute expectations of the canonical statistics $\mathrm{E}_f[\phi(Y, x)]$ at sample points $x = x_i$, which
requires the availability of efficient inference algorithms.
As we have seen in the case of classification and regression, likelihood-based criteria are by no means the only justifiable choice and large margin methods offer an interesting alternative. To that extent, we will present a general formulation of large margin methods for response variables over finite sample spaces that is based on the approach suggested by Altun et al. [6] and Taskar et al. [132]. Define
[6] and Taskar et al. [132]. Define
r(x, y; f ) := f (x, y) max
f (x, y ) = min
log
y 6=y
(73)
y 6=y
p(y|x; f )
p(y |x; f )
and
provided the latter is feasible, that is, if there exists f H such that r(S; f ) >
To make the connection to SVMs, consider the case of binary classification with $\Phi(x, y) = y\Phi(x)$ and $f(x, y; w) = \langle w, y\Phi(x) \rangle$, where $r(x, y; f) = \langle w, y\Phi(x) \rangle - \langle w, -y\Phi(x) \rangle = 2y\langle w, \Phi(x) \rangle = 2\gamma(x, y; w)$. The latter is twice the standard margin for binary classification in SVMs.
A soft margin version can be defined based on the Hinge loss as follows:

(75)   $C^{hl}(f; S) := \frac{1}{n}\sum_{i=1}^{n} \max\{1 - r(x_i, y_i; f),\, 0\}$.
An alternative is the exponential-loss estimator

$\hat{f}^{\exp}(S) := \arg\min_{w} \ \frac{1}{n}\sum_{i=1}^{n} \sum_{y \neq y_i} \exp\bigl[f(x_i, y; w) - f(x_i, y_i; w)\bigr]$.

By the representer theorem, the minimizer of the regularized problems above admits a kernel expansion over the augmented sample,

(77)   $\hat{f}(\cdot) = \sum_{i=1}^{n} \sum_{y \in \mathcal{Y}} \alpha_{iy}\, k\bigl(\cdot, (x_i, y)\bigr)$.
Substituting (77) into the soft margin problem yields a dual quadratic program over the expansion coefficients,

$\hat{\alpha} = \arg\min_{\alpha} \ \frac{1}{2} \sum_{i,\, y \neq y_i} \sum_{j,\, \bar{y} \neq y_j} \alpha_{iy}\, \alpha_{j\bar{y}}\, K_{iy, j\bar{y}} \ - \ \sum_{i=1}^{n} \sum_{y \neq y_i} \alpha_{iy}$

subject to $\sum_{y \neq y_i} \alpha_{iy} \leq 1$ for all $i \in [n]$, and $\alpha_{iy} \geq 0$ for all $i \in [n]$, $y \in \mathcal{Y}$. In the multiclass case with $\Phi(x, y) = \Phi(x) \otimes e_y$ one obtains

$K_{iy, j\bar{y}} = k(x_i, x_j)\,\bigl[\delta_{y_i, y_j} - \delta_{y_i, \bar{y}} - \delta_{y, y_j} + \delta_{y, \bar{y}}\bigr]$.
The pairs $(x_i, y)$ for which $\alpha_{iy} > 0$ are the support pairs, generalizing the notion of support vectors. As in binary SVMs, their number can be much smaller than the total number of constraints. Notice also that in the final expansion contributions $k(\cdot, (x_i, y_i))$ will get nonnegative weights, whereas $k(\cdot, (x_i, y))$ for $y \neq y_i$ will get nonpositive weights. Overall one gets a balance equation $\alpha_{iy_i} - \sum_{y \neq y_i} \alpha_{iy} = 0$ for every data point.
In practice one pursues a constraint (column) generation strategy: for every training pair one computes the strongest competitor

(79)   $\hat{y}_i := \arg\max_{y \neq y_i} f(x_i, y)$

and then strengthens the current relaxation by including $\alpha_{i\hat{y}_i}$ in the optimization of the dual if $f(x_i, y_i) - f(x_i, \hat{y}_i) < 1 - \xi_i - \epsilon$. Here $\epsilon > 0$ is a pre-defined tolerance parameter. It is important to understand how many strengthening steps are necessary to achieve a reasonably close approximation to the original problem. The following theorem provides an answer:
Theorem 15 (Tsochantaridis et al. [137]). Let $\bar{R} = \max_{i,y} K_{iy,iy}$ and choose $\epsilon > 0$. A sequential strengthening procedure, which optimizes equation (75) by greedily selecting $\epsilon$-violated constraints, will find an approximate solution where all constraints are fulfilled within a precision of $\epsilon$, that is, $r(x_i, y_i; f) \geq 1 - \xi_i - \epsilon$, after at most $\frac{2n}{\epsilon}\max\bigl\{1, \frac{4\bar{R}^2}{\epsilon n^2}\bigr\}$ steps.
Corollary 16. Denote by $(\hat{f}, \hat{\xi})$ the optimal solution of a relaxation of the problem in Proposition 14, minimizing $R(f, \xi, S)$ while violating no constraint by more than $\epsilon$ (cf. Theorem 15). Then

$R(\hat{f}, \hat{\xi}, S) \leq R(f^{sm}, \xi^{sm}, S) \leq R(\hat{f}, \hat{\xi}, S) + \epsilon$,

where $(f^{sm}, \xi^{sm})$ is the optimal solution of the original problem.
Combined with an efficient QP solver, the above theorem guarantees a runtime polynomial in $n$, $1/\epsilon$, $\bar{R}$ and $1/\lambda$. This holds irrespective of special properties of the data set utilized, the only exception being that the dependency on the sample points $x_i$ is through the radius $\bar{R}$.
The remaining key problem is how to compute equation (79) efficiently. The answer depends on the specific form of the joint kernel $k$ and/or the feature map $\Phi$. In many cases, efficient dynamic programming techniques exist, whereas in other cases one has to resort to approximations or use other methods to identify a set of candidate distractors $Y_i \subseteq \mathcal{Y}$ for a training pair $(x_i, y_i)$ (Collins [30]). Sometimes one may also have search heuristics available that may not find the solution to equation (79), but that find (other) $\epsilon$-violating constraints with a reasonable computational effort.
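The pseudocode-style sketch below shows the shape of the sequential strengthening (cutting-plane) loop for a simple linear multiclass model: a brute-force separation oracle finds the most violated constraint for each example, and a few perceptron-style updates on the working set stand in for re-solving the dual QP. All function names, the toy data and the stand-in solver are our assumptions; they are not the exact algorithm of Tsochantaridis et al.

```python
import numpy as np

def delta(y, y_pred):
    """0-1 loss between labels."""
    return float(y != y_pred)

def most_violated_label(W, x, y_true, labels):
    """argmax_y [Delta(y_true, y) + <w_y, x> - <w_{y_true}, x>] (brute force)."""
    scores = [delta(y_true, y) + W[y] @ x - W[y_true] @ x for y in labels]
    return labels[int(np.argmax(scores))], max(scores)

def cutting_plane_train(X, y, n_labels, eps=1e-3, epochs=50, lr=0.1):
    W = np.zeros((n_labels, X.shape[1]))
    labels = list(range(n_labels))
    working_set = set()
    for _ in range(epochs):
        added = 0
        for i, (xi, yi) in enumerate(zip(X, y)):
            ybar, violation = most_violated_label(W, xi, yi, labels)
            if violation > eps and (i, ybar) not in working_set:
                working_set.add((i, ybar))       # strengthen the relaxation
                added += 1
        # stand-in for re-solving the QP: perceptron-style passes over the working set
        for _ in range(10):
            for i, ybar in working_set:
                xi, yi = X[i], y[i]
                if delta(yi, ybar) + W[ybar] @ xi - W[yi] @ xi > 0:
                    W[yi] += lr * xi
                    W[ybar] -= lr * xi
        if added == 0:                           # no epsilon-violated constraints left
            break
    return W

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(m, 0.5, (20, 2)) for m in ([-2, 0], [0, 2], [2, 0])])
y = np.array([0] * 20 + [1] * 20 + [2] * 20)
W = cutting_plane_train(X, y, n_labels=3)
print("train accuracy:", (np.argmax(X @ W.T, axis=1) == y).mean())
```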
4.1.7. Generalized Gaussian processes classification. The model equation (70) and the minimization of the regularized log-loss can be interpreted as a generalization of Gaussian process classification (Altun et al. [4] and Rasmussen and Williams [111]) by assuming that $(f(x, \cdot))_{x \in \mathcal{X}}$ is a vector-valued zero mean Gaussian process; note that the covariance function $C$ is defined over pairs $\mathcal{X} \times \mathcal{Y}$. For a given sample $S$, define a multi-index vector $F(S) := (f(x_i, y))_{i,y}$ as the restriction of the stochastic process $f$ to the augmented sample. Denote the kernel matrix by $K = (K_{iy,jy'})$, where $K_{iy,jy'} := C((x_i, y), (x_j, y'))$ with indices $i, j \in [n]$ and $y, y' \in \mathcal{Y}$, so that, in summary, $F(S) \sim \mathcal{N}(0, K)$. This induces a predictive model via Bayesian model integration according to
(80)   $p(y|x; S) = \int p\bigl(y \mid F(x, \cdot)\bigr)\, p(F \mid S)\, dF$,

where $x$ is a test point that has been included in the sample (transductive
setting). For an i.i.d. sample, the log-posterior for $F$ can be written as

(81)   $\ln p(F|S) = -\tfrac{1}{2} F^\top K^{-1} F + \sum_{i=1}^{n} \ln p(y_i \mid x_i; f) + \mathrm{const.}$
Invoking the representer theorem for $\hat{F}(S) := \arg\max_F \ln p(F|S)$, we know that

(82)   $\hat{F}(S)_{iy} = \sum_{j=1}^{n} \sum_{y' \in \mathcal{Y}} \alpha_{jy'} K_{iy, jy'}$.
Substituting (82) into (81) and changing signs leads to the finite-dimensional problem

(83)   $\min_{\alpha} \ \tfrac{1}{2}\alpha^\top K \alpha - \sum_{i=1}^{n} \Bigl( \alpha^\top K e_{iy_i} - \log \sum_{y \in \mathcal{Y}} \exp\bigl[\alpha^\top K e_{iy}\bigr] \Bigr)$,
where $e_{iy}$ denotes the respective unit vector. Notice that for $f(\cdot) = \sum_{i,y} \alpha_{iy} k(\cdot, (x_i, y))$ the first term is equivalent to the squared RKHS norm of $f \in H$ since $\langle f, f \rangle_H = \sum_{i,j} \sum_{y,y'} \alpha_{iy} \alpha_{jy'} \langle k(\cdot, (x_i, y)), k(\cdot, (x_j, y')) \rangle$. The latter inner product reduces to $k((x_i, y), (x_j, y'))$ due to the reproducing property. Again, the key issue in solving (83) is how to achieve sparseness in the expansion for $\hat{F}$.
4.2. Markov networks. Markov networks (conditional independence graphs) provide a way to specify structured distributions. By the Hammersley–Clifford theorem, a (strictly positive) distribution over variables $Z$ that is consistent with a conditional independence graph $G$ factors over the cliques $\mathcal{C}(G)$ of the graph as

(84)   $p(z) = \exp\Bigl[\sum_{c \in \mathcal{C}(G)} f_c(z_c)\Bigr]$,

where $f_c$ are clique compatibility functions dependent on $z$ only via the restriction on clique configurations $z_c$.
The significance of this result is that in order to specify a distribution for
Z, one only needs to specify or estimate the simpler functions fc .
4.2.2. Kernel decomposition over Markov networks. It is of interest to
analyze the structure of kernels k that generate Hilbert spaces H of functions
that are consistent with a graph.
Definition 19. A function $f: \mathcal{Z} \to \mathbb{R}$ is compatible with a conditional independence graph $G$, if $f$ decomposes additively as $f(z) = \sum_{c \in \mathcal{C}(G)} f_c(z_c)$ with suitably chosen functions $f_c$. A Hilbert space $H$ over $\mathcal{Z}$ is compatible with $G$, if every function $f \in H$ is compatible with $G$. Such $f$ and $H$ are also called $G$-compatible.

Proposition 20. Let $H$ with kernel $k$ be a $G$-compatible RKHS. Then there are functions $k_{cd}: \mathcal{Z}_c \times \mathcal{Z}_d \to \mathbb{R}$ such that the kernel decomposes as

$k(u, z) = \sum_{c, d \in \mathcal{C}} k_{cd}(u_c, z_d)$.
Proposition 20 is useful for the design of kernels, since it states that only
kernels allowing an additive decomposition into local functions $k_{cd}$ are compatible with a given Markov network $G$. Lafferty et al. [89] have pursued a similar approach by considering kernels for RKHS with functions defined over $\mathcal{Z}_{\mathcal{C}} := \{(c, z_c) : c \in \mathcal{C},\ z_c \in \mathcal{Z}_c\}$. In the latter case one can even deal with cases where the conditional dependency graph is (potentially) different for every instance.
An illuminating example of how to design kernels via the decomposition in Proposition 20 is the case of conditional Markov chains, for which
models based on joint kernels have been proposed in Altun et al. [6],
Collins [30], Lafferty et al. [90] and Taskar et al. [132]. Given an input
sequence $X = (X_t)_{t \in [T]}$, the goal is to predict a sequence of labels or class variables $Y = (Y_t)_{t \in [T]}$, where each $Y_t$ takes values in a finite label set. Dependencies between class variables are modeled in terms of a Markov chain, whereas outputs $Y_t$ are assumed to depend (directly) on an observation window $(X_{t-r}, \ldots, X_t, \ldots, X_{t+r})$. Notice that this goes beyond the standard hidden Markov model structure by allowing for overlapping features ($r \geq 1$). For simplicity, we focus on a window size of $r = 1$, in which case the clique set is given by $\mathcal{C} := \{c_t := (x_t, y_t, y_{t+1}),\ \bar{c}_t := (x_{t+1}, y_t, y_{t+1}) : t \in [T-1]\}$. We assume an input kernel $k$ is given and introduce indicator vectors (or dummy variates) $I(Y_{\{t,t+1\}})$ over the joint label configurations. Now we can define the local kernel functions as

(85)   $k_{cd}\bigl((x_c, y_c), (x'_d, y'_d)\bigr) := \bigl\langle I(y_{\{s,s+1\}}), I(y'_{\{t,t+1\}}) \bigr\rangle \cdot \begin{cases} k(x_s, x_t), & \text{if } c = c_s \text{ and } d = c_t, \\ k(x_{s+1}, x_{t+1}), & \text{if } c = \bar{c}_s \text{ and } d = \bar{c}_t. \end{cases}$
Notice that the inner product between indicator vectors is zero, unless the
variable pairs are in the same configuration.
Conditional Markov chain models have found widespread applications
in natural language processing (e.g., for part of speech tagging and shallow
parsing, cf. Sha and Pereira [122]), in information retrieval (e.g., for information extraction, cf. McCallum et al. [96]) or in computational biology
(e.g., for gene prediction, cf. Culotta et al. [39]).
4.2.3. Clique-based sparse approximation. Proposition 20 immediately leads to an alternative version of the representer theorem as observed by Lafferty et al. [89] and Altun et al. [4].

Corollary 22. If $H$ is $G$-compatible then, in the same setting as in Corollary 13, the optimizer $\hat{f}$ can be written as

(86)   $\hat{f}(u) = \sum_{i=1}^{n} \sum_{c \in \mathcal{C}} \sum_{y_c \in \mathcal{Y}_c} \alpha^i_{c, y_c} \sum_{d \in \mathcal{C}} k_{cd}\bigl((x^i_c, y_c), u_d\bigr)$;
here $x^i_c$ are the variables of $x^i$ belonging to clique $c$ and $\mathcal{Y}_c$ is the subspace of $\mathcal{Z}_c$ that contains response variables.
Notice that the number of parameters in the representation equation (86) scales with $n \cdot \sum_{c \in \mathcal{C}} |\mathcal{Y}_c|$ as opposed to $n \cdot |\mathcal{Y}|$ in equation (77). For cliques with reasonably small state spaces, this will be a significantly more compact representation. Notice also that the evaluation of functions $k_{cd}$ will typically be more efficient than evaluating $k$.
In spite of this improvement, the number of terms in the expansion in
equation (86) may in practice still be too large. In this case, one can pursue
a reduced set approach, which selects a subset of variables to be included
in a sparsified expansion. This has been proposed in Taskar et al. [132] for
the soft margin maximization problem, as well as in Altun et al. [5] and
Lafferty et al. [89] for conditional random fields and Gaussian processes.
For instance, in Lafferty et al. [89] parameters $\alpha^i_{c, y_c}$ that maximize the functional gradient of the regularized log-loss are greedily included in the
reduced set. In Taskar et al. [132] a similar selection criterion is utilized
with respect to margin violations, leading to an SMO-like optimization
algorithm (Platt [107]).
4.2.4. Probabilistic inference. In dealing with structured or interdependent response variables, computing marginal probabilities of interest or computing the most probable response [cf. equation (79)] may be nontrivial.
However, for dependency graphs with small tree width, efficient inference
algorithms exist, such as the junction tree algorithm (Dawid [43] and Jensen
et al. [76]) and variants thereof. Notice that in the case of the conditional
or hidden Markov chain, the junction tree algorithm is equivalent to the
well-known forward–backward algorithm (Baum [14]). Recently, a number
of approximate inference algorithms have been developed to deal with dependency graphs for which exact inference is not tractable (see, e.g., Wainwright
and Jordan [150]).
5. Kernel methods for unsupervised learning. This section discusses various methods of data analysis by modeling the distribution of data in feature space. To that extent, we study the behavior of $\Phi(x)$ by means of rather simple linear methods, which have implications for nonlinear methods on the original data space $\mathcal{X}$. In particular, we will discuss the extension of PCA to
Hilbert spaces, which allows for image denoising, clustering, and nonlinear
dimensionality reduction, the study of covariance operators for the measure
of independence, the study of mean operators for the design of two-sample
tests, and the modeling of complex dependencies between sets of random
variables via kernel dependency estimation and canonical correlation analysis.
5.1. Kernel principal component analysis. Principal component analysis (PCA) is a powerful technique for extracting structure from possibly
high-dimensional data sets. It is readily performed by solving an eigenvalue
problem, or by using iterative algorithms which estimate principal components.
PCA is an orthogonal transformation of the coordinate system in which
we describe our data. The new coordinate system is obtained by projection
onto the so-called principal axes of the data. A small number of principal
components is often sufficient to account for most of the structure in the
data.
The basic idea is strikingly simple: denote by $X = \{x_1, \ldots, x_n\}$ an $n$-sample drawn from $\mathrm{P}(x)$. Then the covariance operator $C$ is given by $C = \mathrm{E}[(x - \mathrm{E}[x])(x - \mathrm{E}[x])^\top]$. PCA aims at estimating leading eigenvectors of $C$ via the empirical estimate $C_{\mathrm{emp}} = \mathrm{E}_{\mathrm{emp}}[(x - \mathrm{E}_{\mathrm{emp}}[x])(x - \mathrm{E}_{\mathrm{emp}}[x])^\top]$. If $X$ is $d$-dimensional, then the eigenvectors can be computed in $O(d^3)$ time (Press et al. [110]).
The problem can also be posed in feature space (Schölkopf et al. [119]) by replacing $x$ with $\Phi(x)$. In this case, however, it is impossible to compute the eigenvectors directly. Yet, note that the image of $C_{\mathrm{emp}}$ lies in the span of $\{\Phi(x_1), \ldots, \Phi(x_n)\}$. Hence, it is sufficient to diagonalize $C_{\mathrm{emp}}$ in that subspace. In other words, we replace the outer product $C_{\mathrm{emp}}$ by an inner product matrix, leaving the eigenvalues unchanged, which can be computed efficiently. Using $w = \sum_{i=1}^{n} \alpha_i \Phi(x_i)$, it follows that $\alpha$ needs to satisfy $P K P \alpha = \lambda \alpha$, where $P$ is the centering projection operator with $P_{ij} = \delta_{ij} - n^{-1}$ and $K$ is the kernel matrix on $X$.
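A compact sketch of kernel PCA as just described: center the kernel matrix with the projection $P$, diagonalize, and use the eigenvector coefficients $\alpha$ to obtain projections. The kernel, data and normalization details are illustrative assumptions.

```python
import numpy as np

def rbf(X, Z, gamma=0.5):
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def kernel_pca(X, n_components=2, gamma=0.5):
    n = len(X)
    K = rbf(X, X, gamma)
    P = np.eye(n) - np.ones((n, n)) / n          # centering projection
    Kc = P @ K @ P
    eigvals, eigvecs = np.linalg.eigh(Kc)        # ascending order
    idx = np.argsort(eigvals)[::-1][:n_components]
    lam, A = eigvals[idx], eigvecs[:, idx]
    A = A / np.sqrt(np.maximum(lam, 1e-12))      # normalize so each w has unit norm
    return A, K, P

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 3))
A, K, P = kernel_pca(X)
projections = (P @ K @ P) @ A                    # feature-space coordinates of the sample
print(projections.shape)                         # (50, 2)
```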
Note that the problem can also be recovered as one of maximizing some Contrast[f, X] subject to $f \in \mathcal{F}$. This means that the projections onto the leading eigenvectors correspond to the most reliable features. This optimization problem also allows us to unify various feature extraction methods as follows:
For Contrast[f, X] = Var$_{\mathrm{emp}}$[f, X] and $\mathcal{F} = \{\langle w, x \rangle$ subject to $\|w\| \leq 1\}$, we recover PCA.
Changing $\mathcal{F}$ to $\mathcal{F} = \{\langle w, \Phi(x) \rangle$ subject to $\|w\| \leq 1\}$, we recover kernel PCA.
For Contrast[f, X] = Kurtosis[f, X] and $\mathcal{F} = \{\langle w, x \rangle$ subject to $\|w\| \leq 1\}$, we have Projection Pursuit (Friedman and Tukey [55] and Huber [72]). Other contrasts lead to further variants, that is, the Epanechnikov kernel, entropic contrasts, and so on (Cook et al. [32], Friedman [54] and Jones and Sibson [79]).
If $\mathcal{F}$ is a convex combination of basis functions and the contrast function is convex in $w$, one obtains computationally efficient algorithms, as the solution of the optimization problem can be found at one of the vertices (Rockafellar [114] and Schölkopf and Smola [118]).
Subsequent projections are obtained, for example, by seeking directions orthogonal to f or other computationally attractive variants thereof.
Kernel PCA has been applied to numerous problems, from preprocessing and invariant feature extraction (Mika et al. [100]) to image denoising and super-resolution (Kim et al. [84]). The basic idea in the latter case is to obtain a set of principal directions in feature space $w_1, \ldots, w_l$, obtained from noise-free data, and to project the image $\Phi(x)$ of a noisy observation $x$ onto the span of $w_1, \ldots, w_l$, from which a denoised pre-image in input space is then reconstructed.

5.2. Canonical correlation and measures of independence. Given two samples $X$ and $Y$, one may measure their dependence by the maximal correlation

$\rho(X, Y, \mathcal{F}, \mathcal{G}) = \sup_{f, g} \ \frac{\operatorname{Cov}[f(x), g(y)]}{\bigl(\operatorname{Var}[f(x)]\operatorname{Var}[g(y)]\bigr)^{1/2}}$ subject to $f \in \mathcal{F}$ and $g \in \mathcal{G}$.
This statistic is often extended to use the entire series $\lambda_1, \ldots, \lambda_d$ of maximal correlations where each of the function pairs $(f_i, g_i)$ is orthogonal to the previous set of terms. More specifically, Dauxois and Nkiet [42] restrict $\mathcal{F}, \mathcal{G}$ to finite-dimensional linear function classes subject to their $L_2$ norm bounded by 1; Bach and Jordan [8] use functions in the RKHS for which some sum of the $\ell_2$ and the RKHS norm on the sample is bounded.
Gretton et al. [58] use functions with bounded RKHS norm only, which
provides necessary and sufficient criteria if kernels are universal. That is,
(X, Y, F, G) = 0 if and only if x and y are independent. Moreover,
tr P Kx P Ky P has the same theoretical properties and it can be computed
much more easily in linear time, as it allows for incomplete Cholesky factorizations. Here Kx and Ky are the kernel matrices on X and Y respectively.
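As a small illustration of this trace criterion (our sketch, not the estimators of [8] or [58]; the naive version below forms the full matrices, whereas the linear-time variant would rely on an incomplete Cholesky factorization, which we omit), tr(P Kx P Ky P) can be computed directly from the two kernel matrices.

import numpy as np

def trace_dependence(Kx, Ky):
    # tr(P Kx P Ky P) with P = I - (1/n) 1 1^T; values near 0 suggest independence
    n = Kx.shape[0]
    P = np.eye(n) - np.ones((n, n)) / n
    return np.trace(P @ Kx @ P @ Ky @ P)

Because P is idempotent, this equals the matrix inner product of the two centered kernel matrices P Kx P and P Ky P.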
The above criteria can be used to derive algorithms for Independent
Component Analysis (Bach and Jordan [8] and Gretton et al. [58]). While
these algorithms come at a considerable computational cost, they offer very
good performance. For faster algorithms, consider the work of Cardoso [26],
Hyvärinen [73] and Lee et al. [91]. Also, the work of Chen and Bickel [28]
and Yang and Amari [155] is of interest in this context.
Note that a similar approach can be used to develop two-sample tests based on kernel methods. The basic idea is that for universal kernels the map between distributions and points on the marginal polytope, μ : p ↦ Ex∼p[Φ(x)], is bijective and, consequently, it induces a norm on distributions. This builds on the ideas of [52]. The corresponding distance d(p, q) := ‖μ[p] − μ[q]‖ leads to a U-statistic which allows one to compute empirical estimates of distances between distributions efficiently [22].
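For concreteness, here is a sketch (ours; the Gaussian kernel and the names are illustrative assumptions) of the biased empirical estimate of d(p, q)² between two samples; the U-statistic of [22] differs only in that it omits the diagonal terms of the within-sample averages.

import numpy as np

def mmd2(X, Y, gamma=1.0):
    # Biased estimate of ||mu[p] - mu[q]||^2 for a Gaussian RBF kernel
    def k(A, B):
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

X = np.random.randn(200, 3)          # sample from p
Y = np.random.randn(200, 3) + 0.5    # sample from a shifted q
print(mmd2(X, Y))                    # clearly positive here; close to 0 when p = q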
5.3. Kernel dependency estimation. A large part of the previous discussion revolved around estimating dependencies between samples X and Y for
rather structured spaces Y, in particular, (64). In general, however, such
dependencies can be hard to compute. Weston et al. [153] proposed an algorithm which allows one to extend standard regularized LS regression models,
as described in Section 3.3, to cases where Y has complex structure.
It works by recasting the estimation problem as a linear estimation problem for the map f : Φ(x) → Φ(y), and then as a nonlinear pre-image estimation problem for finding ỹ := argminy ‖f(x) − Φ(y)‖ as the point in Y closest to f(x).
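A toy rendering of these two stages (ours, not the algorithm of Weston et al. [153]; we use a regularized least squares map as in Section 3.3 and restrict the pre-image search to the outputs seen during training):

import numpy as np

def kde_fit(Kx, lam=1e-2):
    # Linear stage: f(Phi(x)) = sum_j beta_j(x) Phi(y_j) with beta(x) = B @ kx(x)
    n = Kx.shape[0]
    return np.linalg.solve(Kx + lam * np.eye(n), np.eye(n))   # B = (Kx + lam I)^{-1}

def kde_predict(B, kx_new, Ky):
    # Pre-image stage: kx_new holds k(x_new, x_i); pick the training output y_c
    # minimizing ||f(Phi(x_new)) - Phi(y_c)||^2 = Ky[c, c] - 2 (Ky beta)_c + const
    beta = B @ kx_new
    return int(np.argmin(np.diag(Ky) - 2 * Ky @ beta))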
This problem can be solved directly (Cortes et al. [33]) without the need
for subspace projections. The authors apply it to the analysis of sequence
data.
6. Conclusion. We have summarized some of the advances in the field of machine learning with positive definite kernels. Due to lack of space, this article is by no means comprehensive; in particular, we were not able to
REFERENCES
[3] Alon, N., Ben-David, S., Cesa-Bianchi, N. and Haussler, D. (1993). Scale-sensitive dimensions, uniform convergence, and learnability. In Proc. of the 34th Annual Symposium on Foundations of Computer Science 292–301. IEEE Computer Society Press, Los Alamitos, CA. MR1328428
[4] Altun, Y., Hofmann, T. and Smola, A. J. (2004). Gaussian process classification for segmenting and annotating sequences. In Proc. International Conf. Machine Learning 25–32. ACM Press, New York.
[5] Altun, Y., Smola, A. J. and Hofmann, T. (2004). Exponential families for conditional random fields. In Uncertainty in Artificial Intelligence (UAI) 2–9. AUAI Press, Arlington, VA.
[6] Altun, Y., Tsochantaridis, I. and Hofmann, T. (2003). Hidden Markov support vector machines. In Proc. Intl. Conf. Machine Learning 3–10. AAAI Press, Menlo Park, CA.
[7] Aronszajn, N. (1950). Theory of reproducing kernels. Trans. Amer. Math. Soc. 68 337–404. MR0051437
[8] Bach, F. R. and Jordan, M. I. (2002). Kernel independent component analysis. J. Mach. Learn. Res. 3 1–48. MR1966051
[9] Bakir, G., Hofmann, T., Schölkopf, B., Smola, A., Taskar, B. and Vishwanathan, S. V. N. (2007). Predicting Structured Data. MIT Press, Cambridge, MA.
[10] Bamber, D. (1975). The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. J. Math. Psych. 12 387–415. MR0384214
[11] Barndorff-Nielsen, O. E. (1978). Information and Exponential Families in Statistical Theory. Wiley, New York. MR0489333
[12] Bartlett, P. L. and Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res. 3 463–482. MR1984026
[13] Basilico, J. and Hofmann, T. (2004). Unifying collaborative and content-based filtering. In Proc. Intl. Conf. Machine Learning 65–72. ACM Press, New York.
[14] Baum, L. E. (1972). An inequality and associated maximization technique in statistical estimation of probabilistic functions of a Markov process. Inequalities 3 1–8. MR0341782
[15] Ben-David, S., Eiron, N. and Long, P. (2003). On the difficulty of approximately maximizing agreements. J. Comput. System Sci. 66 496–514. MR1981222
[16] Bennett, K. P., Demiriz, A. and Shawe-Taylor, J. (2000). A column generation algorithm for boosting. In Proc. 17th International Conf. Machine Learning (P. Langley, ed.) 65–72. Morgan Kaufmann, San Francisco, CA.
[17] Bennett, K. P. and Mangasarian, O. L. (1992). Robust linear programming discrimination of two linearly inseparable sets. Optim. Methods Softw. 1 23–34.
[18] Berg, C., Christensen, J. P. R. and Ressel, P. (1984). Harmonic Analysis on Semigroups. Springer, New York. MR0747302
[19] Bertsimas, D. and Tsitsiklis, J. (1997). Introduction to Linear Optimization. Athena Scientific, Nashua, NH.
[20] Bloomfield, P. and Steiger, W. (1983). Least Absolute Deviations: Theory, Applications and Algorithms. Birkhäuser, Boston. MR0748483
[21] Bochner, S. (1933). Monotone Funktionen, Stieltjessche Integrale und harmonische Analyse. Math. Ann. 108 378–410. MR1512856
[22] Borgwardt, K. M., Gretton, A., Rasch, M. J., Kriegel, H.-P., Schölkopf, B. and Smola, A. J. (2006). Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics (ISMB) 22 e49–e57.
[23] Boser, B., Guyon, I. and Vapnik, V. (1992). A training algorithm for optimal margin classifiers. In Proc. Annual Conf. Computational Learning Theory (D. Haussler, ed.) 144–152. ACM Press, Pittsburgh, PA.
[24] Bousquet, O., Boucheron, S. and Lugosi, G. (2005). Theory of classification: A survey of recent advances. ESAIM Probab. Statist. 9 323–375. MR2182250
[25] Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discov. 2 121–167.
[26] Cardoso, J.-F. (1998). Blind signal separation: Statistical principles. Proceedings of the IEEE 90 2009–2026.
[27] Chapelle, O. and Harchaoui, Z. (2005). A machine learning approach to conjoint analysis. In Advances in Neural Information Processing Systems 17 (L. K. Saul, Y. Weiss and L. Bottou, eds.) 257–264. MIT Press, Cambridge, MA.
[28] Chen, A. and Bickel, P. (2005). Consistent independent component analysis and prewhitening. IEEE Trans. Signal Process. 53 3625–3632. MR2239886
[29] Chen, S., Donoho, D. and Saunders, M. (1999). Atomic decomposition by basis pursuit. SIAM J. Sci. Comput. 20 33–61. MR1639094
[30] Collins, M. (2000). Discriminative reranking for natural language parsing. In Proc. 17th International Conf. Machine Learning (P. Langley, ed.) 175–182. Morgan Kaufmann, San Francisco, CA.
[31] Collins, M. and Duffy, N. (2001). Convolution kernels for natural language. In Advances in Neural Information Processing Systems 14 (T. G. Dietterich, S. Becker and Z. Ghahramani, eds.) 625–632. MIT Press, Cambridge, MA.
[32] Cook, D., Buja, A. and Cabrera, J. (1993). Projection pursuit indices based on orthonormal function expansions. J. Comput. Graph. Statist. 2 225–250. MR1272393
[33] Cortes, C., Mohri, M. and Weston, J. (2005). A general regression technique for learning transductions. In ICML '05: Proceedings of the 22nd International Conference on Machine Learning 153–160. ACM Press, New York.
[34] Cortes, C. and Vapnik, V. (1995). Support vector networks. Machine Learning 20 273–297.
[35] Crammer, K. and Singer, Y. (2001). On the algorithmic implementation of multiclass kernel-based vector machines. J. Mach. Learn. Res. 2 265–292.
[36] Crammer, K. and Singer, Y. (2005). Loss bounds for online category ranking. In Proc. Annual Conf. Computational Learning Theory (P. Auer and R. Meir, eds.) 48–62. Springer, Berlin. MR2203253
[37] Cristianini, N. and Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines. Cambridge Univ. Press.
[38] Cristianini, N., Shawe-Taylor, J., Elisseeff, A. and Kandola, J. (2002). On kernel-target alignment. In Advances in Neural Information Processing Systems 14 (T. G. Dietterich, S. Becker and Z. Ghahramani, eds.) 367–373. MIT Press, Cambridge, MA.
[39] Culotta, A., Kulp, D. and McCallum, A. (2005). Gene prediction with conditional random fields. Technical Report UM-CS-2005-028, Univ. Massachusetts, Amherst.
[40] Darroch, J. N. and Ratcliff, D. (1972). Generalized iterative scaling for log-linear models. Ann. Math. Statist. 43 1470–1480. MR0345337
[41] Das, D. and Sen, P. (1994). Restricted canonical correlations. Linear Algebra Appl. 210 29–47. MR1294769
[42] Dauxois, J. and Nkiet, G. M. (1998). Nonlinear canonical analysis and independence tests. Ann. Statist. 26 1254–1278. MR1647653
[43] Dawid, A. P. (1992). Applications of a general propagation algorithm for probabilistic expert systems. Stat. Comput. 2 25–36.
[44] DeCoste, D. and Schölkopf, B. (2002). Training invariant support vector machines. Machine Learning 46 161–190.
[45] Dekel, O., Manning, C. and Singer, Y. (2004). Log-linear models for label ranking. In Advances in Neural Information Processing Systems 16 (S. Thrun, L. Saul and B. Schölkopf, eds.) 497–504. MIT Press, Cambridge, MA.
[46] Della Pietra, S., Della Pietra, V. and Lafferty, J. (1997). Inducing features of random fields. IEEE Trans. Pattern Anal. Machine Intelligence 19 380–393.
[47] Einmahl, J. H. J. and Mason, D. M. (1992). Generalized quantile processes. Ann. Statist. 20 1062–1078. MR1165606
[48] Elisseeff, A. and Weston, J. (2001). A kernel method for multi-labeled classification. In Advances in Neural Information Processing Systems 14 681–687. MIT Press, Cambridge, MA.
[49] Fiedler, M. (1973). Algebraic connectivity of graphs. Czechoslovak Math. J. 23 298–305. MR0318007
[50] FitzGerald, C. H., Micchelli, C. A. and Pinkus, A. (1995). Functions that preserve families of positive semidefinite matrices. Linear Algebra Appl. 221 83–102. MR1331791
[51] Fletcher, R. (1989). Practical Methods of Optimization. Wiley, New York. MR0955799
[52] Fortet, R. and Mourier, E. (1953). Convergence de la répartition empirique vers
[140] Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer, New York. MR1367965
[141] Vapnik, V. (1998). Statistical Learning Theory. Wiley, New York. MR1641250
[142] Vapnik, V. and Chervonenkis, A. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl. 16 264–281.
[143] Vapnik, V. and Chervonenkis, A. (1991). The necessary and sufficient conditions for consistency in the empirical risk minimization method. Pattern Recognition and Image Analysis 1 283–305.
[144] Vapnik, V., Golowich, S. and Smola, A. J. (1997). Support vector method for function approximation, regression estimation, and signal processing. In Advances in Neural Information Processing Systems 9 (M. C. Mozer, M. I. Jordan and T. Petsche, eds.) 281–287. MIT Press, Cambridge, MA.
[145] Vapnik, V. and Lerner, A. (1963). Pattern recognition using generalized portrait method. Autom. Remote Control 24 774–780.
[146] Vishwanathan, S. V. N. and Smola, A. J. (2004). Fast kernels for string and tree matching. In Kernel Methods in Computational Biology (B. Schölkopf, K. Tsuda and J. P. Vert, eds.) 113–130. MIT Press, Cambridge, MA.
[147] Vishwanathan, S. V. N., Smola, A. J. and Vidal, R. (2007). Binet–Cauchy kernels on dynamical systems and its application to the analysis of dynamic scenes. Internat. J. Computer Vision 73 95–119.
[148] Wahba, G. (1990). Spline Models for Observational Data. SIAM, Philadelphia. MR1045442
[149] Wahba, G., Wang, Y., Gu, C., Klein, R. and Klein, B. (1995). Smoothing spline ANOVA for exponential families, with application to the Wisconsin Epidemiological Study of Diabetic Retinopathy. Ann. Statist. 23 1865–1895. MR1389856
[150] Wainwright, M. J. and Jordan, M. I. (2003). Graphical models, exponential families, and variational inference. Technical Report 649, Dept. Statistics, Univ. California, Berkeley.
[151] Watkins, C. (2000). Dynamic alignment kernels. In Advances in Large Margin Classifiers (A. J. Smola, P. L. Bartlett, B. Schölkopf and D. Schuurmans, eds.) 39–50. MIT Press, Cambridge, MA. MR1820960
[152] Wendland, H. (2005). Scattered Data Approximation. Cambridge Univ. Press. MR2131724
[153] Weston, J., Chapelle, O., Elisseeff, A., Schölkopf, B. and Vapnik, V. (2003). Kernel dependency estimation. In Advances in Neural Information Processing Systems 15 (S. Becker, S. Thrun and K. Obermayer, eds.) 873–880. MIT Press, Cambridge, MA. MR1820960
[154] Whittaker, J. (1990). Graphical Models in Applied Multivariate Statistics. Wiley, New York. MR1112133
[155] Yang, H. H. and Amari, S.-I. (1997). Adaptive on-line learning algorithms for blind separation: Maximum entropy and minimum mutual information. Neural Comput. 9 1457–1482.
[156] Zettlemoyer, L. S. and Collins, M. (2005). Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In Uncertainty in Artificial Intelligence (UAI) 658–666. AUAI Press, Arlington, VA.
[157] Zien, A., Rätsch, G., Mika, S., Schölkopf, B., Lengauer, T. and Müller, K.-R. (2000). Engineering support vector machine kernels that recognize translation initiation sites. Bioinformatics 16 799–807.
B. Schölkopf
Max Planck Institute
for Biological Cybernetics
Tübingen
Germany
E-mail: bs@tuebingen.mpg.de
A. J. Smola
Statistical Machine Learning Program
National ICT Australia
Canberra
Australia
E-mail: Alex.Smola@nicta.com.au