ISBN 978-1-4757-2440-0
The Nature of
Statistical Learning Theory
With 33 Illustrations
Springer
Vladimir N. Vapnik
AT&T Bell Laboratories
101 Crawfords Corner Road
Holmdel, NJ 07733 USA
All rights reserved. This work may not be translated or copied in whole or in part without the
written permission of the publisher (Springer Science+Business Media, LLC), except for brief
excerpts in connection with reviews or scholarly
analysis. Use in connection with any form of information storage and retrieval, electronic
adaptation, computer software, or by similar or dissimilar methodology now known or
hereafter developed is forbidden.
The use of general descriptive names, trade names, trademarks, etc., in this publication, even
if the former are not especially identified, is not to be taken as a sign that such names, as
understood by the Trade Marks and Merchandise Marks Act, may accordingly be used freely
by anyone.
In memory of my mother
Preface
(iii) Discovery of the law of large numbers in functional space and its
relation to the learning processes by Vapnik and Chervonenkis.
These four discoveries also form a basis for any progress in the studies of
learning processes.
The problem of learning is so general that almost any question that
has been discussed in statistical science has its analog in learning theory.
Furthermore, some very important general results were first found in the
framework of learning theory and then reformulated in terms of statistics.
In particular, learning theory was the first to stress the problem of
small sample statistics. It was shown that by taking into account the size
of the sample one can obtain better solutions to many problems of function
estimation than by using methods based on classical statistical techniques.
Small sample statistics in the framework of the new paradigm constitutes
an advanced subject of research both in statistical learning theory and in
theoretical and applied statistics. The rules of statistical inference devel-
oped in the framework of the new paradigm should not only satisfy the
existing asymptotic requirements but also guarantee that one does one's
best in using the available restricted information. The results of this theory
are new methods of inference for various statistical problems.
To develop these methods (that often contradict intuition), a compre-
hensive theory was built that includes:
(i) Concepts describing the necessary and sufficient conditions for con-
sistency of inference.
(ii) Bounds describing the generalization ability of learning machines
based on these concepts.
(iii) Inductive inference for small sample sizes, based on these bounds.
(iv) Methods for implementing this new type of inference.
Two difficulties arise when one tries to study statistical learning theory:
a technical one and a conceptual one: to understand the proofs and to
understand the nature of the problem, its philosophy.
To overcome the technical difficulties one has to be patient and persistent
in following the details of the formal inferences.
To understand the nature of the problem, its spirit, and its philosophy,
one has to see the theory as a whole, not only as a collection of its different
parts. Understanding the nature of the problem is extremely important
because it leads to searching in the right direction for results and prevents
searching in wrong directions.
The goal of this book is to describe the nature of statistical learning the-
ory. I would like to show how the abstract reasoning implies new algorithms.
To make the reasoning easier to follow, I made the book short.
I tried to describe things as simply as possible but without conceptual
simplifications. Therefore the book contains neither details of the theory
nor proofs of the theorems (both details of the theory and proofs of the the-
orems can be found (partly) in my 1982 book Estimation of Dependencies
Based on Empirical Data, Springer and (in full) in my forthcoming book
Statistical Learning Theory, J. Wiley, 1996). However, to describe the ideas
without simplifications I needed to introduce new concepts (new mathematical
constructions), some of which are nontrivial. Nevertheless, the book does describe
a (conceptual) forest even if it does not consider the (mathematical) trees.
In writing this book I had one more goal in mind: I wanted to stress the
practical power of abstract reasoning. The point is that during the last few
years at different computer science conferences, I heard repetitions of the
following claim:
Complex theories do not work, simple algorithms do.
One of the goals of this book is to show that, at least in the problems
of statistical inference, this is not true. I would like to demonstrate that in
this area of science a good old principle is valid:
Nothing is more practical than a good theory.
The book is not a survey of the standard theory. It is an attempt to
promote a certain point of view not only on the problem of learning and
generalization but on theoretical and applied statistics as a whole.
It is my hope that the reader will find the book interesting and useful.
ACKNOWLEDGMENTS
This book became possible due to the support of Larry Jackel, the head of the
Adaptive Systems Research Department, AT&T Bell Laboratories.
It was inspired by collaboration with my colleagues Jim Alvich, Jan
Ben, Yoshua Bengio, Bernhard Boser, Leon Bottou, Jane Bromley, Chris
Burges, Corinna Cortes, Eric Cosatto, Joanne DeMarco, John Denker, Har-
ris Drucker, Hans Peter Graf, Isabelle Guyon, Donnie Henderson, Larry
Jackel, Yann LeCun, Robert Lyons, Nada Matic, Urs Mueller, Craig Nohl,
Edwin Pednault, Eduard Sackinger, Bernhard Schölkopf, Patrice Simard,
Sara Solla, Sandi von Pier, and Chris Watkins.
Chris Burges, Edwin Pednault, and Bernhard Schölkopf read various
versions of the manuscript and improved and simplified the exposition.
When the manuscript was ready I gave it to Andrew Barron, Yoshua
Bengio, Robert Berwick, John Denker, Federico Girosi, Ilia Izmailov, Larry
Jackel, Yakov Kogan, Esther Levin, Tomaso Poggio, Edward Reitman,
Alexander Shustorovich, and Chris Watkins for remarks. These remarks
also improved the exposition.
I would like to express my deep gratitude to everyone who helped create
this book.
Vladimir N. Vapnik
AT&T Bell Laboratories,
Holmdel, March 1995
Contents
Preface vii
References 177
Remarks on References 177
References 178
Index 185
Introduction:
Four Periods in the Research of the
Learning Problem
In the history of research on the learning problem one can distinguish four
periods, characterized by four bright events:
y = sign{(w · x) - b},
where (u · v) is the inner product of two vectors, b is a threshold
value, and sign(u) = 1 if u > 0 and sign(u) = -1 if u ≤ 0.
Geometrically speaking, the neuron divides the space X into two regions:
a region where the output y takes the value 1 and a region where the output
y takes the value -1. These two regions are separated by the hyperplane
(w· x) - b = O.
The vector w and the scalar b determine the position of the separating hy-
perplane. During the learning process the Perceptron chooses appropriate
coefficients of the neuron.
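The decision rule y = sign{(w · x) - b} of a single neuron can be sketched in a few lines of Python (a minimal illustration; the weights, threshold, and test points below are invented for the example, not taken from the text):

```python
# A threshold neuron: y = sign((w . x) - b),
# with sign(u) = 1 for u > 0 and sign(u) = -1 for u <= 0.

def inner(w, x):
    # the inner product (w . x)
    return sum(wi * xi for wi, xi in zip(w, x))

def neuron(w, b, x):
    # +1 on one side of the separating hyperplane (w . x) - b = 0, -1 on the other
    return 1 if inner(w, x) - b > 0 else -1
```

For instance, with w = (1, 1) and b = 1 the point (1, 1) falls on the positive side of the hyperplane and (0, 0) on the negative side.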
Rosenblatt considered a model that is a composition of several neurons:
he considered several levels of neurons, where outputs of neurons of the pre-
vious level are inputs for neurons of the next level (the output of one neuron
can be input to several neurons). The last level contains only one neuron.
Therefore, the (elementary) Perceptron has n inputs and one output.
Geometrically speaking, the Perceptron divides the space X into two
parts separated by a piecewise linear surface (Fig. 0.2). Choosing appropriate
coefficients for all neurons of the net, the Perceptron specifies two regions
in X space. These regions are separated by piecewise linear surfaces (not
necessarily connected). Learning in this model means finding appropriate
coefficients for all neurons using given training data.
In the 1960s it was not clear how to choose the coefficients simultaneously
for all neurons of the Perceptron (the solution came 25 years later).
constructing a decision rule separating two categories of vectors using given probability
distribution functions for these categories of vectors.
0.1. Rosenblatt's Perceptron (The 1960s) 3
y = sign[(w · x) - b].
FIGURE 0.1. (a) Model of a neuron. (b) Geometrically, a neuron defines two
regions in input space where it takes the values -1 and 1. These regions are
separated by a hyperplane (w · x) - b = 0.
FIGURE 0.2. (a) The Perceptron is a composition of several neurons. (b) Geo-
metrically, the Perceptron defines two regions in input space where it takes the
values -1 and 1. These regions are separated by a piecewise linear surface.
N ≤ [R² / ρ²]
corrections the hyperplane that separates the training data will be
constructed.
This theorem played an extremely important role in creating learning
theory. It somehow connected the cause of generalization ability with the
principle of minimizing the number of errors on the training set. As we
will see in the last chapter, the expression [R² / ρ²] describes an important
concept that, for a wide class of learning machines, allows control of
generalization ability.
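The correction procedure whose number of updates the expression [R² / ρ²] bounds can be sketched as follows (a hedged illustration: the update rule is the standard Perceptron correction with the threshold absorbed into the weight vector, and the toy data are invented):

```python
def perceptron_corrections(data, epochs=100):
    # data: list of (x, y) with labels y in {-1, +1}; each x carries a constant
    # first component so the threshold b is absorbed into w.
    w = [0.0] * len(data[0][0])
    corrections = 0
    for _ in range(epochs):
        updated = False
        for x, y in data:
            s = sum(wi * xi for wi, xi in zip(w, x))
            if y * s <= 0:                        # mistake: correct the weights
                w = [wi + y * xi for wi, xi in zip(w, x)]
                corrections += 1
                updated = True
        if not updated:                           # a separating hyperplane is found
            break
    return w, corrections
```

On separable data the loop terminates after a finite number of corrections, in agreement with the bound above.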
6 Introduction: Four Periods in the Research of the Learning Problem
ℓ ≤ [ (1 + 4 ln(R/ρ) - ln η) / (- ln(1 - ε)) ] [R² / ρ²]
steps,
(ii) by the stopping moment it will have constructed a decision rule that
with probability 1 - η has a probability of error on the test set less
than ε (Aizerman, Braverman, and Rozonoer, 1964).
After these results many researchers thought that minimizing the error
on the training set is the only cause of generalization (small probability of
test errors). Therefore the analysis of learning processes was split into two
branches, call them Applied Analysis of learning processes and Theoretical
Analysis of learning processes.
The philosophy of Applied Analysis of the learning process can be de-
scribed as follows:
To get a good generalization it is sufficient to choose the coeffi-
cients of the neuron that provide the minimal number of train-
ing errors. The principle of minimizing the number of training
errors is a self-evident inductive principle and from the practi-
cal point of view does not need justification. The main goal of
Applied Analysis is to find methods for constructing the coef-
ficients simultaneously for all neurons such that the separating
surface provides the minimal number of errors on the training
data.
0.2. Construction of the Fundamentals of the Learning Theory 7
This book shows that indeed the principle of minimizing the number
of training errors is not self-evident and that there exists another more
intelligent inductive principle that provides a better level of generalization
ability.
These results were inspired by the study of learning processes. They are
the main subject of the book.
Af = F,  f ∈ 𝓕
(finding f ∈ 𝓕 that satisfies the equality) is ill-posed; even if there exists a
unique solution to this equation, a small deviation on the right-hand side
of this equation (F_δ instead of F, where ||F - F_δ|| < δ is arbitrarily small)
can cause large deviations in the solutions (it can happen that ||f_δ - f|| is
large).
In this case, if the right-hand side F of the equation is not exact (e.g., it
equals F_δ, where F_δ differs from F by some level δ of noise), the functions
f_δ that minimize the functional
R(f) = ||Af - F_δ||²
do not guarantee a good approximation to the desired solution even if δ
tends to zero.
Hadamard thought that ill-posed problems are a pure mathematical phe-
nomenon and that all real-life problems are "well-posed." However, in the
second half of the century a number of very important real-life problems
were found to be ill-posed. In particular ill-posed problems arise when one
tries to reverse the cause-effect relations: to find unknown causes from the
known consequences. Even if the cause-effect relationship forms a one-to-
one mapping, the problem of inverting it can be ill-posed.
For our discussion it is important that one of the main problems of
statistics, estimating the density function from the data, is ill-posed.
In the middle of the 1960s it was discovered that if instead of the functional
R(f) one minimizes another, so-called regularized, functional
R*(f) = ||Af - F_δ||² + γ(δ) Ω(f),
where Ω(f) is some functional (belonging to a special class of functionals)
and γ(δ) is an appropriately chosen constant (depending on the level of
noise), then one obtains a sequence of solutions that converges to the
desired one as δ tends to zero. Thus, while the "self-evident" method
of minimizing the functional R(f) does not work, the not "self-evident"
method of minimizing the functional R*(f) does.
The influence of the philosophy created by the theory of solving ill-posed
problems is very deep. Both the regularization philosophy and the regularization
technique became widely spread in many areas of science, including
statistics.
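The effect can be sketched numerically. In the toy example below (the 2×2 operator, the noise, and the value of the regularization constant γ are all invented for the illustration), the naive solution of Af = F_δ moves far from the true solution under a 10⁻⁴ perturbation of the right-hand side, while minimizing the regularized functional ||Af - F_δ||² + γ||f||² stays stable:

```python
import math

def solve2(m, b):
    # Cramer's rule for a 2x2 system m f = b
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [(b[0] * m[1][1] - m[0][1] * b[1]) / det,
            (m[0][0] * b[1] - m[1][0] * b[0]) / det]

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

A = [[1.0, 1.0], [1.0, 1.0001]]   # a nearly singular operator
f_true = [1.0, 1.0]
F_d = [2.0, 2.0002]               # A f_true = (2, 2.0001), perturbed by 1e-4

# naive solution of A f = F_d: far from f_true despite the tiny perturbation
f_naive = solve2(A, F_d)

# regularized solution: minimize ||A f - F_d||^2 + gamma ||f||^2,
# i.e. solve (A^T A + gamma I) f = A^T F_d
gamma = 1e-4
AtA = [[A[0][0] ** 2 + A[1][0] ** 2 + gamma,
        A[0][0] * A[0][1] + A[1][0] * A[1][1]],
       [A[0][0] * A[0][1] + A[1][0] * A[1][1],
        A[0][1] ** 2 + A[1][1] ** 2 + gamma]]
AtF = [A[0][0] * F_d[0] + A[1][0] * F_d[1],
       A[0][1] * F_d[0] + A[1][1] * F_d[1]]
f_reg = solve2(AtA, AtF)
```

Here the naive solution lands far from f_true while the regularized one stays close; the price of regularization is a small bias, the gain is stability.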
All these new ideas are still being developed. However, they did shift
the main understanding of what can be done in the problem of dependency
estimation on the basis of a limited amount of empirical data.
y = S{(w · x) - b}
5The back-propagation method was actually found in 1963 for solving some
control problems (Bryson, Denham, and Dreyfus, 1963) and was rediscovered for
Perceptrons.
60f course it is very interesting to know how humans can learn. However, this
is not necessarily the best way for creating an artificial learning machine. It has
been noted that the study of birds flying was not very useful for constructing the
airplane.
7 L.G. Valiant, 1984, "A theory of learnability", Commun. ACM 27(11), 1134-1142.
8 "If the computational requirement is removed from the definition then we
are left with the notion of nonparametric inference in the sense of statistics, as
discussed in particular by Vapnik." (L. Valiant, 1991, "A view of computational
learning theory", in: "Computation and Cognition", Society for Industrial
and Applied Mathematics, Philadelphia, p. 36.)
1 This is the general case, which includes the case where the supervisor uses a
function y = f(x).
2 Note that the elements a ∈ A are not necessarily vectors. They can be any
abstract parameters. Therefore, we in fact consider any set of functions.
16 1. Setting of the Learning Problem
FIGURE 1.1. A model of learning from examples. During the learning process,
the Learning Machine observes the pairs (x, y) (the training set). After training,
the machine must on any given x return a value ȳ. The goal is to return a value
ȳ which is close to the supervisor's response y.
The goal is to find the function f(x, a₀) which minimizes the risk functional
R(a) (over the class of functions f(x, a), a ∈ A) in the situation where
the joint probability distribution function F(x, y) is unknown and the only
available information is contained in the training set (1.1).
For this loss function, the functional (1.2) determines the probability of
different answers given by the supervisor and by the indicator function
f(x, a). We call the case of different answers a classification error.
The problem, therefore, is to find a function which minimizes the prob-
ability of classification error when the probability measure F(x, y) is un-
known, but the data (1.1) are given.
f(x, a₀) = ∫ y dF(y|x).
It is known that the regression function is the one which minimizes the
functional (1.2) with the following loss function³:
L(y, f(x, a)) = (y - f(x, a))².  (1.4)
Thus the problem of regression estimation is the problem of minimizing the
risk functional (1.2) with the loss function (1.4) in the situation where the
probability measure F(x, y) is unknown but the data (1.1) are given.
3 If the regression function f(x) does not belong to the set f(x, a), a ∈ A, then the
function f(x, a₀) minimizing the functional (1.2) with loss function (1.4) is the
closest to the regression in the metric L₂(F):
x₁, . . . , x_ℓ
are given.
R(a) = ∫ Q(z, a) dF(z),  a ∈ A,  (1.6)
is given.
The learning problems considered above are particular cases of this gen-
eral problem of minimizing the risk functional (1.6) on the basis of empirical
data (1.7), where z describes a pair (x, y) and Q(z, a) is the specific loss-
function (e.g., one of Eqs. (1.3), (1.4), or (1.5)). In the following we will
describe the results obtained for the general statement of the problem. To
apply them to specific problems, one has to substitute the corresponding
loss-functions in the formulas obtained.
R_emp(a) = (1/ℓ) Σ_{i=1}^ℓ Q(z_i, a).  (1.8)
(ii) One approximates the function Q(z, a₀) which minimizes the risk (1.6)
by the function Q(z, a_ℓ) minimizing the empirical risk (1.8).
This principle is called the Empirical Risk Minimization inductive principle
(ERM principle).
We say that an inductive principle defines a learning process if for any
given set of observations the learning machine chooses the approximation
using this inductive principle. In learning theory the ERM principle plays
a crucial role.
The ERM principle is quite general. The classical methods for the solu-
tion of a specific learning problem, such as the least-squares method in the
problem of regression estimation or the maximum likelihood (ML) method
in the problem of density estimation, are realizations of the ERM principle
for the specific loss-functions considered above.
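The first of these reductions (least squares as ERM with the quadratic loss (1.4)) can be sketched on a one-parameter set of functions f(x, a) = a·x; the model and the data below are invented for the illustration:

```python
def empirical_risk(a, data):
    # R_emp(a) = (1/l) * sum_i (y_i - f(x_i, a))^2 with f(x, a) = a * x
    return sum((y - a * x) ** 2 for x, y in data) / len(data)

def erm_least_squares(data):
    # the minimizer of R_emp in closed form: a = sum(x*y) / sum(x*x)
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / sxx
```

For data lying exactly on the line y = 2x the empirical minimizer is a = 2 and the empirical risk at the minimum is zero.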
Indeed, substituting the specific loss function (1.4) in Eq. (1.8) one obtains
the functional to be minimized
R_emp(a) = (1/ℓ) Σ_{i=1}^ℓ (y_i - f(x_i, a))²,
that forms the least-squares method, while substituting the specific loss
function (1.5) in Eq. (1.8) one obtains the functional to be minimized
R_emp(a) = -(1/ℓ) Σ_{i=1}^ℓ ln p(x_i, a).
called the discriminant function (rule) that assigns value 1 to representatives
of the first category and value -1 to representatives of the second
category. To find the discriminant rule one has to estimate two densities:
p₁(x, a) and p₂(x, β). In the classical paradigm one uses the ML method
to estimate the parameters a* and β* of these densities.
f₀(x) = f(x, a₀),
one can estimate the parameters a₀ of the unknown function f(x, a₀) by
the ML method, namely by maximizing the functional
L(a) = Σ_{i=1}^ℓ ln p(y_i - f(x_i, a)).
(Recall that p(ξ) is a known function and that ξ = y - f(x, a₀).) Taking
the normal law
p(ξ) = (1/(σ√(2π))) exp{-ξ²/(2σ²)}
with zero mean and some fixed variance as a model of noise, one obtains
the least-squares method:
L*(a) = -(1/(2σ²)) Σ_{i=1}^ℓ (y_i - f(x_i, a))² - ℓ ln(√(2π) σ).
24 1. Informal Reasoning and Comments - 1
p(x, a, σ) = (1/(2σ√(2π))) exp{-(x - a)²/(2σ²)} + (1/(2√(2π))) exp{-x²/2}.
For this density, choosing a = x₁ and an arbitrarily small σ = σ₀, the
log-likelihood of the sample x₁, . . . , x_ℓ satisfies
L(x₁, σ₀) > ln (1/(2σ₀√(2π))) + Σ_{i=2}^ℓ ln ( (1/(2√(2π))) exp{-x_i²/2} )
= -ln σ₀ - Σ_{i=2}^ℓ x_i²/2 - ℓ ln(2√(2π)) > A
for any A, provided σ₀ is sufficiently small.
From this inequality one concludes that the maximum of the likelihood
does not exist and therefore the ML method does not provide a solution for
estimating the parameters a and σ.
Thus, the ML method can only be applied to a very restrictive set of
densities.
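The degeneracy is easy to reproduce numerically: pegging the location parameter at a = x₁ and letting σ shrink drives the log-likelihood of the mixture above any level (a small sketch of the argument above; the sample values are arbitrary):

```python
import math

def log_likelihood(xs, a, sigma):
    # log-likelihood of the mixture (1/2) N(a, sigma^2) + (1/2) N(0, 1)
    total = 0.0
    for x in xs:
        p = (0.5 / (sigma * math.sqrt(2 * math.pi))
             * math.exp(-(x - a) ** 2 / (2 * sigma ** 2))
             + 0.5 / math.sqrt(2 * math.pi) * math.exp(-x ** 2 / 2))
        total += math.log(p)
    return total

xs = [0.3, -1.2, 0.8, 2.5]
# peg a at the first observation and shrink sigma:
# the likelihood grows without bound
values = [log_likelihood(xs, xs[0], s) for s in (1.0, 1e-2, 1e-4)]
```

Each tenfold decrease of σ adds roughly ln 10 to the likelihood through the term centered at x₁, while the other terms stay bounded below by the second mixture component.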
In the beginning of the 1960s several authors suggested various new meth-
ods, so-called nonparametric methods, for density estimation. The goal of
these methods was to estimate a density from a rather wide set of functions
that is not restricted to be a parametric set of functions (M. Rosenblatt,
1957), (Parzen, 1962), and (Chentsov, 1963).
(i) Parzen's estimator is consistent (in various metrics) for estimating
a density from a very wide class of densities.
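A Parzen-type estimator can be sketched in a few lines (a minimal illustration with a Gaussian kernel; the particular kernel and bandwidth are assumptions of the example, not prescriptions from the text):

```python
import math

def parzen_density(x, sample, h):
    # p_l(x) = (1 / (l * h)) * sum_i K((x - x_i) / h) with a Gaussian kernel K
    k = lambda u: math.exp(-u * u / 2) / math.sqrt(2 * math.pi)
    return sum(k((x - xi) / h) for xi in sample) / (len(sample) * h)
```

The estimate is itself a density: it is nonnegative and integrates to one for any sample and any bandwidth h > 0.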
FIGURE 1.2. The empirical distribution function F_ℓ(x), constructed from the
data x₁, . . . , x_ℓ, approximates the probability distribution function F(x).
Solve the integral equation (1.10) in the case where the probability
distribution function is unknown, but i.i.d. data x₁, . . . , x_ℓ, . . .
obtained in accordance with this function are given.
Using these data one can construct the empirical distribution function
F_ℓ(x). Therefore one has to solve the integral equation (1.10) for the case
where instead of the exact right-hand side, one knows an approximation
that converges uniformly to the unknown function as the number of obser-
vations increases.
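Constructing F_ℓ(x) from the data takes one line (a minimal sketch; the sample is invented):

```python
def empirical_cdf(sample):
    # F_l(x) = (1/l) * #{ x_i <= x }: the fraction of observations not exceeding x
    l = len(sample)
    return lambda x: sum(1 for xi in sample if xi <= x) / l
```

By the Glivenko-Cantelli theorem this step function converges uniformly to F(x) as the number of observations grows, which is exactly the approximation property used above.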
Note that the problem of solving this integral equation in a wide class of
functions {p(t)} is ill-posed. This brings us to two conclusions:
(i) Generally speaking the estimation of a density is a hard (ill-posed)
computational problem.
(ii) To solve this problem well one has to use regularization (i.e., not
"self-evident") techniques.
We now formulate the main principle for solving problems using a restricted
amount of information:
When solving a given problem, try to avoid solving a more general problem
as an intermediate step.
Although this principle is obvious, it is not easy to follow it. For our
problems of dependency estimation this principle means that to solve the
problem of pattern recognition or regression estimation, one must try to
find the desired function "directly" (in the next section we will specify what
this means) rather than first estimating the densities and then using the
estimated densities to construct the desired function.
Note that estimation of densities is a universal problem of statistics
(knowing the densities one can solve various problems). Estimation of den-
sities in general is an ill-posed problem, therefore it requires a lot of ob-
servations in order to be solved well. In contrast, the problems which we
really need to solve (decision rule estimation or regression estimation) are
quite particular ones; often they can be solved on the basis of a reasonable
number of observations.
To illustrate this idea let us consider the following situation. Suppose one
wants to construct a decision rule separating two sets of vectors described
by two normal laws: N(μ₁, Σ₁) and N(μ₂, Σ₂). In order to construct the
discriminant rule (1.9), one has to estimate from the data two n-dimensional
vectors, the means μ₁ and μ₂, and two n × n covariance matrices Σ₁ and
Σ₂. As a result one obtains a separating polynomial of degree two:
practice. In practice, one uses the linear discriminant function that occurs
when the two covariance matrices coincide, Σ₁ = Σ₂ = Σ:
In what follows we argue that the setting of learning problems given in this
chapter not only allows us to consider estimating problems in any given
set of functions, but also to implement the main principle for using small
samples: avoiding the solution of unnecessarily general problems.
5 In the 1960s the problem of constructing the best linear discriminant function
(in the case where a quadratic function is optimal) was solved (Anderson and
Bahadur, 1966). For solving real-life problems one usually uses linear discriminant
functions even if it is known that the optimal solution belongs to the quadratic
discriminant functions.
where fo(x) is the regression function. Note that the second term in Eq.
(1.11) does not depend on the chosen function. Therefore, minimizing this
functional is equivalent to minimizing the functional
The last functional equals the squared L₂(F) distance between a function
of the set of admissible functions and the regression. Therefore, we consider
the following problem: using the sample, find in the admissible set of
functions the closest one to the regression (in the L₂(F) metric).
If one accepts the L₂(F) metric, then the formulation of the regression
estimation problem (minimizing R(a)) is direct. (It does not require solving
a more general problem, for example, finding F(x, y).)
Let us add to this functional a constant (a functional that does not depend
on the approximating functions)
C = ∫ ln p₀(t) dF(t),
where p₀(t) and F(t) are the desired density and its probability distribution
function. We obtain
-∫ ln( p(t, a)/p₀(t) ) p₀(t) dt.
R(a) = ∫ Q(z, a) dF(z)
z₁, . . . , z_ℓ
6In 1967 this theory was also suggested by S. Amari (Amari, 1967).
(i) When for any element of the training data the gradient is so small
that the learning process cannot be continued.
(ii) When the learning process is not saturated but satisfies some stopping
criterion.
It is easy to see that in the first case the stochastic approximation method is
just a special way of minimizing the empirical risk. The second case consti-
tutes a regularization method of minimizing the risk functional. 7 Therefore,
in the "non-wasteful regimes" the stochastic approximation method can be
explained as either inductive properties of the ERM method or inductive
properties of the regularization method.
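A minimal sketch of such a stochastic approximation procedure (plain stochastic gradient descent on a one-parameter quadratic loss; the learning rate, model, and data are invented for the illustration):

```python
def sgd(data, lr=0.1, epochs=50):
    # stochastic approximation: after each observation (x, y), move the
    # parameter a along the negative gradient of the loss (y - a * x)^2
    a = 0.0
    for _ in range(epochs):
        for x, y in data:
            grad = -2.0 * (y - a * x) * x
            a -= lr * grad
    return a
```

Run to saturation (until the gradients vanish on every training point), this is just a special way of minimizing the empirical risk, as noted above; stopped early, it acts instead as a regularization method.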
To complete the discussion on classical inductive inferences it is neces-
sary to consider Bayesian inference. In order to use this inference one must
possess additional a priori information complementary to the set of para-
metric functions containing the desired one. Namely, one must know the
distribution function that describes the probability for any function from
the admissible set of functions to be the desired one. Therefore, Bayesian
inference is based on using strong a priori information (it requires that the
desired function belongs to the set of functions of the learning machine).
In this sense it does not define a general way for inference. We will discuss
this inference later in the comments on Chapter 4.
Thus, along with the ERM inductive principle one can use other inductive
principles. However, the ERM principle (compared to other ones) looks
more robust (it uses empirical data better, it does not depend on a priori
information, and there are clear ways to implement it).
Therefore, in the analysis of learning processes, the key problem became
exploring the ERM principle.
The goal of this part of the theory is to describe the conceptual model
for learning processes that are based on the Empirical Risk Minimization
inductive principle. This part of the theory has to explain when a learning
machine that minimizes empirical risk can achieve a small value of actual
risk (can generalize) and when it cannot. In other words, the goal of this
part is to describe the necessary and sufficient conditions for the consistency
of learning processes that minimize the empirical risk.
The following question arises:
FIGURE 2.1. The learning process is consistent if both the expected risks R(a_ℓ)
and the empirical risks R_emp(a_ℓ) converge to the minimal possible value of the
risk, inf_{a∈A} R(a).
Let Q(z, a_ℓ) be a function that minimizes the empirical risk functional
R_emp(a) = (1/ℓ) Σ_{i=1}^ℓ Q(z_i, a)
It is clear (Fig. 2.2) that for the extended set of functions (containing φ(z))
the ERM method will be consistent. Indeed, for any distribution function
and for any number of observations, the minimum of the empirical risk
will be attained on the function φ(z) that also gives the minimum of the
expected risk.
This example shows that there exist trivial cases of consistency which
depend on whether the given set of functions contains a minorizing function.
Therefore, any theory of consistency that uses the classical definition
must determine whether a case of trivial consistency is possible. That means
that the theory should take into account the specific functions in the given
set.
36 2. Consistency of Learning Processes
the convergence in probability
inf_{a∈A(c)} R_emp(a) → inf_{a∈A(c)} R(a),  ℓ → ∞,  (2.3)
is valid.
In other words, the ERM is nontrivially consistent if it provides conver-
gence (2.3) for the subset of functions that remain after the functions with
the smallest values of the risks are excluded from this set.
Note that in the classical definition of consistency described in the pre-
vious section one uses two conditions: (2.1) and (2.2). In the definition of
nontrivial consistency one uses only one condition (2.3). It can be shown
that condition (2.1) will be satisfied automatically under the condition of
nontrivial consistency.
In this chapter we will study conditions for nontrivial consistency which
for simplicity we will call consistency.
The Key Theorem of Learning Theory is the following (Vapnik and Cher-
vonenkis, 1989):
Theorem 2.1. Let Q(z, a), a ∈ A, be a set of functions that satisfy the
condition
A ≤ ∫ Q(z, a) dF(z) ≤ B   (A ≤ R(a) ≤ B).
is valid, where x₁, . . . , x_ℓ is an i.i.d. sample obtained according to the density
p₀(x).
In other words we define the ML method to be nontrivially consistent if it
is consistent for estimating any density from the admissible set of densities.
For the ML method the following Key Theorem is true (Vapnik and
Chervonenkis, 1989):
Theorem 2.2. For the ML method to be nontrivially consistent on the
set of densities:
2The following fact confirms the importance of this theorem. Toward the end of
the 1980s and beginning of the 1990s several alternative approaches to learning
theory were attempted based on the idea that statistical learning theory is a
theory of "worst case analysis." In these approaches authors expressed a hope to
develop a learning theory for "real case analysis." According to the Key Theorem,
this type of theory for the ERM principle is impossible.
We call this sequence of random variables which depend both on the proba-
bility measure F(z) and on the set of functions Q(z,a), a E A, a two-sided
empirical process. The problem is to describe conditions under which this
empirical process converges in probability to zero. The convergence in prob-
ability of the process (2.5) means that the equality
holds true.
Along with the two-sided empirical process we consider the one-sided empirical
process given by the sequence of random variables
holds true. According to the Key Theorem, the uniform one-sided conver-
gence (2.8) is a necessary and sufficient condition for consistency of the
ERM method.
We will see that conditions for uniform two-sided convergence play an
important role in constructing the conditions for uniform one-sided
convergence.
Key Theorem pointing the way for analysis of the problem of consistency
of the ERM inductive principle.
The necessary and sufficient conditions for both uniform one-sided convergence
and uniform two-sided convergence are obtained on the basis of a
concept called the entropy of the set of functions Q(z, a), a ∈ A,
on a sample of size ℓ.
For simplicity we will introduce this concept in two steps: first for the
set of indicator functions (that take only the two values 0 and 1) and then
for the set of real bounded functions.
q(a) = (Q(z₁, a), . . . , Q(z_ℓ, a)),  a ∈ A,
that one obtains when a takes various values from A. Then, geometrically
speaking, N^A(z₁, . . . , z_ℓ) is the number of different vertices of the ℓ-
dimensional cube that can be obtained on the basis of the sample z₁, . . . , z_ℓ
and the set of functions Q(z, a), a ∈ A (Fig. 2.3).
Let us call the value
H^A(z₁, . . . , z_ℓ) = ln N^A(z₁, . . . , z_ℓ)
the random entropy. The random entropy describes the diversity of the set
of functions on the given data. H^A(z₁, . . . , z_ℓ) is a random variable since it
was constructed using the i.i.d. data. Now we consider the expectation of
the random entropy over the joint distribution function F(z₁, . . . , z_ℓ):
H^A(ℓ) = E ln N^A(z₁, . . . , z_ℓ).
We call this quantity the entropy of the set of indicator functions Q(z, a),
a ∈ A, on samples of size ℓ. It depends on the set of functions Q(z, a),
a ∈ A, the probability measure, and the number of observations ℓ, and it
describes the expected diversity of the given set of indicator functions on
a sample of size ℓ.
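For a concrete set of indicator functions these quantities can be computed directly. The sketch below (an invented example, not from the text) takes the one-dimensional threshold functions Q(z, a) = 1 for z ≤ a and 0 otherwise, enumerates the distinct vectors (Q(z₁, a), . . . , Q(z_ℓ, a)), and returns the random entropy ln N of the sample:

```python
import math

def random_entropy(sample):
    # count the distinct binary vectors (Q(z_1, a), ..., Q(z_l, a)) produced
    # by the threshold functions Q(z, a) = 1 if z <= a else 0
    labelings = set()
    candidates = [min(sample) - 1] + sorted(sample)  # one a per distinct labeling
    for a in candidates:
        labelings.add(tuple(1 if z <= a else 0 for z in sample))
    return math.log(len(labelings))  # the random entropy ln N(z_1, ..., z_l)
```

For ℓ distinct points the thresholds realize only ℓ + 1 of the 2^ℓ possible vertices of the cube, so ln N grows like ln ℓ and the entropy per observation tends to zero, which is the kind of behavior the consistency conditions discussed later require.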
2.3. Uniform Two-Sided Convergence 41
This set of vectors belongs to the ℓ-dimensional cube (Fig. 2.4) and has a
finite minimal ε-net in the metric³ C (or in the metric L_p). Let N = N^A(ε; z₁, . . . , z_ℓ)
3 The set of vectors q(a), a ∈ A, has a minimal ε-net q(a₁), . . . , q(a_N) if:
(i) There exist N = N^A(ε; z₁, . . . , z_ℓ) vectors q(a₁), . . . , q(a_N), such that for
any vector q(a*), a* ∈ A, one can find among these N vectors one, q(a_r),
which is ε-close to q(a*) (in a given metric). For the metric C that means
in the following respect: N^A(ε) is the cardinality of the minimal ε-net of the set of
functions Q(z, a), a ∈ A, while the VC entropy is the expectation of the diversity
of the set of functions on a sample of size ℓ.
2.4. Uniform One-Sided Convergence 43
Note that the given definition of the entropy of a set of real functions is
a generalization of the definition of the entropy given for a set of indicator
functions. Indeed, for a set of indicator functions the minimal ε-net for
ε < 1 does not depend on ε and is a subset of the vertices of the unit cube.
Therefore, for ε < 1,
H^A(ε, ℓ) = H^A(ℓ).
Below we will formulate the theory for the set of bounded real functions.
The obtained general results are of course valid for the set of indicator
functions.
$$\lim_{\ell\to\infty} \frac{H^\Lambda(\varepsilon,\ell)}{\ell} = 0, \quad \forall \varepsilon > 0 \qquad (2.10)$$

be valid.
$$\lim_{\ell\to\infty} \frac{H^\Lambda(\ell)}{\ell} = 0,$$

which is a particular case of equality (2.10).
This condition for uniform two-sided convergence was obtained in 1968
(Vapnik and Chervonenkis 1968, 1971). The generalization of this result for
bounded sets of functions (Theorem 2.3) was found in 1981 (Vapnik and
Chervonenkis 1981).
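For a concrete feeling of this condition (an illustrative sketch, not from the text): for the threshold indicators on the line considered above, N^Λ(z_1, …, z_ℓ) = ℓ + 1 for distinct points, so H^Λ(ℓ) = ln(ℓ + 1) and the entropy-to-sample-size ratio vanishes:

```python
import math

# Threshold indicators on the line have VC entropy H(l) = ln(l + 1),
# so H(l)/l -> 0 and the sufficient condition for uniform
# two-sided convergence is satisfied.
for l in [10, 100, 1000, 10000]:
    print(l, math.log(l + 1) / l)
```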
44 2. Consistency of Learning Processes
$$\lim_{\ell\to\infty} P\left\{\left[\sup_{\alpha\in\Lambda}\left(R(\alpha) - R_{\mathrm{emp}}(\alpha)\right) > \varepsilon\right] \text{ or } \left[\sup_{\alpha\in\Lambda}\left(R_{\mathrm{emp}}(\alpha) - R(\alpha)\right) > \varepsilon\right]\right\} = 0. \qquad (2.11)$$
The condition (2.11) includes uniform one-sided convergence and therefore
forms a sufficient condition for consistency of the ERM method. Note, however, that when solving learning problems we face a nonsymmetrical situation: we require consistency in minimizing the empirical risk, but we do not care about consistency with respect to maximizing the empirical risk. So for consistency of the ERM method the second condition on the left-hand side of Eq. (2.11) can be violated.
The next theorem describes a condition under which there exists consis-
tency in minimizing the empirical risk but not necessarily in maximizing
the empirical risk (Vapnik and Chervonenkis, 1989).
Consider the set of bounded real functions Q(z, α), α ∈ Λ, together with a new set of functions Q*(z, α*), α* ∈ Λ*, satisfying some conditions of measurability as well as the following conditions: for any function from Q(z, α), α ∈ Λ, there exists a function in Q*(z, α*), α* ∈ Λ*, such that (Fig. 2.5)

$$Q(z,\alpha) - Q^*(z,\alpha^*) \geq 0, \quad \forall z,$$
$$\int \left(Q(z,\alpha) - Q^*(z,\alpha^*)\right) dF(z) \leq \delta. \qquad (2.12)$$
FIGURE 2.5. For any function Q(z, α), α ∈ Λ, one considers a function Q*(z, α*), α* ∈ Λ*, such that Q*(z, α*) does not exceed Q(z, α) and is close to it.
2.5. Theory of Nonfalsifiability 45
$$\lim_{\ell\to\infty} \frac{H^{\Lambda^*}(\varepsilon,\ell)}{\ell} = c^* > 0, \qquad (2.13)$$

then the learning machine with functions Q(z, α), α ∈ Λ is faced with a situation that in the philosophy of science corresponds to a so-called nonfalsifiable theory.
Before we describe the formal part of the theory, let us remind the reader
what the idea of nonfalsifiability is.
(iv) Finally, in both theories, inference has the same level of formaliza-
tion. It contains two parts: the formal description of reality and the
informal interpretation of it.
Let us come back to our example. Both meteorology and astrology make
weather forecasts. Consider the following assertion:
Once in New Jersey, in July there was a tropical rain storm and then
snowfall.
5Recall Laplace's calculations of conditional probability that the sun will rise
tomorrow given that it rose every day up to this day. It will rise for sure according
to the models that we use and in which we believe. However with probability one
we can assert only that the sun rose every day up to now during the thousands
of years of recorded history.
Suppose now that for the VC entropy of the set of indicator functions Q(z, α), α ∈ Λ, the following equality is true:

$$\lim_{\ell\to\infty} \frac{H^\Lambda(\ell)}{\ell} = \ln 2.$$
It can be shown that the ratio of the entropy to the number of observations, H^Λ(ℓ)/ℓ, monotonically decreases as the number of observations ℓ increases.⁶ Therefore, if the limit of the ratio of the entropy to the number of observations tends to ln 2, then for any finite number ℓ the equality

$$\frac{H^\Lambda(\ell)}{\ell} = \ln 2$$

holds true.
This means that for almost all samples z_1, …, z_ℓ (i.e., all but a set of measure zero) the equality

$$N^\Lambda(z_1,\ldots,z_\ell) = 2^\ell$$

is valid.
In other words, the set of functions of the learning machine is such that almost any sample z_1, …, z_ℓ (of arbitrary size ℓ) can be separated in all possible ways by functions of this set. This implies that the minimum of the empirical risk for this machine equals zero. We call this learning machine nonfalsifiable because it can give a general explanation (function) for almost any data (Fig. 2.6).

Note that the minimum value of the empirical risk is equal to zero independent of the value of the expected risk.
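A hypothetical sketch of this phenomenon in code: a machine whose admissible functions can reproduce any labeling of any sample (a pure memorizer) attains zero empirical risk even on random labels, so no data can falsify it:

```python
import random

random.seed(1)
sample = [random.random() for _ in range(50)]
labels = [random.randint(0, 1) for _ in sample]   # pure noise

# The "general explanation": a function that memorizes the sample.
lookup = dict(zip(sample, labels))
Q = lambda z: lookup.get(z, 0)

emp_risk = sum(Q(z) != y for z, y in zip(sample, labels)) / len(sample)
print(emp_risk)  # 0.0 whatever the labels are: nothing is falsified
```

The empirical risk is zero independent of the expected risk, which for these random labels equals 1/2.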
Theorem 2.5. For the set of indicator functions Q(z, α), α ∈ Λ, let the convergence

$$\lim_{\ell\to\infty} \frac{H^\Lambda(\ell)}{\ell} = c > 0$$

be valid. Then there exists a subset Z* of the set Z for which the probability measure is

$$P(Z^*) = a(c) \neq 0$$
such that for the intersection of almost any training set
Thus, if the conditions for uniform two-sided convergence fail, then there
exists some subspace of the input space where the learning machine is
nonfalsifiable (Fig. 2.7).
We say that a learning machine that uses the set of functions Q(z, α), α ∈ Λ, is potentially nonfalsifiable if there exist two functions ψ_1(z) ≥ ψ_0(z) such that:

(i) There exists a positive constant c for which the equality

$$\int \left(\psi_1(z) - \psi_0(z)\right) dF(z) = c > 0$$

is valid.

(ii) For almost any sample z_1, …, z_ℓ, any sequence of binary values δ_1, …, δ_ℓ, δ_i ∈ {0, 1}, and any ε > 0, one can find a function Q(z, α*) in the set of functions Q(z, α), α ∈ Λ, for which the inequalities

$$\left|Q(z_i,\alpha^*) - \psi_{\delta_i}(z_i)\right| \leq \varepsilon, \quad i = 1,\ldots,\ell,$$

hold true.
(i) In the simplest example considered in Section 2.6.1, for the set of indicator functions Q(z, α), α ∈ Λ, we use this concept of nonfalsifiability where ψ_1(z) ≡ 1 and ψ_0(z) ≡ 0;
(ii) in Theorem 2.5 we can use the functions

$$\psi_1(z) = \begin{cases} 1 & \text{if } z \in Z^* \\ Q(z) & \text{if } z \notin Z^*, \end{cases} \qquad \psi_0(z) = \begin{cases} 0 & \text{if } z \in Z^* \\ Q(z) & \text{if } z \notin Z^*, \end{cases}$$

where Q(z) is some indicator function.
be valid.
Then the learning machine with this set of functions is potentially non-
falsifiable.
Thus, if the conditions of Theorem 2.4 fail (in this case, of course, the
conditions of Theorem 2.3 will also fail), then the learning machine is non-
falsifiable. This is the main reason why the ERM principle may be incon-
sistent.
Below we again consider the set of indicator functions Q(z, α), α ∈ Λ (i.e., we consider the problem of pattern recognition). As mentioned above, in
2.7. Three Milestones in Learning Theory 53
the case of indicator functions Q(z, α), α ∈ Λ, the minimal ε-net of the vectors q(α), α ∈ Λ (see Section 2.3.3) does not depend on ε if ε < 1. The number of elements in the minimal ε-net in this case is N^Λ(z_1, …, z_ℓ).
These concepts are defined in such a way that for any ℓ the inequalities

$$H^\Lambda(\ell) \leq H^\Lambda_{\mathrm{ann}}(\ell) \leq G^\Lambda(\ell)$$

are valid. On the basis of these functions the main milestones of learning theory are constructed.
In Section 2.3.4 we introduced the equation

$$\lim_{\ell\to\infty} \frac{H^\Lambda(\ell)}{\ell} = 0$$

describing a sufficient condition for consistency of the ERM principle (the necessary and sufficient conditions are given by a slightly different construction (2.13)). This equation is the first milestone in learning theory: we require that any machine minimizing empirical risk should satisfy it.
However, this equation says nothing about the rate of convergence of the obtained risks R(α_ℓ) to the minimal one R(α_0). It is possible to construct examples where the ERM principle is consistent, but where the risks have an arbitrarily slow asymptotic rate of convergence.
The question is:
Under what conditions is the asymptotic rate of convergence fast?
54 2. Consistency of Learning Processes
We say that the asymptotic rate of convergence is fast if for any ℓ > ℓ_0 the exponential bound

$$P\{R(\alpha_\ell) - R(\alpha_0) > \varepsilon\} < e^{-c\varepsilon^2\ell}, \quad c > 0,$$

holds true. The equation

$$\lim_{\ell\to\infty} \frac{H^\Lambda_{\mathrm{ann}}(\ell)}{\ell} = 0$$

is a sufficient condition for a fast rate of convergence.⁸ This equation is the second milestone of learning theory: it guarantees a fast asymptotic rate of convergence.
Thus far, we have considered two equations: one that describes a necessary and sufficient condition for the consistency of the ERM method, and one that describes a sufficient condition for a fast rate of convergence of the ERM method. Both equations are valid for a given probability measure F(z) on the observations (both the VC entropy H^Λ(ℓ) and the VC Annealed entropy H^Λ_ann(ℓ) are constructed using this measure). However, our goal is to construct a learning machine capable of solving many different problems (for many different probability measures).
The question is:
Under what conditions is the ERM principle consistent and, simultaneously, does it provide a fast rate of convergence, independent of the probability measure?
The following equation describes the necessary and sufficient conditions for consistency of ERM for any probability measure:

$$\lim_{\ell\to\infty} \frac{G^\Lambda(\ell)}{\ell} = 0.$$

It is also the case that if this condition holds true, then the rate of convergence is fast.
This equation is the third milestone in the learning theory. It describes
a necessary and sufficient condition under which a learning machine that
implements the ERM principle has a high asymptotic rate of convergence
independent of the probability measure (i.e., independent of the problem that has to be solved).
These milestones form the foundation for constructing both distribution-independent bounds for the rate of convergence of learning machines, and rigorous distribution-dependent bounds, which we will consider in Chapter 3.
The weak mode estimation of probability measures forms one of the most
important problems in the foundations of statistics, the so-called General
Glivenko-Cantelli problem. The results described in Chapter 2 provide a
complete solution to this problem.
(i) Z ∈ F;
(ii) if A ∈ F, then the complement Ā ∈ F;
(iii) if A_i ∈ F, then ⋃_{i=1}^∞ A_i ∈ F.
Example. Let us describe a model of the random experiments that
are relevant to the following situation: somebody throws two dice, say
red and black, and observes the result of the experiment. The space of
elementary events Z of this experiment can be described by pairs of
⁹About σ-algebra one can read in any advanced textbook on probability theory. (See, for example, A. N. Schiryaev, Probability, Springer, New York, p. 577.) This concept makes it possible to use the formal tools developed in measure theory for constructing the foundations of probability theory.
2.8. The Basic Problems of Probability Theory and Statistics 57
FIGURE 2.9. The space of elementary events for a two-dice throw. The events A_10 and A_{r>b} are indicated.
integers, where the first number describes the points on the red die and the second number describes the points on the black one. An event in this experiment can be any subset of this set of elementary events. For example, it can be the subset A_10 of elementary events for which the sum of points on the two dice is equal to 10, or it can be the subset A_{r>b} of elementary events where the red die has a larger number of points than the black one, etc. (Fig. 2.9).
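The example can be checked by direct enumeration (a sketch, not from the book):

```python
from fractions import Fraction

# Elementary events: pairs (red, black), each of probability 1/36.
Z = [(r, b) for r in range(1, 7) for b in range(1, 7)]

A10 = [(r, b) for (r, b) in Z if r + b == 10]   # sum of points is 10
Arb = [(r, b) for (r, b) in Z if r > b]         # red exceeds black

P = lambda A: Fraction(len(A), len(Z))
print(P(A10), P(Arb))  # 1/12 and 5/12
```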
The pair (Z, F) consisting of the set Z and the σ-algebra F of events A ∈ F is an idealization of the qualitative aspect of random experiments.
The quantitative aspect of experiments is determined by a probability measure P(A) defined on the elements A of the set F. The function P(A) defined on the elements A ∈ F is called a countably additive probability measure on F or, for simplicity, a probability measure, provided
(i) P(A) ≥ 0;
(ii) P(Z) = 1;
(iii) for any disjoint events A_1, A_2, … the equality

$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$$

is valid.
Let (2.14) be the result of ℓ independent trials with the model (Z, F, P). Consider the random variable ν(z_1, …, z_ℓ; A) defined for a fixed event A ∈ F by the value

$$\nu_\ell(A) = \nu(z_1,\ldots,z_\ell; A) = \frac{n_A}{\ell},$$

where n_A is the number of elements of the set z_1, …, z_ℓ belonging to event A. The random variable ν_ℓ(A) is called the frequency of occurrence of an event A in a series of ℓ independent, random trials.
In terms of these concepts we can formulate the basic problems of prob-
ability theory and the theory of statistics.
The basic problem of probability theory
Given a model (Z, F, P) and an event A*, estimate the distribution (or some of its characteristics) of the frequency of occurrence of the event A* in a series of ℓ independent random trials. Formally, this amounts to finding the distribution function
¹⁰The concept of independent trials actually is the one that makes probability theory different from measure theory. Without the concept of independent trials the axioms of probability theory define a model from the theory of measure.
2.9. Two Modes of Estimating a Probability Measure 59
Example. In our example with two dice it can be the following problem: What is the probability that the frequency of the event A_10 (sum of points equals 10) will be less than ~ if one throws the dice ℓ times?
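A Monte Carlo sketch of this kind of question (illustrative only; the series length and the number of series are arbitrary choices):

```python
import random

random.seed(0)

def frequency(l):
    """nu_l(A10): frequency of 'sum of points equals 10' in l throws."""
    hits = sum(1 for _ in range(l)
               if random.randint(1, 6) + random.randint(1, 6) == 10)
    return hits / l

# Estimate the distribution of nu_l(A10) over repeated series of
# l = 60 throws; P(A10) = 1/12, so the frequencies center near 1/12.
series = [frequency(60) for _ in range(2000)]
estimate = sum(f < 1 / 12 for f in series) / len(series)
print(estimate)
```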
(ii) We say that the estimator E_ℓ(A) estimates the probability measure P in the weak mode determined by some subset F* ⊂ F if

$$\sup_{A\in F^*} \left|P(A) - \mathcal{E}_\ell(A)\right| \xrightarrow{P} 0 \quad \text{as } \ell\to\infty, \qquad (2.17)$$
FIGURE 2.10. The Lebesgue integral defined in (2.18) is the limit of a sum of products, where the factor P{Q(z, α) > iB/m} is the (probability) measure of the set {z : Q(z, α) > iB/m} and the factor B/m is the height of a slice.
where the subset F* (of the set F) does not necessarily form a σ-algebra.
For our reasoning it is important that if one can estimate the probability measure in one of these modes (with respect to a special set F* described below for the weak mode), then one can minimize the risk functional in a given set of functions.
Indeed, consider the case of bounded risk functions 0 ≤ Q(z, α) ≤ B. Let us rewrite the risk functional in an equivalent form, using the definition of the Lebesgue integral (Fig. 2.10):

$$R(\alpha) = \int Q(z,\alpha)\,dF(z) = \lim_{m\to\infty} \sum_{i=1}^{m} \frac{B}{m}\, P\left\{Q(z,\alpha) > \frac{iB}{m}\right\}. \qquad (2.18)$$

If the estimator E_ℓ(A) approximates P(A) well in the strong mode, i.e., approximates uniformly well the probability of any event A (including the events A_{α,i} = {z : Q(z, α) > iB/m}), then the functional

$$R^*(\alpha) = \lim_{m\to\infty} \sum_{i=1}^{m} \frac{B}{m}\, \mathcal{E}_\ell\left\{Q(z,\alpha) > \frac{iB}{m}\right\} \qquad (2.19)$$
2.10. Strong Mode Estimation of Probability Measures 61
is sufficient.
Thus, in order to find the function that minimizes risk (Eq. (2.18)) with unknown probability measure P{A}, one can minimize the functional (2.19), where instead of P{A} an approximation E_ℓ{A} that converges to P{A} in any mode (with respect to the events A_{α,i}, α ∈ Λ, i = 1, …, m, for the weak mode) is used.
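A quick numerical check of this construction (a sketch with an arbitrary bounded loss; the choices of Q, B, and m are illustrative): replacing the measure in the Lebesgue sum by empirical frequencies reproduces the empirical mean up to B/m:

```python
import random

random.seed(0)
B, m = 1.0, 1000                 # bound on the loss, number of slices
data = [random.random() for _ in range(5000)]
Q = lambda z: z * z              # an illustrative loss, 0 <= Q <= B

# Empirical frequency of the event {z : Q(z) > c}.
freq = lambda c: sum(Q(z) > c for z in data) / len(data)

# The Lebesgue-sum form (2.19) with frequencies in place of P{...}:
lebesgue_sum = sum(B / m * freq(i * B / m) for i in range(1, m + 1))
direct_mean = sum(Q(z) for z in data) / len(data)
print(lebesgue_sum, direct_mean)  # agree to within B/m
```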
is valid, i.e., the strong mode distance between the approximation of the probability measure and the actual measure is bounded by the L_1 distance between the approximation of the density and the actual density.
Thus, to estimate the probability measure in the strong mode, it is sufficient to estimate a density function. In Section 1.8 we stressed that estimating a density function from the data forms an ill-posed problem. Therefore, generally speaking, one cannot guarantee a good approximation using a fixed number of observations.
Fortunately, as we saw above, to estimate the function that minimizes the
risk-functional one does not necessarily need to approximate the density.
It is sufficient to approximate the probability measure in the weak mode, where the set of events F* depends on the admissible set of functions Q(z, α), α ∈ Λ: it must contain the events
In the 1930s Glivenko and Cantelli proved a theorem that can be considered as the most important result in the foundation of statistics. They proved that any probability distribution function of one random variable ξ,

$$F(z) = P\{\xi < z\},$$

can be approximated arbitrarily well by the empirical distribution function

$$F_\ell(z) = \frac{1}{\ell}\sum_{i=1}^{\ell} \theta(z - z_i),$$

where z_1, …, z_ℓ is i.i.d. data obtained according to the unknown density
(Fig. 1.2). More precisely, the Glivenko-Cantelli theorem asserts that for any ε > 0 the equality

$$\lim_{\ell\to\infty} P\left\{\sup_z \left|F(z) - F_\ell(z)\right| > \varepsilon\right\} = 0$$

is valid.
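The theorem is easy to observe numerically (a sketch for the uniform distribution on [0, 1], where F(z) = z):

```python
import random

random.seed(0)

def sup_deviation(l):
    """sup_z |F(z) - F_l(z)| for l i.i.d. uniform observations,
    where F_l is the empirical distribution function. For sorted
    data the supremum is attained at the jump points."""
    zs = sorted(random.random() for _ in range(l))
    return max(max(abs((i + 1) / l - z), abs(i / l - z))
               for i, z in enumerate(zs))

for l in [100, 1000, 10000]:
    print(l, sup_deviation(l))   # shrinks roughly like 1/sqrt(l)
```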
(the set of rays on the line pointing to -00). For any event Az of this set
of events one can evaluate its probability
(2.21)
Using an i.i.d. sample of size ℓ one can also estimate the frequency of occurrence of the event A_z in independent trials:
(2.22)
The bounds are nontrivial (i.e., for any ε > 0 the right-hand side tends to zero when the number of observations ℓ goes to infinity) if

$$\lim_{\ell\to\infty} \frac{H^\Lambda_{\mathrm{ann}}(\ell)}{\ell} = 0.$$

(Recall that in Section 2.7 we called this condition the second milestone of learning theory.)
To discuss the difference between these two bounds, let us recall that for any indicator function Q(z, α) the risk functional R(α) defines the probability of a Bernoulli event, whose standard deviation is

$$\sigma = \sqrt{R(\alpha)(1 - R(\alpha))},$$

and therefore the maximum of the variance is achieved for the events with probability R(α*) ≈ 1/2. In other words, the largest deviations are associated with functions that possess large risk.
In Section 3.3, using the bound on the rate of convergence, we will obtain a bound on the risk where the confidence interval is determined by the rate of uniform convergence, i.e., by the function with risk R(α*) ≈ 1/2 (the "worst" function in the set).
To obtain a smaller confidence interval one can try to construct the
bound on the risk using a bound for another type of uniform convergence,
namely, the uniform relative convergence
Therefore, for any distribution function F(z), the following inequalities hold
true:
(3.5)
(Recall that in Section 2.7 we called this equation the third milestone in
learning theory).
It is important to note that conditions (3.5) are necessary and sufficient
for distribution free uniform convergence (3.3). In particular,
if condition (3.5) is violated then there exist probability measures F(z)
on Z for which uniform convergence
There are several ways to generalize the results obtained for the set of
indicator functions to the set of real functions. Below we consider the sim-
plest and most effective (it gives better bounds and is valid for the set of
unbounded real functions) (Vapnik 1979, 1996).
Let Q(z, α), α ∈ Λ, now be a set of real functions, with

$$A = \inf_{\alpha,z} Q(z,\alpha) \leq Q(z,\alpha) \leq \sup_{\alpha,z} Q(z,\alpha) = B$$
3.2. Generalization for the Set of Real Functions 69
FIGURE 3.1. The indicator of level β for the function Q(z, α) shows for which z the function Q(z, α) exceeds β and for which it does not. The function Q(z, α) can be described by the set of all its indicators.
(here A can be −∞ and/or B can be +∞). We denote the open interval (A, B) by B. Let us construct the set of indicators (Fig. 3.1) of the set of real functions Q(z, α), α ∈ Λ:

$$I(z,\alpha,\beta) = \theta\left(Q(z,\alpha) - \beta\right), \quad \alpha \in \Lambda,\ \beta \in B.$$

For a given function Q(z, α*) and for a given β*, the indicator I(z, α*, β*) indicates by 1 the region z ∈ Z where Q(z, α*) ≥ β* and indicates by 0 the region z ∈ Z where Q(z, α*) < β*.
In the case where Q(z, α), α ∈ Λ, are indicator functions, the set of indicators I(z, α, β), α ∈ Λ, β ∈ (0, 1), coincides with the set Q(z, α), α ∈ Λ.
For any given set of real functions Q(z, α), α ∈ Λ, we will extend the results of the previous section by considering the corresponding set of indicators I(z, α, β), α ∈ Λ, β ∈ B.
Let H^{Λ,B}(ℓ) be the VC entropy for the set of indicators, H^{Λ,B}_ann(ℓ) be the Annealed entropy for the set, and G^{Λ,B}(ℓ) be the Growth function.
Using these concepts we obtain the basic inequalities for the set of real functions as generalizations of inequalities (3.1) and (3.2). In our generalization we distinguish between three cases:
$$P\left\{\sup_{\alpha\in\Lambda}\left|\int Q(z,\alpha)\,dF(z) - \frac{1}{\ell}\sum_{i=1}^{\ell} Q(z_i,\alpha)\right| > \varepsilon\right\} \leq 4\exp\left\{\left(\frac{H^{\Lambda,B}_{\mathrm{ann}}(2\ell)}{\ell} - \frac{\varepsilon^2}{(B-A)^2}\right)\ell\right\}. \qquad (3.6)$$

$$P\left\{\sup_{\alpha\in\Lambda} \frac{\int Q(z,\alpha)\,dF(z) - \frac{1}{\ell}\sum_{i=1}^{\ell} Q(z_i,\alpha)}{\sqrt{\int Q(z,\alpha)\,dF(z)}} > \varepsilon\right\}$$
where
1 We consider p > 2 only to simplify the formulas. Analogous results hold true
for p > 1 (Vapnik, 1979, 1996).
3.3. The Main Distribution Independent Bounds 71
$$\lim_{\ell\to\infty} \frac{H^{\Lambda,B}_{\mathrm{ann}}(\ell)}{\ell} = 0.$$
The bounds (3.6), (3.7), and (3.8) were distribution dependent: the right-hand sides of the bounds use the Annealed entropy H^{Λ,B}_ann(ℓ), which is constructed on the basis of the distribution function F(z). To obtain distribution-independent bounds, one replaces the Annealed entropy H^{Λ,B}_ann(ℓ) on the right-hand sides of bounds (3.6), (3.7), (3.8) with the Growth function G^{Λ,B}(ℓ). Since for any distribution function the Growth function G^{Λ,B}(ℓ) is not smaller than the Annealed entropy H^{Λ,B}_ann(ℓ), the new bounds will be truly independent of the distribution function F(z).
Therefore, one can obtain the following distribution-independent bounds
on the rate of various types of uniform convergence:
(i) For the set of totally bounded functions −∞ < A ≤ Q(z, α) ≤ B < ∞:

$$P\left\{\sup_{\alpha\in\Lambda}\left|\int Q(z,\alpha)\,dF(z) - \frac{1}{\ell}\sum_{i=1}^{\ell} Q(z_i,\alpha)\right| > \varepsilon\right\} \leq 4\exp\left\{\left(\frac{G^{\Lambda,B}(2\ell)}{\ell} - \frac{\varepsilon^2}{(B-A)^2}\right)\ell\right\}. \qquad (3.10)$$
(ii) For the set of nonnegative totally bounded functions 0 ≤ Q(z, α) ≤ B < ∞:
(3.11)
(iii) For the set of nonnegative real functions 0 ≤ Q(z, α) whose pth normalized moment exists for some p > 2:
(3.12)
$$\lim_{\ell\to\infty} \frac{G^{\Lambda,B}(\ell)}{\ell} = 0. \qquad (3.13)$$
(A) What actual risk R(α_ℓ) is provided by the function Q(z, α_ℓ) that achieves minimal empirical risk R_emp(α_ℓ)?
(B) How close is this risk to the minimal possible inf_α R(α), α ∈ Λ, for the given set of functions?
(A) The following inequalities hold with probability of at least 1 − η simultaneously for all functions of Q(z, α), α ∈ Λ (including the function that minimizes the empirical risk):

$$R_{\mathrm{emp}}(\alpha) - \frac{(B-A)}{2}\sqrt{\mathcal{E}} \leq R(\alpha) \leq R_{\mathrm{emp}}(\alpha) + \frac{(B-A)}{2}\sqrt{\mathcal{E}}. \qquad (3.15)$$
(B) The following inequality holds with probability of at least 1 - 2", for
the function Q(z, at) that minimizes the empirical risk:
(A) The following inequality holds with probability of at least 1 - ", si-
multaneously for all functions Q(z, a) :::; B, a E A (including the
function that minimizes the empirical risk)
B£ ( 1 + 4Remp(a)
R(a) :::; Remp(a) +2 1+ B£ (3.17)
(B) The following inequality holds with probability of at least 1 - 2", for
the function Q(z, at) that minimizes the empirical risk
(3.18)
$$\sup_{\alpha\in\Lambda} \frac{\left(\int Q^p(z,\alpha)\,dF(z)\right)^{1/p}}{\int Q(z,\alpha)\,dF(z)} \leq \tau < \infty \qquad (3.19)$$
74 3. Bounds on the Rate of Convergence of Learning Processes
$$a(p) = \sqrt[p]{\frac{1}{2}\left(\frac{p-1}{p-2}\right)^{p-1}}$$

holds true simultaneously for all functions satisfying (3.19), where (u)_+ = max(u, 0). (This bound is a corollary of the bound on the rate of uniform convergence (3.12) and constraint (3.19).)
holds for the function Q(z, α_ℓ) that minimizes the empirical risk.
The inequalities (3.15), (3.17), and (3.20) bound the risks for all functions in the set Q(z, α), α ∈ Λ, including the function Q(z, α_ℓ) that minimizes the empirical risk. The inequalities (3.16), (3.18), and (3.21) evaluate how close the risk obtained by using the ERM principle is to the smallest possible risk.
Note that if E < 1 then bound (3.17), obtained from the rate of uniform relative deviation, is much better than bound (3.15), obtained from the rate of uniform convergence: for a small value of empirical risk the bound (3.17) has a confidence interval whose order of magnitude is E, and not √E as in bound (3.15).
$$G^\Lambda(\ell) = \ell\ln 2 \quad \text{for } \ell \leq h, \qquad G^\Lambda(\ell) \leq h\left(\ln\frac{\ell}{h} + 1\right) \quad \text{for } \ell > h,$$

are valid, the finiteness of the VC dimension of the set of indicator functions implemented by a learning machine is a sufficient condition for consistency of the ERM method independent of the probability measure. Moreover, a finite VC dimension implies a fast rate of convergence.
[Figure: the Growth function equals ℓ ln 2 up to ℓ = h and is bounded by h(ln(ℓ/h) + 1) for ℓ > h.]
³Any indicator function separates a given set of vectors into two subsets: the subset of vectors for which this indicator function takes the value 0 and the subset of vectors for which this indicator function takes the value 1.
3.6. The VC Dimension of a Set of Functions 77
$$\theta(z) = \begin{cases} 0 & \text{if } z < 0, \\ 1 & \text{if } z \geq 0. \end{cases}$$
FIGURE 3.3. The VC dimension of the lines in the plane is equal to three, since they can shatter three vectors, but not four: the vectors z_2, z_4 cannot be separated by a line from the vectors z_1, z_3.
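The claim in the caption can be verified by brute force (a sketch; the grid over (w_1, w_2, b) is a crude search, not an exact separability test, but it suffices for these points):

```python
from itertools import product

def separable(points, labels):
    """Search a coarse grid of halfplanes sign(w1*x + w2*y + b)
    for one reproducing the labeling (illustrative brute force)."""
    grid = [k / 4 for k in range(-8, 9)]
    return any(all((w1 * x + w2 * y + b > 0) == bool(s)
                   for (x, y), s in zip(points, labels))
               for w1, w2, b in product(grid, grid, grid))

tri = [(0, 0), (1, 0), (0, 1)]
all_shattered = all(separable(tri, lab)
                    for lab in product([0, 1], repeat=3))
print(all_shattered)                       # True: 3 points shattered

# The XOR labeling of four points is not linearly separable:
quad = [(0, 0), (1, 1), (1, 0), (0, 1)]
print(separable(quad, [1, 1, 0, 0]))       # False
```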
$$a = \pi\left(\sum_{i=1}^{\ell}(1-\delta_i)10^i + 1\right).$$

This example reflects the fact that by choosing an appropriate coefficient a one can, for any number of appropriately chosen points, approximate the values of any function bounded by (−1, +1) (Fig. 3.4) using sin az.
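This construction can be checked directly (a sketch following the formula above, with points z_i = 10^{-i}):

```python
import math
from itertools import product

l = 6
zs = [10.0 ** -(i + 1) for i in range(l)]   # z_i = 10^(-i)

# a = pi * (sum_i (1 - delta_i) * 10^i + 1) makes theta(sin(a z))
# reproduce any labeling delta_1, ..., delta_l of these points.
for deltas in product([0, 1], repeat=l):
    a = math.pi * (sum((1 - d) * 10 ** (i + 1)
                       for i, d in enumerate(deltas)) + 1)
    realized = tuple(1 if math.sin(a * z) > 0 else 0 for z in zs)
    assert realized == deltas
print("sin(az) realizes all", 2 ** l, "labelings of", l, "points")
```

The single parameter a thus shatters arbitrarily many points, which is why the VC dimension of this one-parameter family is infinite.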
In Chapter 5 we will consider a set of functions for which the VC dimension
is much less than the number of parameters.
Thus, generally speaking the VC dimension of a set of functions does
not coincide with the number of parameters. It can be both larger than
the number of parameters (as in Example 2) and smaller than the number
of parameters (we will use sets of functions of this type in Chapter 5 for
constructing a new type of learning machine).
In the next section we will see that the VC dimension of the set of func-
tions (rather than number of parameters) is responsible for generalization
ability of learning machines. This opens remarkable opportunities to overcome the "curse of dimensionality": to generalize well on the basis of a set of functions containing a huge number of parameters but possessing a small VC dimension.
3.7. Constructive Distribution-Independent Bounds 79
FIGURE 3.4. Using a high-frequency function sin(az), one can approximate well the value of any function −1 ≤ f(z) ≤ 1 in any ℓ points.
In this section we will present the bounds on the risk functional that in
Chapter 4 we use for constructing the methods for controlling the general-
ization ability of learning machines.
Consider sets of functions that possess a finite VC dimension h. In this case Theorem 3.3 states that the bound
(3.23)
holds true with

$$\mathcal{E} = 4\,\frac{h\left(\ln\frac{2\ell}{h} + 1\right) - \ln(\eta/4)}{\ell}. \qquad (3.24)$$
We also will consider the case where the set of loss functions Q(z, α), α ∈ Λ, contains a finite number N of elements. For this case one can use the expression

$$\mathcal{E} = 2\,\frac{\ln N - \ln\eta}{\ell}. \qquad (3.25)$$
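These expressions are easy to tabulate (a sketch; the form used below for the finite-class term, E = 2(ln N − ln η)/ℓ, is an assumption of this illustration):

```python
import math

def eps_vc(l, h, eta):
    """Confidence term of Eq. (3.24) for VC dimension h."""
    return 4 * (h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l

def eps_finite(l, n, eta):
    """Confidence term for a finite set of n functions
    (assumed form 2 * (ln n - ln eta) / l)."""
    return 2 * (math.log(n) - math.log(eta)) / l

# The terms shrink as l grows relative to h (or to ln n):
for l in [1000, 10000, 100000]:
    print(l, eps_vc(l, h=10, eta=0.05), eps_finite(l, n=1000, eta=0.05))
```

This makes visible the point of the chapter: the confidence term is governed by the ratio ℓ/h, not by the number of parameters.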
Thus, the following constructive bounds hold true, where in the case of a finite VC dimension one uses the expression for E given in Eq. (3.24), and in the case of a finite number of functions in the set one uses the expression for E given in Eq. (3.25).
Case 1. The set of totally bounded functions.
Let A ≤ Q(z, α) ≤ B, α ∈ Λ, be a set of totally bounded functions. Then:
(A) The following inequalities hold with probability of at least 1 − η simultaneously for all functions Q(z, α), α ∈ Λ (including the function that minimizes the empirical risk):

$$R_{\mathrm{emp}}(\alpha) - \frac{(B-A)}{2}\sqrt{\mathcal{E}} \leq R(\alpha) \leq R_{\mathrm{emp}}(\alpha) + \frac{(B-A)}{2}\sqrt{\mathcal{E}}. \qquad (3.26)$$
(B) The following inequality holds with probability of at least 1 - 2", for
the function Q(z, at) that minimizes the empirical risk:
(3.27)
(A) The following inequality holds with probability of at least 1- '" simul-
taneously for all functions Q(z, a) ::; B, a E A (including the function
that minimizes the empirical risk):
BE ( 1 +
R(a) ::; Remp(a) + 2 V+ 1 4Remp
BE (a)) . (3.28)
(B) The following inequality holds with probability of at least 1 - 2", for
the function Q(z, at) that minimizes the empirical risk:
(3.29)
(3.30)
$$a(p) = \sqrt[p]{\frac{1}{2}\left(\frac{p-1}{p-2}\right)^{p-1}}$$

holds true simultaneously for all functions satisfying Eq. (3.19), where (u)_+ = max(u, 0).
(3.31)
holds for the function Q(z, α_ℓ) that minimizes the empirical risk.
To construct rigorous bounds on the risk one has to take into account information about the probability measure. Let P_0 be the set of all probability measures on Z and let P ⊂ P_0 be a subset of the set P_0. We say that one has a priori information about the unknown probability measure F(z) if one knows a set of measures P that contains F(z).
Consider the following generalization of the Growth function:
For the extreme case where P = P_0, the Generalized Growth function G^Λ_P(ℓ) coincides with the Growth function G^Λ(ℓ), because the measure that assigns probability one to the sample z_1, …, z_ℓ is contained in P. For the other extreme case, where P contains only one function F(z), the Generalized Growth function coincides with the Annealed VC entropy.
⁴There exist lower bounds on the rate of uniform convergence where the order of magnitude is close to the order of magnitude obtained for the upper bounds (√(h/ℓ) in the lower bounds instead of √((h/ℓ) ln(ℓ/h)) in the upper bounds; see (Vapnik and Chervonenkis, 1974) for lower bounds).
The rigorous bounds for the risk can be derived in terms of the Gener-
alized Growth function.
They have the same functional form as the distribution-independent bounds (3.15), (3.17), and (3.21), but a different expression for E. The new expression for E is

$$\mathcal{E} = 4\,\frac{G^\Lambda_{\mathcal{P}}(2\ell) - \ln(\eta/4)}{\ell}.$$
However, these bounds are nonconstructive because no general methods have yet been found to evaluate the Generalized Growth function (in contrast to the original Growth function, where constructive bounds were obtained on the basis of the VC dimension of the set of functions).
To find rigorous constructive bounds one has to find a way of evaluating the Generalized Growth function for different sets P of probability measures. The main problem here is to find a subset P different from P_0 for which the Generalized Growth function can be evaluated on the basis of some constructive concepts (much as the Growth function was evaluated using the VC dimension of the set of functions).
Informal Reasoning and Comments - 3
A particular case of the bounds obtained in this chapter was already under investigation in classical statistics. These results are known as Kolmogorov-Smirnov distributions, widely used in both applied and theoretical statistics.
The bounds obtained in learning theory are different from the classical
ones in two respects:
(i) They are more general (they are valid for any set of indicator-functions
with finite VC dimension).
(ii) They are valid for a finite number of observations (the classical bounds
are asymptotic.)
This equality describes one of the main statistical laws, according to which the distribution of the random variable

$$\xi = \sqrt{\ell}\,\sup_z \left|F(z) - F_\ell(z)\right|$$

does not depend on the distribution function F(z) and has the form of Eq. (3.32).
Simultaneously, Smirnov found the distribution function for one-sided deviations of the empirical distribution function from the actual one (Smirnov, 1933). He proved that for continuous F(z) and sufficiently large ℓ the following equalities asymptotically hold. The quantities

$$\xi_1 = \sqrt{\ell}\,\sup_x \left|F(x) - F_\ell(x)\right|, \qquad \xi_2 = \sqrt{\ell}\,\sup_x \left(F(x) - F_\ell(x)\right)$$

are called the Kolmogorov-Smirnov statistics.
When the Glivenko-Cantelli theorem was generalized for multidimensional distribution functions,⁵ it was proved that for any ε > 0 there exists a sufficiently large ℓ_0 such that for ℓ > ℓ_0 the inequality
Note that the results obtained in learning theory have the form of inequali-
ties, rather than equalities as obtained for a particular case by Kolmogorov
and Smirnov. For this particular case it is possible to evaluate how close
to the exact values the obtained general bounds are.
Let Q(z, α), α ∈ Λ, be a set of indicator functions with VC dimension h. Let us rewrite the bound (3.3) in the form (3.33), where the coefficient a equals one. In the Glivenko-Cantelli case (for which the Kolmogorov-Smirnov bounds are valid) we actually consider the set of indicator functions Q(z, α) = θ(z − α). (For these indicator functions

$$F(\alpha) = \int \theta(z - \alpha)\,dF(z),$$

where z_1, …, z_ℓ is i.i.d. data.) Note that for this set of indicator functions
the VC dimension is equal to one: using indicators of rays (with one direction) one can shatter only one point. Therefore, for a sufficiently large ℓ, the second term in the parentheses of the exponent on the right-hand side of Eq. (3.33) is arbitrarily small, and the bound is determined by the first term in the exponent. This term in the general formula coincides with the (main) term in the Kolmogorov-Smirnov formulas up to a constant: instead of a = 1 the Kolmogorov-Smirnov bounds have the constant⁶ a = 2.
In 1988 Devroye found a way to obtain a nonasymptotic bound with the
constant a = 2 (Devroye, 1988). However, in the exponent of the right-hand
side of this bound the second term is
⁶In the first result, obtained in 1968, the constant was a = 1/8 (Vapnik and Chervonenkis, 1968, 1971); in 1979 it was improved to a = 1/4 (Vapnik, 1979); in 1991 L. Bottou showed me a proof with a = 1. This bound was also obtained by J. M. Parrondo and C. Van den Broeck (Parrondo and Van den Broeck, 1993).
instead of

$$\frac{h\left(\ln\frac{2\ell}{h} + 1\right)}{\ell}. \qquad (3.34)$$
For the case that is important in practice, namely, where
the bound with coefficient a = 1 and term (3.34) described in this chapter
is better.
The bounds obtained for the set of real functions are generalizations of the
bounds obtained for the set of indicator functions. These generalizations
were obtained on the basis of a generalized concept of VC dimension that
was constructed for the set of real functions.
There exist, however, several ways to construct a generalization of the
VC dimension concept for sets of real functions that allow us to derive the
corresponding bounds.
One of these generalizations is based on the concept of a VC subgraph,
introduced by Dudley (Dudley, 1978) (in the AI literature this concept
was renamed pseudo-dimension). Using the VC subgraph concept, Dudley
obtained a bound on the metric ε-entropy for the set of bounded real functions. On the basis of this bound, Pollard derived a bound for the rate
of uniform convergence of the means to their expectation (Pollard, 1984).
This bound was used by Haussler for learning machines.⁷
Note that the VC dimension concept for the set of real functions described in this chapter forms a slightly stronger requirement on the capacity of the set of functions than Dudley's VC subgraph. On the other hand,
using the VC dimension concept one obtains more attractive bounds:

(i) They have a form that has a clear physical sense (they depend on the
ratio ℓ/h).
(ii) More importantly, using this concept one can obtain bounds on uni-
form relative convergence for sets of bounded functions as well as for
sets of unbounded functions. The rate of uniform convergence (or uni-
form relative convergence) of the empirical risks to actual risks for
the unbounded set of loss-functions is the basis for an analysis of the
regression problem.
⁷D. Haussler (1992), "Decision theoretic generalizations of the PAC model for
neural net and other learning applications," Information and Computation, 100 (1), pp. 78–150.
3.10. Racing for the Constant 87
In Chapter 3 we obtained the bounds for the generalization ability of learning machines with sets of totally bounded functions,

$$R(\alpha_\ell) \le R_{\text{emp}}(\alpha_\ell) + \frac{\mathcal{E}}{2}\left(1 + \sqrt{1 + \frac{4R_{\text{emp}}(\alpha_\ell)}{\mathcal{E}}}\right), \tag{4.1}$$

and the bounds for the generalization ability of learning machines with sets
of unbounded functions,

$$R(\alpha_\ell) \le \frac{R_{\text{emp}}(\alpha_\ell)}{\left(1 - a(p)\,\tau\sqrt{\mathcal{E}}\right)_+}, \tag{4.2}$$

where

$$\mathcal{E} = 4\,\frac{h\left(\ln\frac{2\ell}{h} + 1\right) - \ln(\eta/4)}{\ell}.$$
4.1 STRUCTURAL RISK MINIMIZATION (SRM) INDUCTIVE PRINCIPLE
The ERM principle is intended for dealing with large sample sizes. It can
be justified by considering the inequalities (4.1) or (4.2).
When ℓ/h is large, ℰ is small. Therefore, the second summand on the
right-hand side of inequality (4.1) (the second summand in the denominator of
(4.2)) becomes small. The actual risk is then close to the value of the
empirical risk. In this case, a small value of the empirical risk guarantees
a small value of the (expected) risk.
However, if ℓ/h is small, a small R_emp(α_ℓ) does not guarantee a small
value of the actual risk. In this case, to minimize the actual risk R(α) one
has to minimize the right-hand side of inequality (4.1) (or (4.2)) simultaneously over both terms. Note, however, that the first term in inequality (4.1)
depends on a specific function of the set of functions, while the second term
depends on the VC dimension of the whole set of functions. To minimize
the right-hand side of the bound on the risk, (4.1) (or (4.2)), simultaneously
over both terms, one has to make the VC dimension a controlling variable.
The following general principle, which is called the Structural Risk Minimization (SRM) inductive principle, is intended to minimize the risk
functional with respect to both terms, the empirical risk and the confidence
interval (Vapnik and Chervonenkis, 1974).
Let the set S of functions Q(z, α), α ∈ Λ, be provided with a structure
consisting of nested subsets of functions S_k = {Q(z, α), α ∈ Λ_k}, such that
(Fig. 4.1)

$$S_1 \subset S_2 \subset \cdots \subset S_n \subset \cdots, \tag{4.3}$$

where the elements of the structure satisfy the following two properties:
FIGURE 4.1. A structure on a set of functions is determined by a sequence of nested subsets of functions.

(i) The VC dimension h_k of each set S_k of functions is finite.

(ii) Any element S_k of the structure contains either a set of totally bounded functions or a set of functions satisfying the inequality

$$\sup_{\alpha \in \Lambda_k} \frac{\left(\int Q^p(z, \alpha)\,dF(z)\right)^{1/p}}{\int Q(z, \alpha)\,dF(z)} \le \tau_k, \qquad p > 2. \tag{4.4}$$
FIGURE 4.2. The bound on the risk is the sum of the empirical risk and the
confidence interval. The empirical risk decreases with the index of the element of
the structure, while the confidence interval increases. The smallest bound on the
risk is achieved at some appropriate element of the structure.
$$S^* = \bigcup_{k=1}^{\infty} S_k.$$
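The SRM selection rule — minimize the empirical risk plus the confidence interval over the elements of the structure — can be sketched as follows. The empirical risks, the VC dimensions, the sample size, and the use of a bound of the form (4.1) with η = 0.05 are all illustrative assumptions:

```python
import math

def srm_bound(r, l, h, eta=0.05):
    """Bound of the form (4.1): R <= R_emp + E/2 * (1 + sqrt(1 + 4 R_emp / E)),
    with E = 4(h(ln(2l/h) + 1) - ln(eta/4))/l."""
    e = 4 * (h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l
    return r + e / 2 * (1 + math.sqrt(1 + 4 * r / e))

def srm_select(emp_risks, vc_dims, l):
    """Pick the element of the structure with the smallest risk bound."""
    bounds = [srm_bound(r, l, h) for r, h in zip(emp_risks, vc_dims)]
    k = min(range(len(bounds)), key=bounds.__getitem__)
    return k, bounds[k]

# Empirical risk decreases with the element index while capacity grows:
emp = [0.40, 0.15, 0.05, 0.04, 0.039]
dims = [2, 5, 10, 50, 200]
k, b = srm_select(emp, dims, l=1000)
print("chosen element:", k, "bound:", round(b, 3))
```

Neither the smallest nor the largest element wins: the minimum of the bound is attained at an intermediate element, exactly as Figure 4.2 describes.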
4.2. Asymptotic Analysis of the Rate of Convergence 93
For an asymptotic analysis of the SRM principle one considers a law determining, for any given ℓ, the number

$$n = n(\ell). \tag{4.5}$$

$$R(\alpha_0) = \inf_{\alpha \in \Lambda} \int Q(z, \alpha)\,dF(z),$$

$$\lim_{\ell \to \infty} \frac{\tau^2_{n(\ell)}\,h_{n(\ell)} \ln \ell}{\ell} = 0, \tag{4.7}$$
where
ρ(Q(z, α*), R(z, β*)) ≤ c
holds true.
³We say that the random variables ξ_ℓ, ℓ = 1, 2, …, converge to the value ξ₀
with asymptotic rate V(ℓ) if there exists a constant C such that
94 4. Controlling the Generalization Ability of Learning Processes
To provide the best rate of convergence one has to know the rate of
approximation r_n for the chosen structure. The problem of estimating r_n
for different structures on sets of functions is the subject of classical function
approximation theory. We will discuss this problem in the next section. If one
knows the rate of approximation r_n, one can a priori find the law n = n(ℓ)
that provides the best asymptotic rate of convergence by minimizing the
right-hand side of equality (4.6).
Example. Let Q(z, α), α ∈ Λ, be a set of functions satisfying the inequality (4.4) for p > 2 with τ_k ≤ τ* < ∞. Consider a structure for which
n = h_n. Let the asymptotic rate of approximation be described by the law

$$r_n = n^{-c}.$$

Then the asymptotic rate of convergence is

$$V(\ell) = \left(\frac{\ln \ell}{\ell}\right)^{\frac{c}{2c+1}}. \tag{4.9}$$
4Note, however, that a high asymptotic rate of convergence does not neces-
sarily reflect a high rate of convergence on a limited sample size.
4.3. The Problem of Function Approximation in Learning Theory 95
the admissible structure (on the sequence of pairs (h_n, τ_n), n = 1, 2, …)
and also depends on the rate of approximation r_n, n = 1, 2, ….
On the basis of this information one can evaluate the rate of convergence
by minimizing Eq. (4.6). Note that in Eq. (4.6) the second term,
which is responsible for the stochastic behavior of the learning processes, is
determined by nonasymptotic bounds on the risk (see Eqs. (4.1) and (4.2)).
The first term (which describes the deterministic component of the learning
processes) usually has only an asymptotic bound, however.
Classical approximation theory studies connections between the smooth-
ness properties of functions and the rate of approximation of the function
by the structure with elements Sn containing polynomials (algebraic or
trigonometric) of degree n, or expansions in other series with n terms. Usu-
ally, smoothness of an unknown function is characterized by the number s
of existing derivatives. Typical results on the asymptotic rate of approximation have the form

$$r_n = O\!\left(n^{-s/N}\right), \tag{4.10}$$
where N is the dimensionality of the input space (Lorentz, 1966). Note that
this implies that a high asymptotic rate of convergence5 in high-dimensional
spaces can be guaranteed only for very smooth functions.
In learning theory, we would like to find the rate of approximation in the
following case:
(ii) The elements Sk of the structure are not necessarily linear manifolds.
(They can be any set of functions with finite VC dimension.)
d ≥ 0. (4.11)
In terms of this concept the following theorem on the rate of approximation
r_n holds true:

Theorem 4.2. (Jones, Barron, and Breiman) Let the set of functions
f(x) satisfy Eq. (4.11). Then the rate of approximation of the desired functions by the best function of the elements of the structure is bounded by

$$r_n = O\!\left(\frac{1}{\sqrt{n}}\right)$$

if one of the following holds:
(i) The set of functions {f(x)} is determined by Eq. (4.11) with d = 0,
and the elements S_n of the structure contain the functions

$$f(x; \alpha, w, v) = \sum_{i=1}^{n} \alpha_i \sin\left[(x \cdot w_i) + v_i\right], \tag{4.12}$$

where α_i and v_i are arbitrary values and w_i are arbitrary vectors
(Jones, 1992).
(ii) The set of functions {f(x)} is determined by Eq. (4.11) with
d = 1, and the elements S_n of the structure contain the functions

$$f(x; \alpha, w, v) = \sum_{i=1}^{n} \alpha_i S\left[(x \cdot w_i) + v_i\right], \tag{4.13}$$

where α_i and v_i are arbitrary values, w_i are arbitrary vectors, and
S(u) is a sigmoid function (a monotonically increasing function such
that lim_{u→−∞} S(u) = −1, lim_{u→∞} S(u) = 1)
(Barron, 1993).
(iii) The set of functions {f(x)} is determined by Eq. (4.11) with d = 2,
and the elements S_n of the structure contain the functions

$$f(x; \alpha, w, v) = \sum_{i=1}^{n} \alpha_i \left|(x \cdot w_i) + v_i\right|_+, \qquad |u|_+ = \max(0, u), \tag{4.14}$$

where α_i and v_i are arbitrary values and w_i are arbitrary vectors
(Breiman, 1993).
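The three constructions of Theorem 4.2 can be written down directly. The one-dimensional inputs and all parameter values below are arbitrary illustrations:

```python
import math

# n-term approximators from Theorem 4.2 (1-D versions; the weights w_i,
# offsets v_i and coefficients a_i are free parameters):

def f_sin(x, a, w, v):       # case (i): sinusoids, d = 0  (Jones)
    return sum(ai * math.sin(wi * x + vi) for ai, wi, vi in zip(a, w, v))

def f_sigmoid(x, a, w, v):   # case (ii): sigmoids, d = 1  (Barron)
    return sum(ai * math.tanh(wi * x + vi) for ai, wi, vi in zip(a, w, v))

def f_ramp(x, a, w, v):      # case (iii): ramps |u|_+ = max(0, u), d = 2  (Breiman)
    return sum(ai * max(0.0, wi * x + vi) for ai, wi, vi in zip(a, w, v))

a, w, v = [1.0, -0.5], [2.0, 1.0], [0.0, 0.5]
print(f_sin(0.3, a, w, v), f_sigmoid(0.3, a, w, v), f_ramp(0.3, a, w, v))
```

Note that tanh satisfies exactly the sigmoid conditions of case (ii), and |u|_+ is the function used in case (iii).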
In spite of the fact that in this theorem the concept of smoothness is different from the number of bounded derivatives, one can observe a similar
phenomenon here as in the classical case: to keep a high rate of convergence
for a space with increasing dimensionality, one has to increase the smoothness of the functions simultaneously as the dimensionality of the space is
increased.
4.4. Examples of Structures for Neural Nets 97
$$E(w, \gamma_p) = \frac{1}{\ell}\sum_{i=1}^{\ell} L\left(y_i, f(x_i, w)\right) + \gamma_p \|w\|^2$$

with appropriately chosen Lagrange multipliers γ₁ > γ₂ > ⋯ > γ_n. The
well-known "weight decay" procedure refers to the minimization of this
functional.
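As a minimal sketch (not from the text) of how decreasing multipliers γ₁ > γ₂ > ⋯ induce elements of increasing capacity, the functional E(w, γ) can be minimized for several values of γ. The linear model, squared loss, data, and constants below are illustrative assumptions:

```python
import random

def minimize(data, gamma, dim, steps=2000, lr=0.01):
    """Gradient descent on E(w, gamma) = (1/l) sum (y - w.x)^2 + gamma*||w||^2
    (a sketch with a linear f and squared loss, not the book's algorithm)."""
    w = [0.0] * dim
    l = len(data)
    for _ in range(steps):
        grad = [2 * gamma * wi for wi in w]          # gradient of the penalty
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for j in range(dim):
                grad[j] += 2 * err * x[j] / l        # gradient of the loss term
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w

random.seed(0)
data = [((x1, x2), 3 * x1 - x2) for x1, x2 in
        [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(50)]]
# Decreasing gamma gives less heavily penalized (larger capacity) elements
# of the structure; the norm ||w|| of the solution grows accordingly.
norms = []
for gamma in (1.0, 0.1, 0.0):
    w = minimize(data, gamma, dim=2)
    norms.append(sum(wi * wi for wi in w) ** 0.5)
print([round(n, 3) for n in norms])
```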
3. A structure given by preprocessing.

Consider a neural net with fixed architecture. The input representation is
modified by a transformation z = K(x, β), where the parameter β controls
the degree of degeneracy introduced by this transformation (β could, for
instance, be the width of a smoothing kernel).
A structure is introduced in the set of functions S = {f(K(x, β), w), w ∈ W}
through β ≥ C_p, with C₁ > C₂ > ⋯ > C_n.
To implement the SRM principle using these structures, one has to know
(estimate) the VC dimension of each element S_k of the structure, and has
to be able, for any S_k, to find the function that minimizes the empirical
risk.
function K(x, x₀; β) which embodies the concept of neighborhood. This
function depends on the point x₀ and a "locality" parameter β ∈ (0, ∞)
and satisfies two conditions:

$$K_1(x, x_0; \beta) = \begin{cases} 1 & \text{if } \|x - x_0\| < \beta, \\ 0 & \text{otherwise}, \end{cases} \tag{4.16}$$

(4.17)
For the set of functions f(x, α), α ∈ Λ, let us consider the set of loss
functions Q(z, α) = L(y, f(x, α)), α ∈ Λ. Our goal is to minimize the local
risk functional

(4.19)

over both the set of functions f(x, α), α ∈ Λ, and different vicinities of
the point x₀ (defined by the parameter β) in situations when the probability
100 4. Controlling the Generalization Ability of Learning Processes
simultaneously for all bounded functions A ≤ L(y, f(x, α)) ≤ B, α ∈ Λ, and
all functions 0 ≤ K(x, x₀, β) ≤ 1, β ∈ (0, ∞), the inequality
Along with the SRM inductive principle, which is based on the statistical
analysis of the rate of convergence of empirical processes, there exists an-
other principle of inductive inference for small sample sizes, the so-called
Minimum Description Length (MDL) principle, that is based on an infor-
mation theoretic analysis of the randomness concept. In this section we
consider the MDL principle and point out the connections between the
SRM and the MDL principles for the pattern recognition problem.
In 1965 Kolmogorov defined a random string using the concept of algo-
rithmic complexity.
4.6. The Minimum Description Length (MDL) and SRM Principles 101
FIGURE 4.5. Using linear functions one can estimate the unknown smooth func-
tion in the vicinity of any point of interest.
(4.20)
ω₁*, …, ω_ℓ* (4.22)

for which the Hamming distance between string (4.20) and string (4.22) is
minimal (i.e., the number of errors in decoding string (4.20) by this table
T is minimal).
Suppose we found a perfect table T_o for which the Hamming distance
between the generated string (4.22) and string (4.20) is zero. This table
decodes the string (4.20).
Since the code-book C_b is fixed, to describe the string (4.20) it is sufficient
to give the number o of the table T_o in the code-book. The minimal number of
7Formally speaking, to get tables of finite length in the code-book, the input
vector x has to be discrete. However, as we will see, the number of levels in quan-
tization will not affect the bounds on generalization ability. Therefore one can
consider any degree of quantization, even giving tables with an infinite number
of entries.
4.6. The Minimum Description Length (MDL) and SRM Principles 103
bits to describe the number of any one of the N tables is ⌈log₂ N⌉, where
⌈A⌉ is the minimal integer that is not smaller than A. Therefore in this case,
to describe string (4.20), we need ⌈log₂ N⌉ (rather than ℓ) bits. Thus using a
code-book with a perfect decoding table, we can compress the description
length of string (4.20) by a factor

$$K(T_o) = \frac{\lceil \log_2 N \rceil}{\ell}. \tag{4.23}$$

Let us call K(T) the coefficient of compression for the string (4.20).
Consider now the general case: the code-book C_b does not contain the
perfect table. Let the smallest Hamming distance between the strings (generated string (4.22) and desired string (4.20)) be d ≥ 0. Without loss of
generality we can assume that d ≤ ℓ/2. (Otherwise, instead of the smallest
distance one could look for the largest Hamming distance and during decoding change one to zero and vice versa. This will cost one extra bit in the
coding scheme.) This means that to describe the string one has to make d
corrections to the results given by the chosen table in the code-book.
For fixed d there are C_ℓ^d different possible corrections to the string of
length ℓ. To specify one of them (i.e., to specify the number of one of the
C_ℓ^d variants) one needs ⌈log₂ C_ℓ^d⌉ bits.
Therefore to describe the string (4.20) we need ⌈log₂ N⌉ bits to define
the number of the table, and ⌈log₂ C_ℓ^d⌉ bits to describe the corrections. We
also need ⌈log₂ d⌉ + Δ_d bits to specify the number of corrections d, where
Δ_d < 2 log₂ log₂ d, d > 2. Altogether, we need ⌈log₂ N⌉ + ⌈log₂ C_ℓ^d⌉ + ⌈log₂ d⌉ + Δ_d
bits for describing string (4.20). This number should be compared to ℓ,
the number of bits needed to describe an arbitrary binary string (4.20).
Therefore the coefficient of compression is

$$K(T) = \frac{\lceil \log_2 N \rceil + \lceil \log_2 C_\ell^d \rceil + \lceil \log_2 d \rceil + \Delta_d}{\ell}. \tag{4.24}$$
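The bit counting above translates directly into code. The small additive term Δ_d is omitted, and the numbers N, ℓ, d below are arbitrary inputs:

```python
import math

def log2ceil(x):
    """Minimal number of bits needed to index one of x alternatives."""
    return math.ceil(math.log2(x))

def compression_coefficient(N, l, d):
    """K(T) in the spirit of Eq. (4.24), omitting Delta_d: bits for the table
    number, for one of the C(l, d) corrections, and for d itself, over l
    (assumes d >= 2)."""
    bits = log2ceil(N) + log2ceil(math.comb(l, d)) + log2ceil(d)
    return bits / l

# A table that makes few errors on a long string compresses it well;
# many errors (large d) destroy the compression:
print(round(compression_coefficient(N=1024, l=1000, d=10), 3))
print(round(compression_coefficient(N=1024, l=1000, d=100), 3))
```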
In the beginning of this section we considered the bound (4.1) for the
generalization ability of a learning machine, for the pattern recognition
problem. For the particular case where the learning machine has a finite
number N of functions, we obtained that with probability of at least 1 − η
the inequality

$$R(T_i) \le R_{\text{emp}}(T_i) + \frac{\ln N - \ln\eta}{\ell}\left(1 + \sqrt{1 + \frac{2R_{\text{emp}}(T_i)\,\ell}{\ln N - \ln\eta}}\right) \tag{4.25}$$
holds true simultaneously for all N functions in the given set of functions
(for all N tables in the given code-book). Let us transform the right-hand
side of this inequality using the concept of the compression coefficient, and
the fact that R_emp(T) = d/ℓ. The inequality

$$\frac{d}{\ell} + \frac{\ln N - \ln\eta}{\ell}\left(1 + \sqrt{1 + \frac{2d}{\ln N - \ln\eta}}\right) < 2\left(\frac{\left(\lceil \log_2 N\rceil + \lceil \log_2 C_\ell^d\rceil + \lceil \log_2 d\rceil + \Delta_d\right)\ln 2 - \ln\eta}{\ell}\right) \tag{4.26}$$
is valid (one can easily check it). Now let us rewrite the right-hand side of
inequality (4.26) in terms of the compression coefficient (4.24).
Since inequality (4.25) holds true with probability of at least 1 − η and
inequality (4.26) holds with probability 1, the inequality

$$R(T) < 2\left(K(T)\ln 2 - \frac{\ln \eta}{\ell}\right) \tag{4.27}$$

holds true.
⌈log₂ m⌉ + Δ_m, Δ_m < 2⌈log₂ log₂ m⌉ bits) and then, using this code-book, describe the string (which, as shown above, takes ⌈log₂ N⌉ + ⌈log₂ C_ℓ^d⌉ + ⌈log₂ d⌉ +
Δ_d bits).
The total length of the description in this case is ⌈log₂ N⌉ + ⌈log₂ C_ℓ^d⌉ +
⌈log₂ d⌉ + Δ_d + ⌈log₂ m⌉ + Δ_m, and the compression coefficient is

$$K(T) = \frac{\lceil \log_2 N\rceil + \lceil \log_2 C_\ell^d\rceil + \lceil \log_2 d\rceil + \Delta_d + \lceil \log_2 m\rceil + \Delta_m}{\ell}.$$
For this case an inequality analogous to inequality (4.27) holds. Therefore,
the probability of error for the table which was used for compressing the
description of string (4.20), is bounded by inequality (4.27).
Thus we have proved the following theorem:
Theorem 4.3. If, on a given structure of code-books, one compresses by
a factor K(T) the description of string (4.20) using a table T, then with
probability of at least 1 − η one can assert that the probability of committing
an error by the table T is bounded by

$$p(T) < 2\left(K(T)\ln 2 - \frac{\ln \eta}{\ell}\right). \tag{4.28}$$
Note how powerful the concept of the compression coefficient is: to obtain
a bound on the probability of error, we actually need only information
about this coefficient.⁸ We do not need such details as
(i) How many examples we used,
(ii) how the structure of code-books was organized,
(iii) which code-book was used,
(iv) how many tables were in the code-book,
(v) how many training errors were made using this table.
Nevertheless, the bound (4.28) is not much worse than the bound on the
risk (4.25) obtained on the basis of the theory of uniform convergence.
The latter has a more sophisticated structure and uses information about
the number of functions (tables) in the sets, the number of errors on the
training set, and the number of elements of the training set.
Note also that the bound (4.28) cannot be improved by more than a factor of
2: it is easy to show that in the case when there exists a perfect table in
the code-book, equality can be achieved with factor 1.
This theorem justifies the MDL principle: to minimize the probability of
error one has to minimize the coefficient of compression.
⁸The second term, −ln η/ℓ, on the right-hand side is actually fool-proof: for
reasonable η and ℓ it is negligible compared to the first term, but it prevents one
from considering too small η and/or too small ℓ.
106 4. Controlling the Generalization Ability of Learning Processes
The MDL principle works well when the problem of constructing reasonable code-books has an obvious solution. But even in this case it is
not better than the SRM principle. Recall that the bound for the MDL
principle (which cannot be improved using only the concept of the compression
coefficient) was obtained by roughening the bound for the SRM principle.
Informal Reasoning and Comments - 4
(4.30)
$$\bigcup_{i=1}^{\infty} M_i = M, \tag{4.31}$$

and for any subset M_i to find a function f_i* ∈ M_i minimizing the distance.
Ivanov proved that under some general conditions the sequence of solutions

f₁*, …, f_n*, …

converges to the desired one.
The quasi-solution method was suggested at the same time that Tikhonov
proposed his regularization technique; in fact, the two are equivalent. In the
regularization technique, one introduces a non-negative lower semi-continuous
functional Ω(f) that possesses the following properties:
(i) The domain of the functional coincides with M (the domain to which
the solution of Eq. (4.29) belongs).
(ii) The region for which the inequality

$$\Omega(f) \le c \tag{4.32}$$

holds true is compact for any c ≥ 0.
$$\lim_{\delta \to 0} \gamma(\delta) = 0, \qquad \lim_{\delta \to 0} \frac{\delta^2}{\gamma(\delta)} = 0.$$
In both methods the formal convergence proofs do not explicitly contain
"capacity control." Essential, however, was the fact that any subset M_i in
Ivanov's scheme and any subset M_c = {f : Ω(f) ≤ c} in Tikhonov's scheme
is compact. That means it has a bounded capacity (a metric ε-entropy).
Therefore both schemes implement an SRM principle: first define a struc-
ture on the set of admissible functions such that any element of the struc-
ture has a finite capacity, increasing with the number of the element. Then,
on any element of the structure, the function providing the best approxi-
mation of the right-hand side of the equation is found. The sequence of the
obtained solutions converges to the desired one.
(4.33)
then according to inequality (4.33) one obtains the estimates p_ℓ(t) whose
deviation from the desired solution can be described as follows:
(4.35)
¹⁰By the way, one can obtain all classical estimators if one approximates an
unknown distribution function F(x) by the empirical distribution function
F_ℓ(x). The empirical distribution function, however, is not the best approximation
to the distribution function since, according to the definition, the distribution
function should be absolutely continuous, while the empirical distribution
function is discontinuous. Using absolutely continuous approximations (e.g.,
a polygon in the one-dimensional case) one can obtain estimators that, in addition
to nice asymptotic properties (shared by the classical estimators), possess
some useful properties from the point of view of limited numbers of observations
(Vapnik, 1988).
112 4. Informal Reasoning and Comments - 4
and so on.
To choose the polynomial of the best degree, one can minimize the following functional (the right-hand side of bound (3.30)):

$$R(a, m) = \frac{\frac{1}{\ell}\sum_{i=1}^{\ell}\left(y_i - f_m(x_i, a)\right)^2}{\left(1 - C\sqrt{\mathcal{E}}\right)_+}, \tag{4.36}$$

where

$$Q(z, a) = (y - f_m(x, a))^2, \quad a \in \Lambda,$$

and C is a constant determining the "tails of distributions" (see Sections
3.4 and 3.7).
One can show that the VC dimension h of this set of real functions
is bounded as h_m ≤ 2m + 2.
To find the best approximating polynomial, one has to choose both the degree m of the polynomial and the coefficients a that minimize functional¹¹ (4.36).

$$3m \le h^* \le 4m + 3$$

(Karpinski and Werther, 1989). Therefore our set of loss functions has
VC dimension less than 8m + 6. This estimate can be used for finding the
sparse algebraic polynomial that minimizes the functional (4.36).
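Functional (4.36) can be sketched in code as follows. The exact form of the term ℰ, the constant C = 2, the bound h_m = 2m + 2, and the synthetic data are illustrative assumptions:

```python
import math, random

def empirical_risk(xs, ys, m):
    """Least-squares degree-m polynomial fit via the normal equations,
    returning the mean squared residual (stdlib only, small systems)."""
    l, n = len(xs), m + 1
    X = [[x ** j for j in range(n)] for x in xs]
    A = [[sum(X[i][r] * X[i][c] for i in range(l)) for c in range(n)] for r in range(n)]
    b = [sum(X[i][r] * ys[i] for i in range(l)) for r in range(n)]
    M = [row[:] + [b[r]] for r, row in enumerate(A)]
    for c in range(n):                     # Gauss-Jordan with partial pivoting
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * bb for a, bb in zip(M[r], M[c])]
    coef = [M[r][n] / M[r][r] for r in range(n)]
    return sum((y - sum(cj * row[j] for j, cj in enumerate(coef))) ** 2
               for row, y in zip(X, ys)) / l

def bound(xs, ys, m, c=2.0):
    """Functional (4.36): empirical risk over (1 - C*sqrt(E))_+ with
    h_m = 2m + 2; the form of E and the constant C are illustrative."""
    l, h = len(xs), 2 * m + 2
    e = h * (math.log(2 * l / h) + 1) / l
    denom = 1 - c * math.sqrt(e)
    return empirical_risk(xs, ys, m) / denom if denom > 0 else float("inf")

random.seed(1)
xs = [random.uniform(-1, 1) for _ in range(200)]
ys = [x ** 3 - x + random.gauss(0, 0.05) for x in xs]
best = min(range(1, 8), key=lambda m: bound(xs, ys, m))
print("chosen degree:", best)
```

On data generated by a cubic plus noise, the functional selects a low degree rather than the highest degree that best fits the sample.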
$$y = \sum_{j=1}^{m} a_j K(x, w_j) + a_0,$$
$$R(\phi) = \int \left(f(x, \alpha) - \phi(x\,|\,[Y, X])\right)^2 P([Y, X]\,|\,\alpha)\,P(\alpha)\,dx\,d\alpha\,d([Y, X]).$$

To find this estimator in explicit form one has to carry out the integration analytically
(numerical integration is impossible due to the high dimensionality of α). Unfortunately, analytic integration of this expression is mostly an unsolvable problem.
4.11. The Problem of Capacity Control and Bayesian Inference 117
Let us, for simplicity, consider the case where the noise is distributed according to the normal law

$$P(\xi) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{\xi^2}{2\sigma^2}\right\}. \tag{4.39}$$
(i) The given set of functions of the learning machine coincides with the
set of problems to be solved.
¹⁴This part of the a priori information is not as important as the first one. One can
prove that with an increasing number of observations the influence of an inaccurate
description of P(α) decreases.
$$S_1 \subset S_2 \subset \cdots \subset S$$

on the set of loss functions S = {Q(z, α), α ∈ Λ}, and then choose both an
appropriate element S_k of the structure and a function Q(z, α_ℓᵏ) ∈ S_k in
this element that minimizes the corresponding bounds, for example, bound
(4.1). The bound (4.1) can be rewritten in the simple form

$$R(\alpha_\ell^k) \le R_{\text{emp}}(\alpha_\ell^k) + \Phi\left(\frac{\ell}{h_k}\right), \tag{5.1}$$
120 5. Constructing Learning Algorithms
where the first term is the empirical risk and the second term is the confi-
dence interval.
There are two constructive approaches to minimizing the right-hand side
of inequality (5.1).
In the first approach, during the design of the learning machine one
determines a set of admissible functions with some VC dimension h*. For
a given amount ℓ of training data, the value h* determines the confidence
interval Φ(ℓ/h*) for the machine. Choosing an appropriate element of the
structure is therefore a problem of designing the machine for a specific
amount of data.
During the learning process this machine minimizes the first term of the
bound (5.1) (the number of errors on the training set).
If for a given amount of training data one designs too complex a machine,
the confidence interval Φ(ℓ/h*) will be large. In this case even if one could
minimize the empirical risk down to zero, the number of errors on the test
set could still be large. This phenomenon is called overfitting.
To avoid overfitting (to get a small confidence interval) one has to con-
struct machines with small VC dimension. On the other hand, if the set of
functions has a small VC dimension, then it is difficult to approximate the
training data (to get a small value for the first term in inequality (5.1)).
To obtain a small approximation error and simultaneously keep a small
confidence interval one has to choose the architecture of the machine to
reflect a priori knowledge about the problem at hand.
Thus, to solve the problem at hand by these types of machines, one first
has to find the appropriate architecture of the learning machine (which is
a result of the trade-off between overfitting and poor approximation) and
second, find in this machine the function that minimizes the number of
errors on the training data. This approach to minimizing the right-hand
side of inequality (5.1) can be described as follows:
Keep the confidence interval fixed (by choosing an appropriate construction of the machine) and minimize the empirical risk.
The second approach to the problem of minimizing the right-hand side
of inequality (5.1) can be described as follows:
Keep the value of the empirical risk fixed (say, equal to zero) and minimize
the confidence interval.
Below we consider two different types of learning machines that imple-
ment these two approaches:
(i) Neural Networks (which implement the first approach), and
(ii) Support Vector machines (which implement the second approach).
Both types of learning machines are generalizations of the learning ma-
chines with a set of linear indicator functions constructed in the 1960s.
5.2. Sigmoid Approximation of Indicator Functions 121
Consider the problem of minimizing the empirical risk on the set of linear
indicator functions
(5.3)
If the training set is separable without error (i.e., the empirical risk can
become zero), then there exists a finite-step procedure that allows us to find
such a vector w₀, for example the procedure that Rosenblatt proposed for
the perceptron (see the Introduction).
The problem arises when the training set cannot be separated without
errors. In this case the problem of separating the training data with the
smallest number of errors is NP-complete. Moreover, one cannot apply regular gradient-based procedures to find a local minimum of functional (5.3),
since for this functional the gradient is either equal to zero or undefined.
Therefore, the idea was proposed to approximate the indicator functions
(5.2) by so-called sigmoid functions (see Fig. 0.3)

$$f(x, w) = S\{(w \cdot x)\}, \tag{5.4}$$

for example,

$$S(u) = \tanh u = \frac{\exp(u) - \exp(-u)}{\exp(u) + \exp(-u)}.$$
For the set of sigmoid functions, the empirical risk functional

$$R_{\text{emp}}(w) = \frac{1}{\ell}\sum_{j=1}^{\ell}\left(y_j - S\{(w \cdot x_j)\}\right)^2$$

is smooth in w, so it can be minimized by standard gradient-based methods, with step values γ_n satisfying the conditions

$$\sum_{n=1}^{\infty}\gamma_n = \infty, \qquad \sum_{n=1}^{\infty}\gamma_n^2 < \infty.$$
Thus, the idea is to use the sigmoid approximation at the stage of esti-
mating the coefficients, and use the threshold functions (with the obtained
coefficients) for the last neuron at the stage of recognition.
(i) The neural net contains m + 1 layers: the first layer x(0) describes
the input vector x = (x¹, …, xⁿ). We denote the input vectors by

x_i(0) = (x_i¹(0), …, x_iⁿ(0)), i = 1, …, ℓ,

and the image of the input vector x_i(0) on the kth layer by

x_i(k) = (x_i¹(k), …, x_i^{n_k}(k)), i = 1, …, ℓ,

where we denote by n_k the dimensionality of the vectors x_i(k), i =
1, …, ℓ (n_k, k = 1, …, m − 1, can be any number, but n_m = 1).

(ii) The layer k − 1 is connected with the layer k through the (n_k × n_{k−1})
matrix w(k):

$$x_i(k) = S\{w(k)\,x_i(k-1)\}, \quad k = 1, 2, \ldots, m, \quad i = 1, \ldots, \ell, \tag{5.5}$$

where S{w(k) x_i(k − 1)} defines the sigmoid function of the vector

u_i(k) = w(k) x_i(k − 1) = (u_i¹(k), …, u_i^{n_k}(k))
$$w(k) \leftarrow w(k) - \gamma\,\frac{\partial L(W, X, B)}{\partial w(k)}, \quad k = 1, \ldots, m.$$
Using the back-propagation technique one can achieve a local minimum for
the empirical risk functional.
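A minimal back-propagation sketch for a two-layer net with the update w(k) ← w(k) − γ ∂L/∂w(k). The squared loss, architecture sizes, learning rate, and single training pair are illustrative assumptions:

```python
import math, random

def forward(W1, W2, x):
    """Two-layer net: hidden h = S{W1 x}, output o = S{W2 h}, with S = tanh."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    o = math.tanh(sum(w * hi for w, hi in zip(W2, h)))
    return h, o

def backprop_step(W1, W2, x, y, lr=0.2):
    """One update w(k) <- w(k) - lr * dL/dw(k) for the loss L = (y - o)^2."""
    h, o = forward(W1, W2, x)
    delta_o = -2 * (y - o) * (1 - o * o)              # dL/du at the output unit
    delta_h = [delta_o * w2 * (1 - hi * hi) for w2, hi in zip(W2, h)]
    W2 = [w - lr * delta_o * hi for w, hi in zip(W2, h)]
    W1 = [[w - lr * dh * xi for w, xi in zip(row, x)]
          for row, dh in zip(W1, delta_h)]
    return W1, W2

random.seed(0)
W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(3)]
W2 = [random.uniform(-1, 1) for _ in range(3)]
x, y = (0.5, -0.3), 1.0
losses = []
for _ in range(300):
    _, o = forward(W1, W2, x)
    losses.append((y - o) ** 2)
    W1, W2 = backprop_step(W1, W2, x, y)
print(round(losses[0], 4), "->", round(losses[-1], 4))
```

As the text says, this procedure finds a local minimum of the empirical risk; nothing guarantees a global one.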
$$(w \cdot x) + b = 0. \tag{5.7}$$
It is easy to check that the Optimal hyperplane is the one that satisfies the
conditions (5.8) and minimizes
(5.9)
(The minimization is taken with respect to both vector w and scalar b.)
FIGURE 5.2. The Optimal separating hyperplane is the one that separates the
data with maximal margin.
$$\|w\| \le A$$

has the VC dimension h bounded by the inequality

$$h \le \min\left(\left[R^2 A^2\right], n\right) + 1.$$
5.5. Constructing the Optimal Hyperplane 129
$$\Phi(w) = \frac{1}{2}(w \cdot w) \tag{5.10}$$

$$L(w, b, \alpha) = \frac{1}{2}(w \cdot w) - \sum_{i=1}^{\ell}\alpha_i\left\{\left[(x_i \cdot w) + b\right]y_i - 1\right\}, \tag{5.12}$$
At the saddle point, the solutions w₀, b₀, and α⁰ should satisfy the conditions

$$\frac{\partial L(w_0, b_0, \alpha^0)}{\partial b} = 0, \qquad \frac{\partial L(w_0, b_0, \alpha^0)}{\partial w} = 0.$$
Rewriting these equations in explicit form one obtains the following prop-
erties of the Optimal hyperplane:
(i) The coefficients α_i⁰ for the Optimal hyperplane should satisfy the
constraints

$$\sum_{i=1}^{\ell}\alpha_i^0 y_i = 0, \qquad \alpha_i^0 \ge 0, \quad i = 1, \ldots, \ell \tag{5.13}$$
(first equation).
(second equation).
(iii) Moreover, only the so-called support vectors can have nonzero coefficients α_i⁰ in the expansion of w₀. The support vectors are the vectors
for which, in inequality (5.11), the equality is achieved. Therefore we
obtain

$$w_0 = \sum_{\text{support vectors}} y_i \alpha_i^0 x_i, \qquad \alpha_i^0 \ge 0. \tag{5.15}$$
This fact follows from the classical Kühn-Tucker theorem, according to which the necessary and sufficient conditions for the Optimal
hyperplane are that the separating hyperplane satisfy the conditions:
Putting the expression for w₀ into the Lagrangian and taking into account
the Kühn-Tucker conditions, one obtains the functional

$$W(\alpha) = \sum_{i=1}^{\ell}\alpha_i - \frac{1}{2}\sum_{i,j}^{\ell}\alpha_i\alpha_j y_i y_j (x_i \cdot x_j). \tag{5.17}$$
$$\alpha_i \ge 0, \quad i = 1, \ldots, \ell \tag{5.18}$$
According to Eq. (5.15), the Lagrange multipliers and support vectors determine the Optimal hyperplane. Thus, to construct the Optimal hyperplane
one has to solve a simple quadratic programming problem: maximize the
quadratic form (5.17) under constraints⁴ (5.18) and (5.19).
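A crude sketch of this quadratic programming problem, using projected gradient ascent rather than a proper QP solver; the toy data, step size, and iteration count are assumptions:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_optimal_hyperplane(X, y, steps=5000, lr=0.01):
    """Maximize W(alpha) of (5.17): the gradient is projected onto the
    constraint sum(alpha_i * y_i) = 0 and alpha is clipped at 0."""
    l = len(X)
    K = [[dot(X[i], X[j]) for j in range(l)] for i in range(l)]
    alpha = [0.0] * l
    for _ in range(steps):
        grad = [1 - sum(alpha[j] * y[i] * y[j] * K[i][j] for j in range(l))
                for i in range(l)]
        lam = sum(g * yi for g, yi in zip(grad, y)) / l
        grad = [g - lam * yi for g, yi in zip(grad, y)]   # project the gradient
        alpha = [max(0.0, a + lr * g) for a, g in zip(alpha, grad)]
    w = [sum(alpha[i] * y[i] * X[i][k] for i in range(l))
         for k in range(len(X[0]))]                        # Eq. (5.15)
    sv = [i for i in range(l) if alpha[i] > 1e-6]          # support vectors
    b = sum(y[i] - dot(w, X[i]) for i in sv) / len(sv)     # threshold estimate
    return w, b, alpha

X = [(1.0, 1.0), (2.0, 0.0), (-1.0, -1.0), (-2.0, 0.0)]
y = [1, 1, -1, -1]
w, b, alpha = train_optimal_hyperplane(X, y)
print([1 if dot(w, x) + b > 0 else -1 for x in X])
```

Only a subset of the α_i ends up nonzero: the expansion of w₀ is carried by the support vectors alone.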
Let α⁰ = (α₁⁰, …, α_ℓ⁰) be a solution to this quadratic optimization problem. Then the norm of the vector w₀ corresponding to the Optimal hyperplane equals

where x_i are the support vectors, α_i⁰ are the corresponding Lagrange coefficients, and b₀ is the constant (threshold)
function

$$F_1(\xi) = \sum_{i=1}^{\ell}\xi_i,$$
where the parameters α_i, i = 1, …, ℓ, and C* are the solutions to the following
convex optimization problem:
Maximize the functional

$$W(\alpha, C^*) = \sum_{i=1}^{\ell}\alpha_i - \frac{1}{2C^*}\sum_{i,j=1}^{\ell}\alpha_i\alpha_j y_i y_j (x_i \cdot x_j) - \frac{C^*}{2}$$

subject to the constraints

$$\sum_{i=1}^{\ell} y_i\alpha_i = 0,$$

$$C^* \ge 0$$
5.6. Support Vector (SV) Machines 133
one has to find the parameters α_i, i = 1, …, ℓ, that maximize the same
quadratic form as in the separable case, subject to the constraints

$$0 \le \alpha_i \le C, \quad i = 1, \ldots, \ell, \qquad \sum_{i=1}^{\ell} y_i\alpha_i = 0.$$

As in the separable case, only some of the coefficients α_i, i = 1, …, ℓ, differ
from zero. They determine the support vectors.
Note that if the coefficient C in the functional Φ(w, ξ) is equal to the
optimal value of the parameter C* for minimization of the functional F₁(ξ),

$$C = C^*,$$
The Support Vector (SV) machine implements the following idea: it maps
the input vectors x into a high-dimensional feature space Z through some
nonlinear mapping, chosen a priori. In this space, an Optimal separating
hyperplane is constructed (Fig. 5.3).
134 5. Constructing Learning Algorithms
FIGURE 5.3. The SV machine maps the input space into a high-dimensional
feature space and then constructs an Optimal hyperplane in the feature space.
Two problems arise in the above approach: one conceptual and one tech-
nical.
(i) How to find a separating hyperplane that will generalize well?
(The conceptual problem.)
The dimensionality of the feature space will be huge, and a hyperplane
that separates the training data will not necessarily generalize well. 5
(ii) How to treat computationally such high-dimensional spaces?
(The technical problem.)
To construct a polynomial of degree 4 or 5 in a 200-dimensional
space it is necessary to construct hyperplanes in a billion-dimensional
feature space. How can this "curse of dimensionality" be overcome?
5Recall Fisher's concern about the small amount of data for constructing a
quadratic discriminant function in classical discriminant analysis (Section 1.9).
⁶One can compare the result of this theorem to the result of an analysis of the following compression scheme. To construct the Optimal separating hyperplane one
only needs to specify, among the training data, the support vectors and their classification. This requires ⌈log₂ m⌉ bits to specify the number m of support vectors,
⌈log₂ C_ℓ^m⌉ bits to specify the support vectors, and ⌈log₂ C_m^{m₁}⌉ bits to specify the representatives of the first class among the support vectors. Therefore for m ≪ ℓ
and m₁ ≈ m/2 the compression coefficient is
does not need to consider the feature space in explicit form. One only has
to be able to calculate the inner products between support vectors and the
vectors of the feature space (Eqs. (5.17) and (5.20)).
Consider a general expression for the inner product in Hilbert space:⁷

$$K(u, v) = \sum_{k=1}^{\infty} a_k \psi_k(u)\,\psi_k(v) \tag{5.24}$$

with positive coefficients a_k > 0 (i.e., K(u, v) describes an inner product in
some feature space); it is necessary and sufficient that the condition
7 This idea was used in 1964 by Aizerman, Braverman, and Rozonoer in their analysis of the convergence properties of the method of potential functions (Aizerman, Braverman, and Rozonoer, 1964, 1970). It happened at the same time (1965) when the method of the optimal hyperplane was developed (Vapnik and Chervonenkis, 1965). However, combining these two ideas, which led to the SV machines, was done only in 1992.
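The point that the feature space never has to be touched explicitly can be illustrated with a standard textbook example (not taken from this chapter): for two-dimensional inputs, the polynomial kernel (1 + u·v)² equals the inner product of an explicit six-dimensional feature map.

```python
import numpy as np

def poly_kernel(u, v, d=2):
    """K(u, v) = (1 + u.v)^d, computed without any feature map."""
    return (1.0 + np.dot(u, v)) ** d

def phi(x):
    """Explicit degree-2 feature map for 2-d inputs, chosen so that
    phi(u).phi(v) equals (1 + u.v)^2 term by term."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 * x1, x2 * x2, s * x1 * x2])

u, v = np.array([0.3, -1.2]), np.array([2.0, 0.5])
assert np.isclose(poly_kernel(u, v), phi(u) @ phi(v))
```

The kernel costs one 2-d inner product; the explicit map already needs six coordinates, and the gap grows combinatorially with the degree and the input dimension.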
To find the coefficients α_i in the separable case (analogously in the nonseparable case) it is sufficient to find the maximum of the functional
W(α) = Σ_{i=1}^ℓ α_i − (1/2) Σ_{i,j=1}^ℓ α_i α_j y_i y_j K(x_i, x_j).   (5.26)
[Figure: The SV machine as a two-layer network. A nonlinear transformation based on the support vectors x_1, ..., x_N feeds the decision rule y = sign(Σ_{i=1}^N y_i α_i K(x, x_i) − b).]
Note that the support vector machines implement the SRM principle. Indeed, let
S_k = { sign((w · z) − b) : R²|w|² ≤ k },
where R is the radius of the smallest sphere that contains the vectors Ψ(x), and |w| is the norm of the weights (we use canonical hyperplanes in feature space with respect to the vectors z_i = Ψ(x_i), where x_i are the elements of the training data).
According to Theorem 5.1 (now applied in the feature space), k gives an
estimate of the VC dimension of the set of functions Sk.
The SV machine separates the training data without error, and for the norm of the corresponding weight vector the equality
|w₀|² = Σ_{i,j=1}^ℓ α_i⁰ α_j⁰ K(x_i, x_j) y_i y_j   (5.28)
holds true. To control the generalization ability of the machine (to minimize the probability of test errors) one has to construct the separating hyperplane that minimizes the functional
Φ(w) = R²|w₀|².   (5.29)
Indeed, for separating hyperplanes the probability of test errors with probability 1 − η is bounded by the expression
ε = 4 ( h(ln(2ℓ/h) + 1) − ln(η/4) ) / ℓ.
The right-hand side attains its minimum when h/ℓ is minimal. We estimate the minimum of h/ℓ by estimating h by h_est = R²|w₀|². To estimate this functional it is sufficient to estimate |w₀|² (say by expression (5.28))
and estimate R² by finding
R² = min_a max_i |Ψ(x_i) − a|².   (5.31)
(5.32)
(the parameter |w₀| depends on the chosen radius as well). This functional describes a trade-off between the chosen radius R_β, the value of the minimum of the norm |w₀|, and the number ℓ_β of training vectors that fall into the radius R_β.
Radial Basis Function Machines
Classical Radial Basis Function (RBF) machines use the following set of decision rules:
f(x) = sign( Σ_{i=1}^N a_i K_γ(|x − x_i|) − b ),   (5.33)
where K_γ(|x − x_i|) depends on the distance |x − x_i| between two vectors. For the theory of RBF machines see (Micchelli, 1986), (Powell, 1992).
For any fixed γ, the function K_γ(z) is a non-negative monotonic function that tends to zero as z goes to infinity. The most popular function of this type is
K_γ(|x − x_i|) = exp{−γ|x − x_i|²}.   (5.34)
To construct the decision rule (5.33) one has to estimate
(i) the value of the parameter γ,
(ii) the number N of centers,
(iii) the vectors (centers) x_i, i = 1, ..., N, and
(iv) the values of the parameters a_i, i = 1, ..., N.
In the classical RBF method the first three steps (determining the parameters γ, N, and the vectors (centers) x_i, i = 1, ..., N) are based on heuristics, and only the fourth step (after finding these parameters) is determined by minimizing the empirical risk functional.
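A minimal sketch of a decision rule of the form (5.33) with the kernel (5.34); all parameter values below (centers, coefficients, γ, b) are hand-picked for illustration, in the spirit of the heuristic choices the classical method relies on:

```python
import numpy as np

def rbf(x, xi, gamma):
    """K_gamma(|x - x_i|) = exp(-gamma |x - x_i|^2), as in Eq. (5.34)."""
    return np.exp(-gamma * np.sum((x - xi) ** 2))

def rbf_rule(x, centers, a, b, gamma):
    """Decision rule of the form (5.33): sign of a weighted sum of
    radial basis functions centered at the chosen vectors."""
    s = sum(ai * rbf(x, ci, gamma) for ai, ci in zip(a, centers))
    return np.sign(s - b)

# hand-picked parameters standing in for the heuristic choices of the
# classical method: two centers, one per class
centers = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
a, b, gamma = [1.0, -1.0], 0.0, 1.0
print(rbf_rule(np.array([0.1, -0.2]), centers, a, b, gamma))  # 1.0
```

A point near the first center is dominated by the positive-weight basis function and is classified as +1; near the second center the rule outputs −1.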
The radial function can be chosen as a function for the convolution of the
inner product for a SV machine. In this case, the SV machine will construct
a function from the set (5.33). One can show (Aizerman, Braverman, and Rozonoer, 1964, 1970) that radial functions (5.34) satisfy the condition of Theorem 5.3.
In contrast to classical RBF methods, in the SV technique all four types of parameters are chosen to minimize the bound on the probability of test error by controlling the parameters R, w₀ in the functional (5.29). By minimizing the functional (5.29) one determines
(i) N, the number of support vectors;
(ii) x_i, (the pre-images of) the support vectors;
(iii) a_i = α_i y_i, the coefficients of expansion; and
(iv) γ, the width parameter of the kernel function.
(i) N, the number of neurons in the first (hidden) layer (the number of support vectors),
(ii) the vectors of the weights w_i = x_i in the neurons of the first (hidden) layer (the support vectors), and
(iii) the vector of weights for the second layer (the values of α).
FIGURE 5.5. Two classes of vectors are represented in the picture by black and
white balls. The decision boundaries were constructed using an inner product of
polynomial type with d = 2. In the pictures the examples cannot be separated
without errors; the errors are indicated by crosses and the support vectors by
double circles.
(i) Experiments in the plane with artificial data that can be visualized, and
(ii) experiments with the real-life problem of digit recognition.
Classifier                        Raw error, %
Human performance                  2.5
Decision tree, C4.5               16.2
Best two-layer neural network      5.9
Five-layer network (LeNet 1)       5.1
TABLE 5.1. Human performance and performance of the various learning ma-
chines, solving the problem of digit recognition on U.S. Postal Service data.
of the input space is 256. Figure 5.6 gives examples from this data-base.
Table 5.1 describes the performance of various classifiers, solving this
problem. 9
All machines constructed ten classifiers, each one separating one class from
the rest. The ten class classification was done by choosing the class with
the largest classifier output value.
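The ten-class decision by largest classifier output can be sketched as follows (the score values below are invented for illustration):

```python
import numpy as np

def one_vs_rest_predict(scores):
    """Given the raw outputs of ten one-against-the-rest classifiers
    (rows = examples, columns = classes), choose for each example the
    class whose classifier output is largest, as described in the text."""
    return np.argmax(scores, axis=1)

# hypothetical outputs of the ten classifiers for two test digits
scores = np.array([
    [-1.2, 0.3, -0.5, 2.1, -0.9, -1.0, -0.2, -1.4, 0.1, -0.7],   # -> class 3
    [ 0.9, -0.3, -0.8, -1.1, -0.6, -0.4, -1.3, -0.2, -0.5, 0.4], # -> class 0
])
print(one_vs_rest_predict(scores))  # [3 0]
```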
The results of these experiments are given in Table 5.2. For different types of SV machines, Table 5.2 shows the best parameters for the machines (column 2), the average (per classifier) number of support vectors, and the performance of the machine.
FIGURE 5.6. Examples of patterns (with labels) from the U.S. Postal Service
database.
5.7. Experiments with SV Machines 145
TABLE 5.3. Total number (in ten classifiers) of support vectors for various SV
machines and percentage of common support vectors.
Note that for this problem, all types of SV machines demonstrate approximately the same performance. This performance is better than the performance of any other type of learning machine solving the digit recognition problem by constructing the entire decision rule on the basis of the U.S. Postal Service database.11
11 Note that using the local approximation approach described in Section 5.7 (which does not construct the entire decision rule but approximates the decision rule at any point of interest) one can obtain a better result: a 3.3% error rate (L. Bottou and V. Vapnik, 1992).
The best result for this database, 2.7%, was obtained by P. Simard, Y. LeCun, and J. Denker without using any learning methods. They suggested a special method of elastic matching with 7200 templates using a smart concept of distance (the so-called tangent distance) that takes into account invariance with respect to small translations, rotations, distortions, and so on (P. Simard, Y. LeCun, and J. Denker, 1993).
146 5. Constructing Learning Algorithms
        Poly   RBF    NN
Poly    100     84    94
RBF      87    100    88
NN       91     82   100
TABLE 5.4. Percentage of common (total) support vectors for two SV machines.
[Figure: four digit images, labeled 4, 4, 8, 5.]
FIGURE 5.7. Labeled examples of training errors for the second degree polyno-
mials.
Note that the number of support vectors increases slowly with the degree
of the polynomials. The seventh degree polynomial has only 50% more support vectors than the third degree polynomial.12
The dimensionality of the feature space for a seventh degree polynomial is, however, 10¹⁰ times larger than the dimensionality of the feature space for a third degree polynomial classifier. Note that the performance does not change significantly with increasing dimensionality of the space, indicating no overfitting problems.
To choose the degree of the best polynomial for one specific classifier we estimate the VC dimension (using the estimate [R²A²]) for all constructed polynomials (from degree two up to degree seven) and choose the one with the smallest estimate of the VC dimension. In this way we found the ten best classifiers (with different degrees of polynomials) for the ten two-class problems. These estimates are shown in Fig. 5.8, where for all ten two-class decision rules the estimated VC dimension is plotted versus the degree of the polynomials. The question is:
12 The relatively high number of support vectors for the linear separator is due to nonseparability: the number 282 includes both support vectors and misclassified data.
[Figure: two plots, "Digits 0-4" and "Digits 5-9", showing the estimated VC dimension (roughly 400 to 3200) versus the degree of polynomial (2 to 7), with one curve per digit.]
FIGURE 5.8. The estimate of the VC dimension of the best element of the struc-
ture (defined on the set of canonical hyperplanes in the corresponding feature
space) versus the degree of polynomial for various two-class digit recognition
problems (denoted digit versus the rest).
5.8. Remarks on SV Machines 149
Number of test errors: the number of test errors, using the constructed
polynomial of corresponding degree; the boxes show the number of
errors for the chosen polynomial.
Thus, Table 5.5 shows that for the SV polynomial machine there are no
overfitting problems with increasing degree of polynomials, while Table 5.6
shows that even in situations where the difference between the best and
the worst solutions is small (for polynomials starting from degree two up
to degree seven), the theory gives a method for approximating the best
solutions (finding the best degree of the polynomial).
Note also that Table 5.6 demonstrates that the problem is essentially
nonlinear. The difference in the number of errors between the best polyno-
mial classifier and the linear classifier can be as much as a factor of four
(for digit 9).
f(x, a, w) = Σ_{i=1}^N a_i K(x, w_i),   (5.35)
where N is any integer (N < ℓ), a_i, i = 1, ..., N, are any scalars, and w_i, i = 1, ..., N, are any vectors. The kernel K(x, w) can be any symmetric function satisfying the conditions of Theorem 5.3.
As was demonstrated, the best guaranteed risk for these sets of functions
is achieved when the vectors of weights WI, ... W N are equal to some of the
vectors x from the training data (support vectors).
Using the set of functions
f(x, a) = Σ_{support vectors} a_i K(x, x_i)
with convolutions of polynomial, radial basis function, or neural network
type, one can approximate a continuous function to any degree of accuracy.
Note that for the SV machine one does not need to construct the archi-
tecture of the machine by choosing a priori the number N (as is necessary
in classical neural networks or in classical radial basis function machines).
Furthermore, by changing only the function K(x, w) in the SV machine
one can change the type of learning machine (the type of approximating
functions) .
(ii) SV machines minimize the upper bound on the error rate for the
structure given on a set of functions in a feature space. For the best solution
it is necessary that the vectors w_i in Eq. (5.35) coincide with some vectors of the training data (support vectors13). SV machines find the functions
from the set (5.35) that separate the training data and belong to the subset
with the smallest bound of the VC dimension. (In the more general case
they minimize the bound of the risk (5.1).)
(iii) Finally, to find the desired function, the SV machine has to maximize a non-positive quadratic form in the non-negative quadrant. This problem is a particular case of a special quadratic programming problem: to maximize a non-positive quadratic form Q(x) with bounded constraints
a_i ≤ x_i ≤ b_i,   i = 1, ..., n,
where x_i, i = 1, ..., n, are the coordinates of the vector x, and a_i, b_i are given constants. For this specific quadratic programming problem fast algorithms exist.
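A toy version of such a box-constrained maximization, using projected gradient ascent. This is a stand-in sketch, not one of the fast special-purpose algorithms mentioned above, and the quadratic form and box below are made up for illustration:

```python
import numpy as np

def project(x, lo, hi):
    """Project x onto the box lo_i <= x_i <= hi_i."""
    return np.clip(x, lo, hi)

def maximize_box_qp(A, c, lo, hi, lr=0.1, steps=2000):
    """Maximize Q(x) = c.x - 0.5 x.A.x over the box by projected
    gradient ascent. A is positive semidefinite, so the quadratic
    part of Q is non-positive and Q is concave."""
    x = project(np.zeros_like(c), lo, hi)
    for _ in range(steps):
        grad = c - A @ x          # gradient of Q at x
        x = project(x + lr * grad, lo, hi)
    return x

A = np.array([[2.0, 0.0], [0.0, 1.0]])
c = np.array([1.0, 5.0])
lo, hi = np.zeros(2), np.ones(2)
x = maximize_box_qp(A, c, lo, hi)
print(np.round(x, 3))  # close to [0.5, 1.0]
```

The unconstrained maximizer is (0.5, 5.0); the box clips the second coordinate to its upper bound, which is exactly the behavior the projection step enforces.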
Using the ERM inductive principle and this loss-function one obtains a
function that gives the best least squares approximation to the data. Un-
der conditions where y is the result of measuring a function with normal
additive noise (see Section 1.7.3) (and for the ERM principle) this loss-
function provides also the best approximation to the regression.
It is known, however, that if the additive noise is generated by another law, the best approximation to the regression (for the ERM principle) is provided by another loss-function (associated with this law).
In 1964 Huber developed a theory that allows us to define the best loss-
function for the problem of regression estimation on the basis of the ERM
principle if one has only general information about the model of the noise. In
particular, he showed that if one only knows that the density p(x) describing the noise is a symmetric convex function possessing second derivatives, then the best minimax approximation to the regression (the best approximation for the worst density of this type) is provided by the loss-function
L(y, f(x, a)) = |y − f(x, a)|.   (5.37)
Minimizing the empirical risk with respect to this loss-function is called the least modulus method. It defines the so-called robust regression function.
We consider a slightly more general type of loss-function than (5.37), the
so-called linear loss-function with insensitive zone:
|y − f(x, a)|_ε = ε,   if |y − f(x, a)| ≤ ε,
|y − f(x, a)|_ε = |y − f(x, a)|,   otherwise.   (5.38)
f(x,a) = (w· x) + b.
(ii) One defines the problem of regression estimation as the problem of risk minimization with respect to the ε-insensitive (ε ≥ 0) loss-function (5.38).
(iii) One minimizes the risk using the SRM principle, where elements of the structure S_n are defined by the inequality
(w · w) ≤ c_n.
The empirical risk is measured by
R_emp(w, b) = (1/ℓ) Σ_{i=1}^ℓ |y_i − (w · x_i) − b|_ε.
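The loss (5.38) and the corresponding empirical risk can be sketched directly. One-dimensional inputs and a hand-made data set, for illustration only:

```python
def eps_insensitive(d, eps):
    """Linear loss with insensitive zone, as in Eq. (5.38):
    equal to eps inside the zone, |d| outside it."""
    return eps if abs(d) <= eps else abs(d)

def emp_risk(xs, ys, w, b, eps):
    """Empirical risk R_emp(w, b) for the linear model f(x) = w*x + b
    (scalar inputs here, for simplicity)."""
    ell = len(xs)
    return sum(eps_insensitive(y - w * x - b, eps)
               for x, y in zip(xs, ys)) / ell

xs, ys = [0.0, 1.0, 2.0], [0.05, 1.2, 1.95]
print(emp_risk(xs, ys, w=1.0, b=0.0, eps=0.1))
```

Two of the three residuals fall inside the insensitive zone and contribute only ε each; the third contributes its full absolute value.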
14 This is an extreme case where one has minimal information about an unknown density. Huber also described the intermediate cases where the unknown density is a mixture of some given density and any density from a described set of densities, taken in proportions ε and 1 − ε (Huber, 1964).
5.9. SV Machines for the Regression Estimation Problem 153
(5.40)
under the constraints
Σ_{i=1}^ℓ α_i* = Σ_{i=1}^ℓ α_i,
0 ≤ α_i* ≤ 1,   0 ≤ α_i ≤ 1,   i = 1, ..., ℓ,
C* ≥ 0.
As in pattern recognition, here only some of the parameters in the expansion
β_i = (α_i* − α_i)/C*,   i = 1, ..., ℓ,   (5.43)
differ from zero. They define the support vectors of the problem.
2. One can reduce the convex optimization problem of finding the vec-
tor w to a quadratic optimization problem, if instead of minimizing the
functional (5.40), subject to constraints (5.41) and (5.39), one minimizes
(with given value C) subject to constraints (5.41). In this case, to find the desired vector
w = Σ_{i=1}^ℓ (α_i* − α_i) x_i,
one has to find the coefficients α_i*, α_i, i = 1, ..., ℓ, that maximize the quadratic form
e e e
W(o:,o:*) = -c; L(O:;+O:i) + LYi(O:;-O:i) - ~ L (O:;-O:i)(o:j-O:j)(Xi'Xj)
i=l i=l i,j=l
(5.47)
subject to constraints
Σ_{i=1}^ℓ α_i* = Σ_{i=1}^ℓ α_i,
0 ≤ α_i* ≤ C,   i = 1, ..., ℓ,
0 ≤ α_i ≤ C,   i = 1, ..., ℓ.
As in the pattern recognition case, the solution to these two optimization
problems coincide if C = C*.
One can show that for any i = 1, ..., ℓ the equality
α_i α_i* = 0
holds true. Therefore, for the particular case where ε = 0 and y_i ∈ {−1, 1} the considered optimization problems coincide with those described for pattern recognition in Section 5.5.1.
To derive the bound on the generalization of the SV machine, suppose that the distribution F(x, y) = F(y|x)F(x) is such that for any fixed w, b the corresponding distribution of the random variable |y − (w · x) − b|_ε has a "light tail" (see Section 3.4):
a(p) = √( (p − 1)^{p−1} / (p − 2)^{p−2} )
and
ε = 4 ( h_n(ln(2ℓ/h_n) + 1) − ln(η/4) ) / ℓ.
Here h_n is the VC dimension of the set of functions
S_n = { |y − (w · x) − b|_ε : (w · w) ≤ c_n }.
where β_i, i = 1, ..., N, are scalars, v_i, i = 1, ..., N, are vectors, and K(·, ·) is a given function satisfying Mercer's conditions, is analogous to constructing a linear approximation. It can be conducted both by solving a convex optimization problem and by solving a quadratic optimization problem.
1. Using the convex optimization approach one evaluates the coefficients β_i, i = 1, ..., ℓ, in (5.48) as
Σ_{i=1}^ℓ α_i* = Σ_{i=1}^ℓ α_i
and to the constraints
0 ≤ α_i* ≤ 1,   i = 1, ..., ℓ,
0 ≤ α_i ≤ 1,   i = 1, ..., ℓ,
and
C* ≥ 0.
Σ_{i=1}^ℓ α_i* = Σ_{i=1}^ℓ α_i
and to the constraints
0 ≤ α_i* ≤ C,   i = 1, ..., ℓ,
0 ≤ α_i ≤ C,   i = 1, ..., ℓ.
Therefore from the formal point of view it seems that there should be
no question as to what type of machine should be used for solving real-life
problems.
(i) LeNet 4,
(ii) Polynomial SV machine (polynomial of degree four)
provided the same performance: 1.1% test error.16
The local learning approach and tangent distance matching to 60,000 templates also gave the same performance: 1.1% test error.
Recall that for the small (U.S. Postal Service) database the best result (by far) was obtained by the tangent distance matching method, which uses a priori information about the problem (incorporated in the concept of tangent distance). As the number of examples increased to 60,000, the advantage of a priori knowledge decreased. The advantage of the local learning approach also decreased with the increasing number of observations.
LeNet 4, crafted for the NIST database, demonstrated a remarkable improvement in performance compared to LeNet 1 (which has a 1.7% test error rate on the NIST database17).
The standard polynomial SV machine also did a good job. We continue the quotation (Léon Bottou et al., 1994):
"The SV machine has excellent accuracy, which is most remark-
able, because unlike the other high performance classifiers it
does not include knowledge about the geometry of the problem.
In fact this classifier would do just as well if the image pixel
were encrypted, e.g., by a fixed random permutation."
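The claim in the quotation can be checked directly: any kernel built from the inner product is unchanged when the same fixed permutation (the "encryption") is applied to both pixel vectors. A small sketch with random stand-in "images":

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=256), rng.normal(size=256)  # two "images" as pixel vectors
perm = rng.permutation(256)                        # a fixed "encryption" of the pixels

# a polynomial kernel depends on the pixels only through the inner product,
# which is invariant under a common permutation of the coordinates
d = 4
assert np.isclose((1.0 + x @ y) ** d, (1.0 + x[perm] @ y[perm]) ** d)
print("kernel value unchanged by pixel permutation")
```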
However, the performance achieved by these learning machines is not
the record for the NIST database. Using models of characters (the same
that was used for constructing the tangent distance) and 60,000 examples
of training data, H. Drucker, R. Schapire, and P. Simard generated more
than 1,000,000 examples which they used to train three LeNet 4 neural
networks, combined in the special "boosting scheme" (Drucker, Schapire,
and Simard, 1993) which achieved a 0.7% error rate.
Now the SV machines have a challenge: to cover this gap (between 1.1% and 0.7%). Probably the use of brute force SV machines and 60,000 training examples alone will not be sufficient to cover the gap. Probably one has to incorporate some a priori information about the problem at hand.
There are several ways to do this. The simplest one is to use the same 1,000,000 examples (constructed from the 60,000 NIST prototypes). However, it is more interesting to find a way of directly incorporating the invariants that were used for generating the new examples. For example, for polynomial machines one can incorporate a priori information about invariance by using a convolution of the inner product in the form (xᵀAx*)^d, where x and x* are input vectors and A is a symmetric positive definite matrix reflecting the invariants of the models.18
One can also incorporate another (geometrical) type of a priori information by using only features (monomials) x_i x_j x_k formed by pixels that are close to each other (this reflects our understanding of the geometry of the problem: important features are formed by pixels that are connected to each other, rather than by pixels far from each other). This essentially reduces (by a factor of millions) the dimensionality of the feature space.
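The size of this reduction can be estimated by counting: for a 16 × 16 image, all degree-3 monomials number C(256, 3), while monomials restricted to pixels lying in a common small window are far fewer. A sketch (the 3 × 3 window is an illustrative choice of "closeness"):

```python
from itertools import combinations
from math import comb

side = 16                       # 16 x 16 pixel image, 256 pixels
n = side * side

def neighbors(i, j, radius=1):
    """Pixels within a small window around (i, j)."""
    return [(a, b)
            for a in range(max(0, i - radius), min(side, i + radius + 1))
            for b in range(max(0, j - radius), min(side, j + radius + 1))]

# count degree-3 monomials x_i x_j x_k restricted to pixels that are close
# to each other (all three inside one small window)
local = set()
for i in range(side):
    for j in range(side):
        for triple in combinations(sorted(neighbors(i, j)), 3):
            local.add(triple)

full = comb(n, 3)               # all degree-3 monomials: C(256, 3)
print(full, len(local), full // len(local))
```

Even at degree three the restriction cuts the feature count by a large factor; for the higher-degree polynomials used in the experiments the reduction is correspondingly much greater.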
Thus, although the theoretical foundations of Support Vector machines look more solid than those of Neural Networks, the practical advantages of the new type of learning machine still need to be proved.18a
According to the SRM principle, the structure has to be defined a priori, before the training data appear.
The attempt to implement the SRM principle in toto brings us to a new
statement of the learning problem which forms a new type of inference. For
simplicity we consider this model for the pattern recognition problem.
Let the learning machine that implements a set of functions linear in feature space be given ℓ + k vectors
x_1, ..., x_ℓ, x_{ℓ+1}, ..., x_{ℓ+k},   (5.49)
where the first ℓ vectors come with their classifications (training set) and the last k are those for which the classification string should be found by the machine (test set). The goal of the machine is to find the rule that gives the string with the minimal number of errors on the given test set.
In contrast to the model of function estimation considered in this book,
this model looks for the rule that minimizes the number of errors on the
given test set rather than for the rule minimizing the probability of error
on the admissible test set. We call this problem the estimation of the values
of the function at given points. For the problem of estimating the values
of function at given points the SV machines will realize the SRM principle
in toto if one defines the canonical hyperplanes with respect to all l + k
vectors (5.49). (One can consider the data (5.49) as a priori information.
A posteriori information is any information about separating this set into
two subsets.)
Estimating the values of a function at given points has both a solution
and a method of solution, that differ from those based on estimating an
unknown function.
Consider, for example, the five-digit zip-code recognition problem.19 The existing technology based on estimating functions suggests recognizing the five digits x_1, ..., x_5 of the zip-code independently: first one uses the rules constructed during the learning procedures to recognize digit x_1, then one uses the same rules to recognize digit x_2, and so on.
20 Note that the local learning approach described in Section 4.5 can be considered as an intermediate model between function estimation and estimation of the values of a function at points of interest. Recall that for the small (Postal Service) database the local learning approach gave significantly better results (3.3% error rate) than the best result based on the entire function estimation approach (5.1% obtained by LeNet 1, and 4.0% obtained by the polynomial SV machine).
5.12. What Can One Learn from Digit Recognition Experiments? 163
21Note that the thesis does not reflect some proved fact. It reflects the belief
in the existence of some law that is hard to prove (or formulate in exact terms).
164 5. Informal Reasoning and Comments - 5
as the value of the constant γ. (Recall that the kernel K(u) in Parzen's estimator is determined by the functional Ω(u), and γ is determined by the regularization constant.)
The same was observed in the regression estimation problem where one
tries to use expansions in different series to estimate the regression function:
if the number of observations is not "very small" the type of series used is
not as important as the number of terms in the approximation. All these
observations were done solving low-dimensional (mostly one-dimensional)
problems.
In the described experiments we observed the same phenomena in very
high-dimensional space.
where
22 After this book had been completed, C. Burges demonstrated that one can
approximate the obtained decision rule
using the so-called generalized support vectors T_1, ..., T_M (a specially constructed set of vectors).
To obtain approximately the same performance for the digit recognition problem described in Section 5.7, it was sufficient to use an approximation based on M = 11 generalized support vectors per classifier instead of the N = 270 (initially obtained) support vectors per classifier.
This means that for Support Vector machines there exists a regular way to
synthesize the decision rules possessing the optimal complexity.
x_1*, ..., x_k*,
the one which belongs to the first class with highest probability (decision
making problem in the pattern recognition form!). To solve this problem
one does not need to even estimate the values of the function at all given
points; therefore it can be solved in situations where one does not have
enough information (not enough training data) to estimate the value of a
function at given points.
The key to the solution of these problems is the following observation,
which for simplicity we will describe for the pattern recognition problem.
The learning machine (with a set of indicator functions Q(z, α), α ∈ Λ) is simultaneously given two strings: the string of ℓ + k vectors x from the training and the test sets, and the string of ℓ values y from the training set. In pattern classification the goal of the machine is to define the string containing the k values of y for the test data.
For the problem of estimating the values of a function at the given points, the set of functions implemented by the learning machine can be factorized into a finite set of equivalence classes. (Two indicator functions fall into the same equivalence class if they coincide on the string x_1, ..., x_{ℓ+k}.) These equivalence classes can be characterized by their cardinality (how many functions they contain).
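The factorization into equivalence classes can be sketched on a toy example: threshold functions on the line, evaluated on a short string of points and grouped by the labelings they produce (all numbers below are made up for illustration):

```python
from collections import Counter

points = [0.5, 1.5, 2.5, 3.5]                    # the string x_1, ..., x_{l+k}
thresholds = [i / 10 for i in range(-10, 51)]    # a finite sample of indicator functions

def labeling(t):
    """The string of values of f_t(x) = [x > t] on the given points."""
    return tuple(int(x > t) for x in points)

# two functions are equivalent iff they coincide on the string;
# the Counter records the cardinality of each equivalence class
classes = Counter(labeling(t) for t in thresholds)
print(len(classes))          # 5 equivalence classes (here l+k+1)
print(sorted(classes.values(), reverse=True))    # their cardinalities
```

Although the sampled set contains 61 functions, only five of them are distinguishable on the given string, and the classes have very different cardinalities, which is exactly the extra concept the transductive theory exploits.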
The cardinality of equivalence classes is a concept that makes the theory
of estimating the function at the given points differ from the theory of
estimating the function. This concept (as well as the theory of estimating
the function at given points) was considered in the 1970s (Vapnik, 1979).
For the set of linear functions it was found that the bound on generalization
ability, in the sense of minimizing the number of errors only on the given
1 Or to find the one that with the highest probability possesses the largest value of y (decision making in regression form).
6.1. What is Important in the Setting of the Problem? 169
[Diagram: "Examples", "Approximating function", and "Values of the function at points of interest", connected by arrows for induction, deduction, and transduction.]
FIGURE 6.1. Different types of inference. Induction, deriving the function from
the given data. Deduction, deriving the values of the given function for points of
interest. Transduction, deriving the values of the unknown function for points of
interest from the given data. The classical scheme suggests deriving the values of the unknown function for points of interest in two steps: first using the inductive step, and then using the deductive step, rather than obtaining the direct solution in one step.
test data (along with the factors considered in this book), depends also on a new factor: the cardinality of equivalence classes. Therefore, since to minimize a risk one can minimize the obtained bound over a larger number of factors, one can find a lower minimum. Now the problem is to construct a general theory for estimating a function at the given points. This brings us to a new concept of learning.
Classical philosophy usually considers two types of inference: deduction,
describing the movement from general to particular, and induction, describ-
ing the movement from particular to general.
The model of estimating the value of a function at a given point of
interest describes a new concept of inference: moving from particular to
particular. We call this type of inference transductive inference. (Fig. 6.1)
Note that this concept of inference appears when one would like to get
the best result from a restricted amount of information. The main idea in
this case was described in Section 1.9 as follows:
This part of the theory is well developed. It answers almost all questions
toward understanding the conceptual model of learning processes realizing
the ERM principle. The only remaining open question is the necessary
and sufficient conditions for a fast rate of convergence. In Chapter 2 we
considered the sufficient condition, described using the annealed entropy,
lim_{ℓ→∞} H_ann(ℓ)/ℓ = 0,
for the pattern recognition case. It also can be shown that the conditions
lim_{ℓ→∞} H_ann(ε, ℓ)/ℓ = 0,   ∀ε > 0,
N^Λ(z_1, ..., z_ℓ) ≤ (eℓ/h)^h.
Note that the polynomial on the right-hand side depends on one free parameter h. This bound (which depends on one capacity parameter) cannot be improved (there exist examples where equality is achieved).
The challenge is to find refined concepts containing more than one parameter (say two parameters) that describe some properties of capacity (and of the set of distribution functions F(z) ∈ P), using which one can obtain better bounds.3
This is a very important question, and the answer would have immediate
impact on the bounds of the generalization ability of learning machines.
2In 1972 this bound was also published by Sauer (Sauer, 1972).
3Recall the MDL bound: even such a refined concept as the coefficient of
compression provides a worse bound than one based on three (actually rough)
concepts such as the value of the empirical risk, the number of observations, and
the number of functions in a set.
172 Conclusion: What is Important in Learning Theory?
The most important problem in the theory for controlling the generaliza-
tion ability of learning machines is finding a new inductive principle for
small sample sizes. In the mid-1970s, several techniques were suggested to
improve the classical methods of function estimation. Among these are the
various rules for choosing the degree of a polynomial in the polynomial
regression problem, various regularization techniques for multidimensional
regression estimation, the regularization method for solving ill-posed prob-
lems, etc. All these techniques are based on the same idea: to provide the
set of functions with a structure and then minimize the risk along the el-
ements of the structure. In the 1970s the crucial role of capacity control was discovered. We call this general idea SRM to stress the importance of minimizing the risk within the elements of the structure.
In SRM, one tries to control simultaneously two parameters: the value
of the empirical risk and the capacity of the element of the structure.
In the 1970s the MDL principle was proposed. Using this principle, one
can control the coefficient of compression.
The most important question is:
Does there exist a new inductive principle for estimating dependency from
small sample sizes?
In studies of inductive principles it is crucial to find new concepts which
affect the bounds of the risk, and which therefore can be used in mini-
mizing these bounds. To use an additional concept, we introduced a new
statement of the learning problem: the local risk minimization problem.
In this statement, in the framework of the SRM principle, one can control
three parameters: empirical risk, capacity, and locality.
In the problem of estimating the values of a function at the given points
one can use an additional concept: the cardinality of equivalence classes. This
aids in controlling the generalization ability: by minimizing the bound over
four parameters, one can get smaller minima than by minimizing the bound
over fewer parameters. The problem is to find a new concept which can
affect the upper bound of the risk. This will immediately lead to a new
learning procedure, and even to a new type of reasoning (as in the case of
transductive inference).
Finally, it is important to find new structures on the set of functions. It
is interesting to find structures with elements containing functions which
are described by large numbers of parameters, but nevertheless have low
VC dimension. We have found only one such structure and this brought us
to SV machines. New structures of this kind will probably result in new
types of learning machines.
6.5. What is Important in the Theory for Constructing Learning Algorithms? 173
The algorithms for learning should be well controlled. This means that one
has to control two main parameters responsible for generalization ability:
the value of the empirical risk and the VC dimension of the smallest element
of the structure that contains a chosen function.
The SV technique can be considered as an effective tool for control-
ling these two parameters if structures are defined on the sets of linear
functions in some high dimensional feature space. This technique is not
restricted only to the sets of indicator functions (for solving pattern recog-
nition problems). At the end of Chapter 5 we described the generalization
of the SV method for solving regression problems. In the framework of
this generalization using a special convolution function one can construct
high dimensional spline-functions belonging to the subset of splines with
a chosen VC dimension. Using different convolution functions for the inner product one can also construct different types of functions nonlinear in input space.4
Moreover, the SV technique goes beyond the framework of learning the-
ory. It admits a general point of view as a new type of parametrization of
sets of functions.
The point is that in solving function estimation problems both in
computational statistics (say, pattern recognition, regression, density
estimation) and in computational mathematics (say, obtaining approximations
to the solutions of multidimensional (operator) equations of different types)
the first step is describing (parametrizing) a set of functions in which one
looks for a solution.
In the first half of this century the main idea of parametrization (after the
Weierstrass theorem) was series expansion. However, even in the one-dimensional
case one sometimes needs a few dozen terms for an accurate function
approximation, and for many problems the accuracy of existing computers can be
insufficient to treat such a series.
Therefore, in the middle of the 1950s a new type of function parametrization
was suggested, the so-called spline functions (piecewise polynomial
functions). This type of parametrization allowed us to get an accurate solution
y = \sum_{j=1}^{N} \alpha_j K((x \cdot w_j)) + b,

where α_1, ..., α_N are scalars and w_1, ..., w_N are vectors.
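As a minimal sketch (not from the book), the expansion above can be evaluated directly; the particular convolution function K, the coefficients, and the vectors used here are illustrative assumptions, not fitted values:

```python
# Sketch: evaluate an SV-type expansion
#   y = sum_j alpha_j * K(x . w_j) + b
# with a hypothetical polynomial convolution (kernel) function.

def inner(x, w):
    """Inner product of two equal-length vectors."""
    return sum(xi * wi for xi, wi in zip(x, w))

def K(t, degree=2):
    """Polynomial convolution function applied to an inner product t."""
    return (t + 1) ** degree

def sv_expansion(x, alphas, ws, b, degree=2):
    """Evaluate y = sum_j alpha_j * K(x . w_j) + b."""
    return sum(a * K(inner(x, w), degree) for a, w in zip(alphas, ws)) + b

# Illustrative parameters (assumed, not trained on any data).
alphas = [0.5, -0.25]
ws = [[1.0, 0.0], [0.0, 1.0]]
b = 0.1
y = sv_expansion([2.0, 3.0], alphas, ws, b)
```

The expansion is parametrized not by explicit feature-space coordinates but by the vectors w_j and scalars α_j, which is the point of the new parametrization.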
The learning problem belongs to the problems of natural science: there ex-
ists a phenomenon for which one has to construct a model. In the attempts
to construct this model, theoreticians can choose one of two different po-
sitions depending on which part of Hegel's formula (describing the general
philosophy of nature) they prefer:
5 In Hegel's original assertion, the meaning of the words "real" and "rational"
does not coincide with the common meaning of these words. Nevertheless,
according to a remark of B. Russell, the identification of the real and the
rational in a common sense leads to the belief that "whatever is, is right."
Russell did not accept this idea (see B. Russell, A History of Western
Philosophy). However, we do interpret Hegel's formula as: "whatever exists, is
right and whatever is right, exists."
are good models of real brains, then the goal of the theoretician is to prove
that this model is rational.
Suppose that the theoretician considers the model to be "rational" if it
possesses some remarkable asymptotic properties. In this case, the theo-
retician succeeds if he or she proves (as has been done) that the learning
process in neural networks asymptotically converges to local extrema and
that a sufficiently large neural network can approximate well any smooth
function. The conceptual part of such a theory will be complete if one can
prove that the achieved local extremum is close to the global one.
The second position is a heavier burden for the theoretician: the theo-
retician has to define what a rational model is, then has to find this model,
and finally must convince the experimenters to prove that this model is
real (describes reality).
Probably, a rational model is one that not only has remarkable asymp-
totic properties but also possesses some remarkable properties in dealing
with a given finite number of observations.6 In this case, the small sample
size philosophy is a useful tool for constructing rational models.
The rational models can be so unusual that one needs to overcome prej-
udices of common sense in order to find them. For example, we saw that
the generalization ability of learning machines depends on the VC dimen-
sion of the set of functions, rather than on the number of parameters that
define the functions within a given set. Therefore, one can construct
high-degree polynomials in high-dimensional input space with good
generalization ability. Without a theory for controlling the generalization
ability this opportunity would not be apparent. Now the experimenters have to
answer the question: Does generalization, as performed by real brains, include
mechanisms similar to the technology of support vectors?7
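The observation that generalization ability is governed by the VC dimension rather than the raw number of parameters can be made concrete with a small sketch (an illustration under stated assumptions, not from the book): a polynomial convolution of the inner product evaluates, in O(n) operations, an inner product in a feature space whose explicit dimension is astronomical.

```python
# Sketch: a degree-d polynomial kernel (x . z + 1)^d computes the inner
# product in a feature space of dimension C(n + d, d) using only O(n)
# operations. The number of explicit parameters can thus be enormous
# while the capacity of the machine is controlled separately.

from math import comb

def poly_kernel(x, z, degree):
    """Polynomial convolution of the inner product of x and z."""
    return (sum(a * b for a, b in zip(x, z)) + 1) ** degree

n, d = 256, 5                  # input dimension and polynomial degree (assumed)
explicit_dim = comb(n + d, d)  # number of monomial features of an explicit expansion
print(explicit_dim)            # well over 10^9 implicit parameters
```

One kernel evaluation touches only the n input coordinates, yet it is equivalent to an inner product over all C(n + d, d) monomials; controlling generalization therefore cannot reduce to counting these parameters.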
That is why the role of theory in studies of learning processes can be
more constructive than in many other branches of natural science.
This, however, depends on the choice of the general position in studies of the
learning phenomenon. The choice of position reflects one's belief as to which,
in this specific area of natural science, is the main discoverer of truth:
experiment or theory.
Remarks on References
One of the greatest mathematicians of the century, A.N. Kolmogorov, once
noted that an important difference between mathematical sciences and his-
torical sciences is that facts once found in mathematics hold forever while
the facts found in history are reconsidered by every generation of historians.
In statistical learning theory, as in mathematics, the importance of the
results obtained depends on the new facts about the learning phenomenon that
they reveal, rather than on a new description of already known facts.
Therefore, I tried to refer to the works that reflect the following sequence
of the main events in developing the statistical learning theory described in
this book:
1958-1962. Constructing the Perceptron.
1962-1964. Proving the first theorem on learning processes.
1958-1963. Discovery of nonparametric statistics.
1962-1963. Discovery of the methods for solving ill-posed prob-
lems.
1960-1965. Discovery of the Algorithmic Complexity concept and
its relation to the inductive inference.
1968-1971. Discovery of the Law of Large Numbers for the space
of indicator functions and its relation to the pattern
recognition problem.
REFERENCES
L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone (1984), Classifi-
cation and regression trees, Wadsworth, Belmont, CA.
A. Bryson, W. Denham, and S. Dreyfus (1963), "Optimal programming
problems with inequality constraints. I: Necessary conditions for extremal
solutions," AIAA Journal, 1, pp. 25-44.
F.P. Cantelli (1933), "Sulla determinazione empirica delle leggi di proba-
bilità," Giornale dell'Istituto Italiano degli Attuari, (4).
G.J. Chaitin (1966), "On the length of programs for computing finite binary
sequences," J. Assoc. Comput. Mach., 13, pp. 547-569.
V.V. Ivanov (1976), The theory of approximate methods and their appli-
cation to the numerical solution of singular integral equations, Nordhoff
International, Leyden.
M. Karpinski and T. Werther (1989), "VC dimension and uniform learnabil-
ity of sparse polynomials and rational functions," SIAM J. Computing,
to appear. (Preprint 8537-CS, Bonn University, 1989.)
A.N. Kolmogoroff (1933), "Sulla determinazione empirica di una legge di
distribuzione," Giornale dell'Istituto Italiano degli Attuari, (4).
A.N. Kolmogorov (1933), Grundbegriffe der Wahrscheinlichkeitsrechnung,
Springer.
(English translation: A.N. Kolmogorov (1956), Foundations of the Theory
of Probability, Chelsea.)
A.N. Kolmogorov (1965), "Three approaches to the quantitative definition
of information," Problems of Inform. Transmission, 1, (1), pp. 1-7.
Y. LeCun (1986), "Learning processes in an asymmetric threshold net-
work," Disordered systems and biological organization, Les Houches,
France, Springer, pp. 233-240.
Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hub-
bard, and L.D. Jackel (1990), "Handwritten digit recognition with a back-
propagation network," Advances in Neural Information Processing Sys-
tems, 2, Morgan Kaufmann, pp. 396-404.
G.G. Lorentz (1966), Approximation of functions, Holt, Rinehart and
Winston, New York.
H.N. Mhaskar (1993), "Approximation properties of a multi-layer feed-
forward artificial neural network," Advances in Computational Mathe-
matics, 1, pp. 61-80.
C. A. Micchelli (1986), "Interpolation of scattered data: distance matrices
and conditionally positive definite functions," Constructive Approxima-
tion, 2, pp. 11-22.
A.J. Miller (1990), Subset selection in regression, Chapman and Hall,
London.
J.J. Moré and G. Toraldo (1991), "On the solution of large quadratic pro-
gramming problems with bound constraints," SIAM J. Optimization, 1,
(1), pp. 93-113.
A.B.J. Novikoff (1962), "On convergence proofs on perceptrons," Pro-
ceedings of the Symposium on the Mathematical Theory of Automata,
Polytechnic Institute of Brooklyn, Vol. XII, pp. 615-622.
D.L. Phillips (1962), "A technique for the numerical solution of certain
integral equations of the first kind," J. Assoc. Comput. Mach., 9, pp. 84-97.
K. Popper (1968), The Logic of Scientific Discovery, 2nd ed., Harper
Torchbooks, New York.