Vladimir N. Vapnik

The Nature of
Statistical Learning Theory

With 33 Illustrations

Springer
Vladimir N. Vapnik
AT&T Bell Laboratories
101 Crawfords Corner Road
Holmdel, NJ 07733 USA

Library of Congress Cataloging-in-Publication Data


Vapnik, Vladimir Naumovich.
The nature of statistical learning theory / Vladimir N. Vapnik.
p. cm.
Includes bibliographical references and index.

ISBN 978-1-4757-2442-4 ISBN 978-1-4757-2440-0 (eBook)


DOI 10.1007/978-1-4757-2440-0
Softcover reprint of the hardcover 1st edition 1995
1. Computational learning theory. 2. Reasoning. I. Title.
Q325.7.V37 1995
006.3'1'015195-dc20 95-24205

Printed on acid-free paper.

© 1995 Springer Science+Business Media New York


Originally published by Springer-Verlag New York, Inc. in 1995.

All rights reserved. This work may not be translated or copied in whole or in part without the
written permission of the publisher (Springer Science+Business Media, LLC), except for brief
excerpts in connection with reviews or scholarly analysis. Use in connection with any form of
information storage and retrieval, electronic adaptation, computer software, or by similar or
dissimilar methodology now known or hereafter developed is forbidden.
The use of general descriptive names, trade names, trademarks, etc., in this publication, even
if the former are not especially identified, is not to be taken as a sign that such names, as
understood by the Trade Marks and Merchandise Marks Act, may accordingly be used freely
by anyone.

Production coordinated by Bill Imbornoni; manufacturing supervised by Joseph Quatela.


Photocomposed copy prepared from the author's LaTeX file.

987654321
In memory of my mother
Preface

Between 1960 and 1980 a revolution in statistics occurred: Fisher's
paradigm, introduced in the 1920s-1930s, was replaced by a new one. This
paradigm reflects a new answer to the fundamental question:
What must one know a priori about an unknown functional dependency
in order to estimate it on the basis of observations?
In Fisher's paradigm the answer was very restrictive - one must know
almost everything. Namely, one must know the desired dependency up to
the values of a finite number of parameters. Estimating the values of these
parameters was considered to be the problem of dependency estimation.
The new paradigm overcame the restriction of the old one. It was shown
that in order to estimate dependency from the data, it is sufficient to know
some general properties of the set of functions to which the unknown de-
pendency belongs.
Determining general conditions under which estimating the unknown
dependency is possible, describing the (inductive) principles that allow one
to find the best approximation to the unknown dependency, and finally
developing effective algorithms for implementing these principles are the
subjects of the new theory.
Four discoveries made in the 1960s led to the revolution:
(i) Discovery of regularization principles for solving ill-posed problems
by Tikhonov, Ivanov, and Phillips.
(ii) Discovery of nonparametric statistics by Parzen, Rosenblatt, and
Chentsov.

(iii) Discovery of the law of large numbers in functional space and its
relation to the learning processes by Vapnik and Chervonenkis.

(iv) Discovery of algorithmic complexity and its relation to inductive
inference by Kolmogorov, Solomonoff, and Chaitin.

These four discoveries also form a basis for any progress in the studies of
learning processes.
The problem of learning is so general that almost any question that
has been discussed in statistical science has its analog in learning theory.
Furthermore, some very important general results were first found in the
framework of learning theory and then reformulated in terms of statis-
tics.
In particular, learning theory for the first time stressed the problem of
small sample statistics. It was shown that by taking into account the size
of the sample one can obtain better solutions to many problems of function
estimation than by using the methods based on classical statistical tech-
niques.
Small sample statistics in the framework of the new paradigm constitutes
an advanced subject of research both in statistical learning theory and in
theoretical and applied statistics. The rules of statistical inference devel-
oped in the framework of the new paradigm should not only satisfy the
existing asymptotic requirements but also guarantee that one does one's
best in using the available restricted information. The results of this theory
are new methods of inference for various statistical problems.
To develop these methods (that often contradict intuition), a compre-
hensive theory was built that includes:
(i) Concepts describing the necessary and sufficient conditions for con-
sistency of inference.
(ii) Bounds describing the generalization ability of learning machines
based on these concepts.

(iii) Inductive inference for small sample sizes, based on these bounds.
(iv) Methods for implementing this new type of inference.

Two difficulties arise when one tries to study statistical learning theory:
a technical one and a conceptual one - to understand the proofs and to
understand the nature of the problem, its philosophy.
To overcome the technical difficulties one has to be patient and persistent
in following the details of the formal inferences.
To understand the nature of the problem, its spirit, and its philosophy,
one has to see the theory as a whole, not only as a collection of its different
parts. Understanding the nature of the problem is extremely important
because it leads to searching in the right direction for results and prevents
searching in wrong directions.
The goal of this book is to describe the nature of statistical learning the-
ory. I would like to show how the abstract reasoning implies new algorithms.
To make the reasoning easier to follow, I made the book short.
I tried to describe things as simply as possible but without conceptual
simplifications. Therefore the book contains neither details of the theory
nor proofs of the theorems (both can be found, partly, in my 1982 book
Estimation of Dependencies Based on Empirical Data, Springer, and, in
full, in my forthcoming book Statistical Learning Theory, J. Wiley, 1996).
However, to describe the ideas without simplifications I needed to introduce
new concepts (new mathematical constructions), some of which are
nontrivial.

The book contains an introduction, five chapters, informal reasoning and
comments on the chapters, and a conclusion.
The introduction describes the history of the study of the learning prob-
lem, which is not as straightforward as one might think from reading the
main chapters.
Chapter 1 is devoted to the setting of the learning problem. Here the
general model of minimizing the risk functional from empirical data is in-
troduced.
Chapter 2 is probably both the most important one for understanding
the new philosophy and the most difficult one for reading. In this chapter,
the conceptual theory of learning processes is described. This includes the
concepts that allow construction of the necessary and sufficient conditions
for consistency of the learning process.
Chapter 3 describes the nonasymptotic theory of bounds on the conver-
gence rate of the learning processes. The theory of bounds is based on the
concepts obtained from the conceptual model of learning.
Chapter 4 is devoted to a theory of small sample sizes. Here we introduce
inductive principles for small sample sizes that can control the generaliza-
tion ability.
Chapter 5 describes, along with classical neural networks, a new type of
universal learning machine that is constructed on the basis of small sample
sizes theory.
Comments on the chapters are devoted to describing the relations be-
tween classical research in mathematical statistics and research in learning
theory.
In the conclusion some open problems of learning theory are discussed.
The book is intended for a wide range of readers: students, engineers, and
scientists of different backgrounds (statisticians, mathematicians, physi-
cists, computer scientists). Its understanding does not require knowledge
of special branches of mathematics; nevertheless, it is not easy reading, since
the book does describe a (conceptual) forest even if it does not consider
the (mathematical) trees.
In writing this book I had one more goal in mind: I wanted to stress the
practical power of abstract reasoning. The point is that during the last few
years at different computer science conferences, I heard repetitions of the
following claim:
Complex theories do not work, simple algorithms do.
One of the goals of this book is to show that, at least in the problems
of statistical inference, this is not true. I would like to demonstrate that in
this area of science a good old principle is valid:
Nothing is more practical than a good theory.
The book is not a survey of the standard theory. It is an attempt to
promote a certain point of view not only on the problem of learning and
generalization but on theoretical and applied statistics as a whole.
It is my hope that the reader will find the book interesting and useful.

ACKNOWLEDGMENTS

This book became possible due to the support of Larry Jackel, the head of
the Adaptive System Research Department, AT&T Bell Laboratories.
It was inspired by collaboration with my colleagues Jim Alvich, Jan
Ben, Yoshua Bengio, Bernhard Boser, Leon Bottou, Jane Bromley, Chris
Burges, Corinna Cortes, Eric Cosatto, Joanne DeMarco, John Denker, Har-
ris Drucker, Hans Peter Graf, Isabelle Guyon, Donnie Henderson, Larry
Jackel, Yann LeCun, Robert Lyons, Nada Matic, Urs Mueller, Craig Nohl,
Edwin Pednault, Eduard Sackinger, Bernhard Schölkopf, Patrice Simard,
Sara Solla, Sandi von Pier, and Chris Watkins.
Chris Burges, Edwin Pednault, and Bernhard Schölkopf read various
versions of the manuscript and improved and simplified the exposition.
When the manuscript was ready I gave it to Andrew Barron, Yoshua
Bengio, Robert Berwick, John Denker, Federico Girosi, Ilia Izmailov, Larry
Jackel, Yakov Kogan, Esther Levin, Tomaso Poggio, Edward Reitman,
Alexander Shustorovich, and Chris Watkins for remarks. These remarks
also improved the exposition.
I would like to express my deep gratitude to everyone who helped make
this book.

Vladimir N. Vapnik
AT&T Bell Laboratories,
Holmdel, March 1995
Contents

Preface

Introduction: Four Periods in the Research of the Learning Problem
0.1 Rosenblatt's Perceptron (The 1960s)
0.2 Construction of the Fundamentals of Learning Theory (The 1960-1970s)
0.3 Neural Networks (The 1980s)
0.4 Returning to the Origin (The 1990s)

Chapter 1 Setting of the Learning Problem
1.1 Function Estimation Model
1.2 The Problem of Risk Minimization
1.3 Three Main Learning Problems
1.3.1 Pattern Recognition
1.3.2 Regression Estimation
1.3.3 Density Estimation (Fisher-Wald Setting)
1.4 The General Setting of the Learning Problem
1.5 The Empirical Risk Minimization (ERM) Inductive Principle
1.6 The Four Parts of Learning Theory

Informal Reasoning and Comments - 1
1.7 The Classical Paradigm of Solving Learning Problems
1.7.1 Density Estimation Problem (Maximum Likelihood Method)
1.7.2 Pattern Recognition (Discriminant Analysis) Problem
1.7.3 Regression Estimation Model
1.7.4 Narrowness of the ML Method
1.8 Nonparametric Methods of Density Estimation
1.8.1 Parzen's Windows
1.8.2 The Problem of Density Estimation is Ill-Posed
1.9 Main Principle for Solving Problems Using a Restricted Amount of Information
1.10 Model Minimization of the Risk Based on Empirical Data
1.10.1 Pattern Recognition
1.10.2 Regression Estimation
1.10.3 Density Estimation
1.11 Stochastic Approximation Inference

Chapter 2 Consistency of Learning Processes
2.1 The Classical Definition of Consistency and the Concept of Nontrivial Consistency
2.2 The Key Theorem of Learning Theory
2.2.1 Remark on the ML Method
2.3 Necessary and Sufficient Conditions for Uniform Two-Sided Convergence
2.3.1 Remark on Law of Large Numbers and its Generalization
2.3.2 Entropy of the Set of Indicator Functions
2.3.3 Entropy of the Set of Real Functions
2.3.4 Conditions for Uniform Two-Sided Convergence
2.4 Necessary and Sufficient Conditions for Uniform One-Sided Convergence
2.5 Theory of Nonfalsifiability
2.5.1 Kant's Problem of Demarcation and Popper's Theory of Nonfalsifiability
2.6 Theorems about Nonfalsifiability
2.6.1 Case of Complete (Popper's) Nonfalsifiability
2.6.2 Theorem about Partial Nonfalsifiability
2.6.3 Theorem about Potential Nonfalsifiability
2.7 Three Milestones in Learning Theory

Informal Reasoning and Comments - 2
2.8 The Basic Problems of Probability Theory and Statistics
2.8.1 Axioms of Probability Theory
2.9 Two Modes of Estimating a Probability Measure
2.10 Strong Mode Estimation of Probability Measures and the Density Estimation Problem
2.11 The Glivenko-Cantelli Theorem and its Generalization
2.12 Mathematical Theory of Induction

Chapter 3 Bounds on the Rate of Convergence of Learning Processes
3.1 The Basic Inequalities
3.2 Generalization for the Set of Real Functions
3.3 The Main Distribution Independent Bounds
3.4 Bounds on the Generalization Ability of Learning Machines
3.5 The Structure of the Growth Function
3.6 The VC Dimension of a Set of Functions
3.7 Constructive Distribution-Independent Bounds
3.8 The Problem of Constructing Rigorous (Distribution-Dependent) Bounds

Informal Reasoning and Comments - 3
3.9 Kolmogorov-Smirnov Distributions
3.10 Racing for the Constant
3.11 Bounds on Empirical Processes

Chapter 4 Controlling the Generalization Ability of Learning Processes
4.1 Structural Risk Minimization (SRM) Inductive Principle
4.2 Asymptotic Analysis of the Rate of Convergence
4.3 The Problem of Function Approximation in Learning Theory
4.4 Examples of Structures for Neural Nets
4.5 The Problem of Local Function Estimation
4.6 The Minimum Description Length (MDL) and SRM Principles
4.6.1 The MDL Principle
4.6.2 Bounds for the MDL Principle
4.6.3 The SRM and the MDL Principles
4.6.4 A Weak Point of the MDL Principle

Informal Reasoning and Comments - 4
4.7 Methods for Solving Ill-Posed Problems
4.8 Stochastic Ill-Posed Problems and the Problem of Density Estimation
4.9 The Problem of Polynomial Approximation of the Regression
4.10 The Problem of Capacity Control
4.10.1 Choosing the Degree of the Polynomial
4.10.2 Choosing the Best Sparse Algebraic Polynomial
4.10.3 Structures on the Set of Trigonometric Polynomials
4.10.4 The Problem of Features Selection
4.11 The Problem of Capacity Control and Bayesian Inference
4.11.1 The Bayesian Approach in Learning Theory
4.11.2 Discussion of the Bayesian Approach and Capacity Control Methods

Chapter 5 Constructing Learning Algorithms
5.1 Why can Learning Machines Generalize?
5.2 Sigmoid Approximation of Indicator Functions
5.3 Neural Networks
5.3.1 The Back-Propagation Method
5.3.2 The Back-Propagation Algorithm
5.3.3 Neural Networks for the Regression Estimation Problem
5.3.4 Remarks on the Back-Propagation Method
5.4 The Optimal Separating Hyperplane
5.4.1 The Optimal Hyperplane
5.4.2 The Structure of Canonical Hyperplanes
5.5 Constructing the Optimal Hyperplane
5.5.1 Generalization for the Nonseparable Case
5.6 Support Vector (SV) Machines
5.6.1 Generalization in High-Dimensional Space
5.6.2 Convolution of the Inner Product
5.6.3 Constructing SV Machines
5.6.4 Examples of SV Machines
5.7 Experiments with SV Machines
5.7.1 Example in the Plane
5.7.2 Handwritten Digit Recognition
5.7.3 Some Important Details
5.8 Remarks on SV Machines
5.9 SV Machines for the Regression Estimation Problem
5.9.1 ε-Insensitive Loss-Function
5.9.2 Minimizing the Risk Using Convex Optimization Procedure
5.9.3 SV Machine with Convolved Inner Product

Informal Reasoning and Comments - 5
5.10 The Art of Engineering Versus Formal Inference
5.11 Wisdom of Statistical Models
5.12 What Can One Learn from Digit Recognition Experiments?
5.12.1 Influence of the Type of Structures and Accuracy of Capacity Control
5.12.2 SRM Principle and the Problem of Feature Construction
5.12.3 Is the Set of Support Vectors a Robust Characteristic of the Data?

Conclusion: What is Important in Learning Theory?
6.1 What is Important in the Setting of the Problem?
6.2 What is Important in the Theory of Consistency of Learning Processes?
6.3 What is Important in the Theory of Bounds?
6.4 What is Important in the Theory for Controlling the Generalization Ability of Learning Machines?
6.5 What is Important in the Theory for Constructing Learning Algorithms?
6.6 What is the Most Important?

References
Remarks on References
References

Index
Introduction:
Four Periods in the Research of the
Learning Problem

In the history of research of the learning problem one can extract four
periods that can be characterized by four bright events:

(i) Constructing the first learning machines,

(ii) constructing the fundamentals of the theory,


(iii) constructing neural networks,
(iv) constructing the alternatives to neural networks.
In different periods, different subjects of research were considered to be im-
portant. Altogether this research forms a complicated (and contradictory)
picture of the exploration of the learning problem.

0.1 ROSENBLATT'S PERCEPTRON (THE 1960s)


More than 35 years ago F. Rosenblatt suggested the first model of a learning
machine, called the Perceptron; this is when the mathematical analysis of
learning processes truly began.1 From the conceptual point of view, the idea

1 Note that discriminant analysis as proposed in the 1930s by Fisher actually
did not consider the problem of inductive inference (the problem of estimating the
discriminant rules using the examples). This happened later, after Rosenblatt's
work. In the 1930s discriminant analysis was considered a problem of construct-
ing a decision rule separating two categories of vectors using given probability
distribution functions for these categories of vectors.

of the Perceptron was not new. It was discussed in the neurophysiological
literature for many years. Rosenblatt, however, did something unusual. He
described the model as a program for computers and demonstrated with
simple experiments that this model can generalize. The Perceptron was
constructed to solve pattern recognition problems; in the simplest case this
is the problem of constructing a rule for separating data of two different
categories using given examples.

The Perceptron Model


To construct such a rule the Perceptron uses adaptive properties of the
simplest neuron model (Rosenblatt, 1962). Each neuron is described by
the McCulloch-Pitts model, according to which the neuron has n inputs
x = (x^1, ..., x^n) ∈ X ⊂ R^n and one output y ∈ {-1, 1} (Fig. 0.1). The
output is connected with the inputs by the functional dependence

y = sign{(w · x) - b},

where (u · v) is the inner product between two vectors, b is a threshold
value, and sign(u) = 1 if u > 0 and sign(u) = -1 if u ≤ 0.
Geometrically speaking the neurons divide the space X into two regions:
a region where the output y takes the value 1 and a region where the output
y takes the value -1. These two regions are separated by the hyperplane

(w · x) - b = 0.

The vector w and the scalar b determine the position of the separating hy-
perplane. During the learning process the Perceptron chooses appropriate
coefficients of the neuron.
Rosenblatt considered a model that is a composition of several neurons:
he considered several levels of neurons, where outputs of neurons of the pre-
vious level are inputs for neurons of the next level (the output of one neuron
can be input to several neurons). The last level contains only one neuron.
Therefore, the (elementary) Perceptron has n inputs and one output.
Geometrically speaking, the Perceptron divides the space X into two
parts separated by a piecewise linear surface (Fig. 0.2). Choosing appropri-
ate coefficients for all neurons of the net, the Perceptron specifies two regions
in X space. These regions are separated by piecewise linear surfaces (not
necessarily connected). Learning in this model means finding appropriate
coefficients for all neurons using given training data.
In the 1960s it was not clear how to choose the coefficients simultane-
ously for all neurons of the Perceptron (the solution came 25 years later).

FIGURE 0.1. (a) Model of a neuron: y = sign{(w · x) - b}. (b) Geometrically, a
neuron defines two regions in input space where it takes the values -1 and 1.
These regions are separated by the hyperplane (w · x) - b = 0.

Therefore, Rosenblatt suggested the following scheme: to fix the coefficients
of all neurons, except for the last one, and during the training process to
try to find the coefficients of the last neuron. Geometrically speaking, he
suggested transforming the input space X into a new space Z (by choosing
appropriate coefficients of all neurons except for the last) and to use the
training data to construct a separating hyperplane in the space Z.
Following the traditional physiological concepts of learning with reward
and punishment stimulus, Rosenblatt proposed a simple algorithm for it-
eratively finding the coefficients.

FIGURE 0.2. (a) The Perceptron is a composition of several neurons. (b) Geo-
metrically, the Perceptron defines two regions in input space where it takes the
values -1 and 1. These regions are separated by a piecewise linear surface.

Let

(x_1, y_1), ..., (x_ℓ, y_ℓ)

be the training data given in input space and let

(z_1, y_1), ..., (z_ℓ, y_ℓ)

be the corresponding training data in Z (vector z_i is the transformed x_i).
At each time step k, let one element of the training data be fed into the
Perceptron. Denote by w(k) the coefficient vector of the last neuron at this
time. The algorithm consists of the following:
(i) If the next example of the training data z_{k+1}, y_{k+1} is classified cor-
rectly, i.e.,

y_{k+1} (w(k) · z_{k+1}) > 0,

then the coefficient vector of the hyperplane is not changed,

w(k + 1) = w(k).

(ii) If, however, the next element is classified incorrectly, i.e.,

y_{k+1} (w(k) · z_{k+1}) < 0,

then the vector of coefficients is changed according to the rule

w(k + 1) = w(k) + y_{k+1} z_{k+1}.

(iii) The initial vector w is zero:

w(1) = 0.

Using this rule the Perceptron demonstrated generalization ability on simple
examples.
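In code, the iterative rule (i)-(iii) is only a few lines. The following sketch
(in Python with NumPy; the function name and the fixed number of passes
are illustrative choices, not part of Rosenblatt's formulation) shows the
procedure applied to the transformed training data:

    import numpy as np

    def perceptron_train(Z, y, n_passes=100):
        # Z: (l, d) array of transformed vectors z_i; y: labels in {-1, +1}.
        w = np.zeros(Z.shape[1])              # (iii) the initial vector w is zero
        for _ in range(n_passes):             # present the sequence several times
            corrections = 0
            for z_i, y_i in zip(Z, y):
                if y_i * np.dot(w, z_i) <= 0: # (ii) element classified incorrectly
                    w = w + y_i * z_i         #      correction w(k+1) = w(k) + y z
                    corrections += 1
                # (i) correctly classified elements leave w unchanged
            if corrections == 0:              # all data separated: stop
                break
        return w

(The update tests y(w · z) ≤ 0 rather than < 0 so that the very first element,
seen with w = 0, triggers a correction.)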

Beginning the Analysis of Learning Processes


In 1962 Novikoff proved the first theorem about the Perceptron (Novikoff,
1962). This theorem actually started learning theory. It asserts that if

(i) the norm of the training vectors z is bounded by some constant R
(|z_i| ≤ R);

(ii) the training data can be separated with margin ρ:

sup_w min_i y_i (z_i · w) > ρ;

(iii) the training sequence is presented to the Perceptron a sufficient num-
ber of times,

then after at most

N ≤ [R²/ρ²]

corrections the hyperplane that separates the training data will be con-
structed.
This theorem played an extremely important role in creating learning
theory. It somehow connected the cause of generalization ability with the
principle of minimizing the number of errors on the training set. As we
will see in the last chapter, the expression [R²/ρ²] describes an impor-
tant concept that, for a wide class of learning machines, allows control of
generalization ability.
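As a small numerical illustration (mine, not the book's), both quantities in
this expression can be computed directly from separable data once some
unit-norm separating vector w is known:

    import numpy as np

    def novikoff_bound(Z, y, w):
        # Upper bound [R^2 / rho^2] on the number of Perceptron corrections,
        # assuming |w| = 1 and margin rho = min_i y_i (z_i . w) > 0.
        R = np.max(np.linalg.norm(Z, axis=1))  # bound on the norms |z_i|
        rho = np.min(y * (Z @ w))              # separation margin
        return (R / rho) ** 2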
Applied and Theoretical Analysis of Learning Processes


Novikoff proved that the Perceptron can separate training data. Using ex-
actly the same technique, one can prove that if the data are separable, then
after a finite number of corrections, the Perceptron separates any infinite
sequence of data (after the last correction the infinite tail of data will be
separated without error). Moreover, if one supplies the Perceptron with the
following stopping rule:
The Perceptron stops the learning process if after the correction
number k (k = 1, 2, ...), the next

m_k = (1 + 2 ln k − ln η) / (−ln(1 − ε))

elements of the training data do not change the decision rule
(they are recognized correctly),
then

(i) the Perceptron will stop the learning process during the first

ℓ ≤ [(1 + 4 ln(R/ρ) − ln η) / (−ln(1 − ε))] [R²/ρ²]

steps,
(ii) by the stopping moment it will have constructed a decision rule that
with probability 1 − η has a probability of error on the test set less
than ε (Aizerman, Braverman, and Rozonoer, 1964).
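The quantities in this stopping rule are straightforward to compute; here is
a hedged sketch (the function name is mine) of the number m_k of
consecutive error-free trials required after correction number k, for given
confidence parameters η and ε:

    import math

    def m_k(k, eta, eps):
        # Elements that must be recognized correctly after correction k
        # before the Perceptron may stop, per the rule quoted above.
        return (1 + 2 * math.log(k) - math.log(eta)) / (-math.log(1 - eps))

If the Perceptron survives m_k consecutive trials without a correction, it
stops; with probability 1 − η its error probability is then below ε.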
After these results many researchers thought that minimizing the error
on the training set is the only cause of generalization (small probability of
test errors). Therefore the analysis of learning processes was split into two
branches, call them Applied Analysis of learning processes and Theoretical
Analysis of learning processes.
The philosophy of Applied Analysis of the learning process can be de-
scribed as follows:
To get a good generalization it is sufficient to choose the coeffi-
cients of the neuron that provide the minimal number of train-
ing errors. The principle of minimizing the number of training
errors is a self-evident inductive principle and from the practi-
cal point of view does not need justification. The main goal of
Applied Analysis is to find methods for constructing the coef-
ficients simultaneously for all neurons such that the separating
surface provides the minimal number of errors on the training
data.
The philosophy of Theoretical Analysis of learning processes is different.

The principle of minimizing the number of training errors is not
self-evident and needs to be justified. It is possible that there
exists another inductive principle that provides a better level
of generalization ability. The main goal of Theoretical Analy-
sis of learning processes is to find the inductive principle with
the highest level of generalization ability and to construct algo-
rithms that realize this inductive principle.

This book shows that indeed the principle of minimizing the number
of training errors is not self-evident and that there exists another more
intelligent inductive principle that provides a better level of generalization
ability.

0.2 CONSTRUCTION OF THE FUNDAMENTALS OF THE LEARNING THEORY (THE 1960-1970s)

As soon as the experiments with the Perceptron became widely known,
other types of learning machines were suggested (such as the Madaline,
constructed by B. Widrow, or the Learning Matrices constructed by K.
Steinbuch; in fact they started construction of special learning hardware).
However in contrast to the Perceptron these machines were considered from
the very beginning as tools for solving real-life problems rather than as a
general model of the learning phenomenon.
For solving real-life problems, a lot of computer programs were also de-
veloped, including programs for constructing logical functions of different
types (e.g., Decision Trees, initially intended for Expert Systems), or Hid-
den Markov Models (for speech recognition problems). These programs also
did not affect the study of the general learning phenomena.
The next step in constructing a general type of learning machine was taken
in 1986 when the so-called back-propagation technique for finding the
weights simultaneously for many neurons was used. This method actually
started a new era in the history of learning machines. We will discuss it in
the next section. In this section we concentrate on the history of developing
the fundamentals of learning theory.
In contrast to the Applied Analysis, where during the time between con-
structing the Perceptron (1960) and implementing the back-propagation
technique (1986) nothing extraordinary happened, these years were extremely
fruitful for developing statistical learning theory.
0.2.1 Theory of the Empirical Risk Minimization Principle


As early as 1968, a philosophy of statistical learning theory had been devel-
oped. The essential concepts of the emerging theory, VC entropy and VC
dimension, had been discovered and introduced for the set of indicator func-
tions (i.e., for the pattern recognition problem). Using these concepts, the
Law of Large Numbers in functional space (necessary and sufficient con-
ditions for uniform convergence of the frequencies to their probabilities)
was found, its relation to learning processes was described, and the main
nonasymptotic bounds for the rate of convergence were obtained (Vapnik
and Chervonenkis, 1968); complete proofs were published by 1971 (Vap-
nik and Chervonenkis, 1971). The obtained bounds made the introduction
of a novel inductive principle possible (Structural Risk Minimization in-
ductive principle, 1974), completing the development of pattern recogni-
tion learning theory. The new paradigm for pattern recognition theory was
summarized in a monograph.2
Between 1976 and 1981, the results, initially obtained for the set of in-
dicator functions, were generalized for the set of real functions: the Law
of Large Numbers (necessary and sufficient conditions for uniform conver-
gence of means to their expectations), the bounds on the rate of uniform
convergence both for the set of totally bounded functions and for the set of
unbounded functions, and the Structural Risk Minimization principle. In
1979 these results were summarized in a monograph3 describing the new
paradigm for the general problem of dependencies estimation.
Finally, in 1989 the necessary and sufficient conditions for consistency4 of
the Empirical Risk Minimization inductive principle and Maximum Likeli-
hood method were found, completing the analysis of Empirical Risk Mini-
mization inductive inference (Vapnik and Chervonenkis, 1989).
Building on 30 years of analysis of learning processes, in the 1990s the
synthesis of novel learning machines controlling generalization ability has
begun.

These results were inspired by the study of learning processes. They are
the main subject of the book.

2 V. Vapnik and A. Chervonenkis, Theory of Pattern Recognition (in Russian),
Nauka, Moscow, 1974.
German translation: W. N. Wapnik, A. Ja. Tscherwonenkis, Theorie der Zei-
chenerkennung, Akademie-Verlag, Berlin, 1979.
3 V. N. Vapnik, Estimation of Dependencies Based on Empirical Data (in Rus-
sian), Nauka, Moscow, 1979.
English translation: Vladimir Vapnik, Estimation of Dependencies Based on
Empirical Data, Springer, New York, 1982.
4 Convergence in probability to the best possible result. An exact definition of
consistency is given in Section 2.1.
0.2.2 Theory of Solving Ill-Posed Problems


In the 1960-1970s, in various branches of mathematics, some new theories
were constructed that became very important for creating a new philosophy.
Below we list some of these theories. They also will be discussed in the
Comments on the Chapters.
Let us start with the regularization theory for the solution of so-called
ill-posed problems.
In the early 1900s Hadamard observed that under some (very general)
circumstances the problem of solving (linear) operator equations

Af = F,  f ∈ F

(finding f ∈ F that satisfies the equality) is ill-posed; even if there exists a
unique solution to this equation, a small deviation on the right-hand side
of this equation (F_δ instead of F, where ||F − F_δ|| < δ is arbitrarily small)
can cause large deviations in the solutions (it can happen that ||f_δ − f|| is
large).
In this case if the right-hand side F of the equation is not exact (e.g., it
equals F_δ, where F_δ differs from F by some level δ of noise) the functions
f_δ that minimize the functional

R(f) = ||Af − F_δ||²

do not guarantee a good approximation to the desired solution even if δ
tends to zero.
Hadamard thought that ill-posed problems are a pure mathematical phe-
nomenon and that all real-life problems are "well-posed." However in the
second half of the century a number of very important real-life problems
were found to be ill-posed. In particular ill-posed problems arise when one
tries to reverse the cause-effect relations: to find unknown causes from the
known consequences. Even if the cause-effect relationship forms a one-to-
one mapping, the problem of inverting it can be ill-posed.
For our discussion it is important that one of the main problems of
statistics, estimating the density function from the data, is ill-posed.
In the middle of the 1960s it was discovered that if instead of the func-
tional R(f) one minimizes another so-called regularized functional

R*(f) = ||Af − F_δ||² + γ(δ) Ω(f),

where Ω(f) is some functional (that belongs to a special type of function-
als) and γ(δ) is an appropriately chosen constant (depending on the level
of noise), then one obtains a sequence of solutions that converges to the de-
sired one as δ tends to zero (Tikhonov, 1963), (Ivanov, 1962), and (Phillips,
1962).
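For a finite-dimensional (discretized) operator A, minimizing the regularized
functional with the particular choice Ω(f) = ||f||² reduces to a linear system.
The following is a minimal sketch of that special case (with γ fixed by hand
rather than tied to the noise level δ, as the theory prescribes):

    import numpy as np

    def regularized_solution(A, F_delta, gamma):
        # Minimize ||A f - F_delta||^2 + gamma * ||f||^2.
        # Setting the gradient to zero gives the normal equations
        # (A^T A + gamma I) f = A^T F_delta.
        n = A.shape[1]
        return np.linalg.solve(A.T @ A + gamma * np.eye(n), A.T @ F_delta)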
The regularization theory was one of the first signs of the existence of in-
telligent inference. It demonstrated that whereas the "self-evident" method
of minimizing the functional R(f) does not work, the not "self-evident"
method of minimizing the functional R* (f) does.
The influence of the philosophy created by the theory of solving ill-posed
problems is very deep. Both the regularization philosophy and the regular-
ization technique became widely spread in many areas of science, including
statistics.

0.2.3 Nonparametric Methods of Density Estimation


In particular, the problem of density estimation from a rather wide set of
densities is ill-posed. Estimating densities from some narrow set of densi-
ties (say from a set of densities determined by a finite number of param-
eters, i.e., from a so-called parametric set of densities) was the subject of
the classical paradigm, where a "self-evident" type of inference (the max-
imum likelihood method) was used. An extension of the set of densities
from which one has to estimate the desired one makes it impossible to
use the "self-evident" type of inference. To estimate a density from the
wide (nonparametric) set requires a new type of inference that contains
regularization techniques. In the 1960s several such types of (nonparamet-
ric) algorithms were suggested (M. Rosenblatt, 1956), (Parzen, 1962), and
(Chentsov, 1962); in the middle of the 1970s the regular way for creating
these kinds of algorithms on the basis of standard procedures for solving
ill-posed problems was found (Vapnik and Stefanyuk, 1978).
Nonparametric methods of density estimation gave rise to statistical
algorithms that overcame the shortcomings of the classical paradigm. Now
one could estimate functions from a wide set of functions.
One has to note, however, that these methods are intended for estimating
a function using large sample sizes.

0.2.4 The Idea of Algorithmic Complexity


Finally, in the 1960s one of the greatest ideas of statistics and information
theory was suggested: the idea of Algorithmic Complexity (Solomonoff,
1960), (Kolmogorov, 1965), and (Chaitin, 1966). Two fundamental ques-
tions that at first glance look different have inspired this idea:
(i) What is the nature of inductive inference (Solomonoff)?
(ii) What is the nature of randomness (Kolmogorov), (Chaitin)?
The answers to these questions proposed by Solomonoff, Kolmogorov,
and Chaitin started the information theory approach to the problem of
inference.
The idea of the randomness concept can be roughly described as follows:
a rather large string of data forms a random string if there are no algo-
rithms whose complexity is much less than ℓ, the length of the string,
that can generate this string. The complexity of an algorithm is described
by the length of the smallest program which embodies that algorithm. It
was proved that the concept of algorithmic complexity is universal (it is
determined up to an additive constant reflecting the type of computer).
Moreover, it was proved that if the description of the string cannot be
compressed using computers, then the string possesses all properties of a
random sequence.
This implies the idea that if one can significantly compress the descrip-
tion of the given string, then the algorithm used describes intrinsic prop-
erties of the data.
In the 1970s, on the basis of these ideas, Rissanen suggested the Mini-
mum Description Length (MDL) inductive inference for learning problems
(Rissanen, 1978).
In Chapter 4 we consider this principle.

All these new ideas are still being developed. However they did shift
the main understanding as to what can be done in the problem of depen-
dency estimation on the basis of a limited amount of empirical data.

0.3 NEURAL NETWORKS (THE 1980s)


0.3.1 Idea of Neural Networks
In 1986 several authors independently proposed a method for simultane-
ously constructing the vector-coefficients for all neurons of the Perceptron
using the so-called Back-Propagation method (LeCun, 1986), (Rumelhart,
Hinton, and Williams, 1986). The idea of this method is extremely sim-
ple. If instead of the McCulloch-Pitts model of the neuron one considers a
slightly modified model, where the discontinuous function sign{(w · x) - b}
is replaced by the continuous so-called sigmoid approximation (Fig. 0.3)

y = S{(w · x) - b}

(here S(u) is a monotonic function with the properties

S(-∞) = -1,  S(+∞) = 1,

e.g., S(u) = tanh u), then the composition of the new neurons is a con-
tinuous function that for any fixed x has a gradient with respect to all
coefficients of all neurons. In 1986 the method for evaluating this gradi-
ent was found.5 Using the evaluated gradient one can apply any gradient-

5 The back-propagation method was actually found in 1963 for solving some
control problems (Bryson, Denham, and Dreyfus, 1963) and was rediscovered for
Perceptrons.
FIGURE 0.3. The discontinuous function sign(u) = ±1 is approximated by the
smooth function S(u).

based technique for constructing a function that approximates the desired
function. Of course gradient-based techniques only guarantee finding local
minima. Nevertheless it looked as if the main idea of applied analysis of
learning processes was found and the problem was in its implementation.
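A minimal sketch of the point being made: with S(u) = tanh(u) the output
of a single neuron is differentiable in its coefficients, so they can be adjusted
by any gradient procedure. The squared loss and the learning rate below are
illustrative assumptions, not part of the original back-propagation papers:

    import numpy as np

    def neuron_gradient_step(w, b, x, y_target, lr=0.1):
        # One gradient step for a sigmoid neuron y = tanh((w . x) - b)
        # under the squared loss (y - y_target)^2 / 2.
        u = np.dot(w, x) - b
        y = np.tanh(u)
        delta = (y - y_target) * (1.0 - y ** 2)  # chain rule; tanh'(u) = 1 - tanh(u)^2
        return w - lr * delta * x, b + lr * delta

For a composition of such neurons the same chain rule, applied layer by
layer, yields the gradient with respect to all coefficients; that is the content
of the back-propagation method.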

0.3.2 Simplification of the Goals of Theoretical Analysis


The discovery of the back-propagation technique can be considered as the
second birth of the Perceptron. This birth, however, happened in a com-
pletely different situation. Since 1960 powerful computers had appeared;
moreover, new branches of science had become involved in research on the
learning problem. This essentially changed the scale and the style of re-
search.
In spite of the fact that one cannot assert for sure that the generalization
properties of the Perceptron with many adjustable neurons are better than
the generalization properties of the Perceptron with only one adjustable
neuron and approximately the same number of free parameters, the scien-
tific community was much more enthusiastic towards this new method due
to the scale of experiments.
Rosenblatt's first experiments were conducted for the problem of digit
recognition. To demonstrate the generalization ability of the Perceptron,
Rosenblatt used training data consisting of several hundreds of vectors,
containing several dozen coordinates. In the 1980s and even now in the
1990s the problem of digit recognition learning continues to be important.


Today, in order to obtain good decision rules one uses tens (even hundreds)
of thousands of observations over vectors with several hundreds of coordi-
nates. This required special organization of the computational processes.
Therefore in the 1980s researchers in Artificial Intelligence became the main
players in the computational learning game. Among Artificial Intelligence
researchers the hardliners had considerable influence (it is precisely they
who declared that "Complex theories do not work, simple algorithms do.")

Artificial Intelligence hardliners approached the learning problem with
great experience in constructing "simple algorithms" for the problems where
theory is very complicated. At the end of the 1960s machine language trans-
lators were promised within a couple of years (even now this extremely com-
plicated problem is far from being solved); the next project was construct-
ing a General Problem Solver; after this came the project of constructing an
Automatic Controller of Large Systems, etc. All of these projects had little
success. The next problem to be investigated was creating a computational
learning technology.
First the hardliners changed the terminology. In particular, the Percep-
tron was renamed the Neural Network. Then a joint research program
with physiologists was declared, and the study of the learning problem became
less general, more subject oriented. In the 1960s-1970s the main goal of
research was finding the best way for inductive inference from small sample
sizes. In the 1980s the goal became constructing a model of generalization
that uses the brain.6

The attempt to introduce theory to the Artificial Intelligence community
was made in 1984 when the Probably Approximately Correct (PAC) model
was suggested.7 This model is defined by a particular case of the consis-
tency concept commonly used in statistics in which some requirements on
computational complexity were incorporated.8
In spite of the fact that almost all results in the PAC model were adopted
from Statistical Learning Theory and constitute particular cases of one of
its four parts (namely, the theory of bounds), this model undoubtedly had

6 Of course it is very interesting to know how humans can learn. However, this
is not necessarily the best way for creating an artificial learning machine. It has
been noted that the study of birds flying was not very useful for constructing the
airplane.
7 L. G. Valiant, 1984, "A theory of learnability", Commun. ACM 27(11),
1134-1142.
8 "If the computational requirement is removed from the definition then we
are left with the notion of nonparametric inference in the sense of statistics, as
discussed in particular by Vapnik." (L. Valiant, 1991, "A view of computational
learning theory" , In the book: "Computation and Cognition", Society for Indus-
trial and Applied Mathematics, Philadelphia, p. 36.)
the merit of bringing the importance of statistical analysis to the attention
of the Artificial Intelligence community. This, however, was not sufficient
to influence the development of new learning technologies.
Almost ten years have passed since the Perceptron was born a sec-
ond time. From the conceptual point of view, its second birth was less
important than the first one. In spite of important achievements in some
specific applications using neural networks, the theoretical results obtained
did not contribute much to general learning theory. Also no new inter-
esting learning phenomena were found in experiments with neural nets.
The so-called "overfitting" phenomenon observed in experiments is actu-
ally a phenomenon of "false structure" known in the theory for solving ill-
posed problems. From the theory of solving ill-posed problems tools were
adopted that prevent overfitting - using regularization techniques in the
algorithms.
Therefore, almost ten years of research in neural nets did not substan-
tially advance the understanding of the essence of learning processes.

0.4 RETURNING TO THE ORIGIN (THE 1990s)


In the last couple of years something changed in relation to neural networks.
More attention is now focused on the alternatives to neural nets; for
example, much effort has been devoted to the study of the Radial Basis
Functions method (see the review in (Powell, 1992)). As in the 1960s, neural
networks are called multi-layer perceptrons again. The advanced parts of
Statistical Learning Theory now attract more researchers. In particular in
the last few years both the Structural Risk Minimization principle and
the Minimum Description Length principle became popular subjects of
analysis. Discussions of small sample size theory, in contrast to the
asymptotic one, became widespread.
It looks like everything is returning to its fundamentals.
In addition, Statistical Learning Theory now plays a more active role:
after the completion of the general analysis of learning processes, the re-
search in the area of the synthesis of optimal algorithms (that possess the
highest level of generalization ability for any number of observations) was
started.
These studies, however, do not belong to history yet. They are a subject
of today's research activities.
Chapter 1
Setting of the Learning Problem

In this book we consider the learning problem as a problem of finding a
desired dependence using a limited number of observations.

1.1 FUNCTION ESTIMATION MODEL


We describe the general model of learning from examples through three
components (Fig. 1.1):
(i) A generator (G) of random vectors x ∈ R^n, drawn independently
from a fixed but unknown probability distribution function F(x).
(ii) A supervisor (S) who returns an output value y to every input vector
x, according to a conditional distribution function1 F(y|x), also fixed
but unknown.

(iii) A learning machine (LM) capable of implementing a set of functions
f(x, α), α ∈ Λ, where Λ is a set of parameters.2
The problem of learning is that of choosing from the given set of functions
f(x, α), α ∈ Λ, the one which best approximates the supervisor's response.

1 This is the general case, which includes the case where the supervisor uses a
function y = f(x).
2 Note that the elements α ∈ Λ are not necessarily vectors. They can be any
abstract parameters. Therefore, we in fact consider any set of functions.
FIGURE 1.1. A model of learning from examples. During the learning process,
the Learning Machine observes the pairs (x, y) (the training set). After training,
the machine must on any given x return a value ŷ. The goal is to return a value
ŷ which is close to the supervisor's response y.

The selection of the desired function is based on a training set of ℓ inde-
pendent and identically distributed (i.i.d.) observations drawn according to
F(x, y) = F(x)F(y|x):

(x_1, y_1), ..., (x_ℓ, y_ℓ).    (1.1)

1.2 THE PROBLEM OF RISK MINIMIZATION


In order to choose the best available approximation to the supervisor's
response, one measures the loss, or discrepancy, L(y, f(x, α)) between the
response y of the supervisor to a given input x and the response f(x, α)
provided by the learning machine. Consider the expected value of the loss,
given by the risk functional

R(α) = ∫ L(y, f(x, α)) dF(x, y).    (1.2)

The goal is to find the function f(x, α_0) which minimizes the risk functional
R(α) (over the class of functions f(x, α), α ∈ Λ) in the situation where
the joint probability distribution function F(x, y) is unknown and the only
available information is contained in the training set (1.1).

1.3 THREE MAIN LEARNING PROBLEMS

This formulation of the learning problem is rather broad. It encompasses
many specific problems. Consider the main ones: the problems of pattern
recognition, regression estimation, and density estimation.
1.3.1 Pattern Recognition


Let the supervisor's output y take on only two values, y ∈ {0, 1}, and let
f(x, α), α ∈ Λ, be a set of indicator functions (functions which take only
two values: zero and one). Consider the following loss function:

L(y, f(x, α)) = 0 if y = f(x, α),  and  L(y, f(x, α)) = 1 if y ≠ f(x, α).    (1.3)

For this loss function, the functional (1.2) determines the probability of
different answers given by the supervisor and by the indicator function
f(x, α). We call the case of different answers a classification error.
The problem, therefore, is to find a function which minimizes the prob-
ability of classification error when the probability measure F(x, y) is un-
known, but the data (1.1) are given.

1.3.2 Regression Estimation


Let the supervisor's answer y be a real value, and let f(x, α), α ∈ Λ, be a
set of real functions which contains the regression function

f(x, α_0) = ∫ y dF(y|x).

It is known that the regression function is the one which minimizes the
functional (1.2) with the following loss function3:

L(y, f(x, α)) = (y − f(x, α))².    (1.4)

Thus the problem of regression estimation is the problem of minimizing the
risk functional (1.2) with the loss function (1.4) in the situation where the
probability measure F(x, y) is unknown but the data (1.1) are given.

1.3.3 Density Estimation (Fisher-Wald Setting)


Finally, consider the problem of density estimation from the set of densities
p(x, α), α ∈ Λ. For this problem we consider the following loss function:

L(p(x, α)) = −log p(x, α).    (1.5)

3 If the regression function f(x) does not belong to f(x, α), α ∈ Λ, then the
function f(x, α_0) minimizing the functional (1.2) with loss function (1.4) is the
closest to the regression in the metric L_2(F):

ρ(f(x), f(x, α_0)) = √(∫ (f(x) − f(x, α_0))² dF(x)).


It is known that the desired density minimizes the risk functional (1.2)
with the loss function (1.5). Thus, again, to estimate the density from the
data one has to minimize the risk functional under the condition that the
corresponding probability measure F(x) is unknown but i.i.d. data

x_1, ..., x_ℓ

are given.

1.4 THE GENERAL SETTING OF THE LEARNING PROBLEM

The general setting of the learning problem can be described as follows.
Let the probability measure F(z) be defined on the space Z. Consider the
set of functions Q(z, α), α ∈ Λ. The goal is to minimize the risk functional

R(α) = ∫ Q(z, α) dF(z),  α ∈ Λ,    (1.6)

where the probability measure F(z) is unknown but an i.i.d. sample

z_1, ..., z_ℓ    (1.7)

is given.
The learning problems considered above are particular cases of this gen-
eral problem of minimizing the risk functional (1.6) on the basis of empirical
data (1.7), where z describes a pair (x, y) and Q(z, α) is the specific loss
function (e.g., one of Eqs. (1.3), (1.4), or (1.5)). In the following we will
describe the results obtained for the general statement of the problem. To
apply them to specific problems, one has to substitute the corresponding
loss-functions in the formulas obtained.

1.5 THE EMPIRICAL RISK MINIMIZATION (ERM) INDUCTIVE PRINCIPLE

In order to minimize the risk functional (1.6) with an unknown distribution
function F(z), the following inductive principle can be applied:
(i) The risk functional R(α) is replaced by the so-called empirical risk
functional

R_emp(α) = (1/ℓ) Σ_{i=1}^{ℓ} Q(z_i, α)    (1.8)

constructed on the basis of the training set (1.7).


(ii) One approximates the function Q(z, α_0) which minimizes risk (1.6)
by the function Q(z, α_ℓ) minimizing the empirical risk (1.8).
This principle is called the Empirical Risk Minimization inductive principle
(ERM principle).
We say that an inductive principle defines a learning process if for any
given set of observations the learning machine chooses the approximation
using this inductive principle. In learning theory the ERM principle plays
a crucial role.
The ERM principle is quite general. The classical methods for the solu-
tion of a specific learning problem, such as the least-squares method in the
problem of regression estimation or the maximum likelihood (ML) method
in the problem of density estimation, are realizations of the ERM principle
for the specific loss-functions considered above.
Indeed, substituting the specific loss function (1.4) in Eq. (1.8) one ob-
tains the functional to be minimized

R_emp(α) = (1/ℓ) Σ_{i=1}^{ℓ} (y_i − f(x_i, α))²,

which forms the least-squares method, while substituting the specific loss
function (1.5) in Eq. (1.8) one obtains the functional to be minimized

R_emp(α) = −(1/ℓ) Σ_{i=1}^{ℓ} log p(x_i, α).

Minimizing this functional is equivalent to the ML method (the latter uses
a plus sign on the right-hand side).
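As a sketch of the ERM principle for these two cases (the linear model
f(x, α) = α · x and the function names are illustrative assumptions, not the
book's):

    import numpy as np

    def empirical_risk_squared(alpha, X, y):
        # R_emp(alpha) for the loss (1.4); minimizing it is least squares.
        residuals = y - X @ alpha       # y_i - f(x_i, alpha) for a linear model
        return np.mean(residuals ** 2)

    def empirical_risk_density(alpha, X, density):
        # R_emp(alpha) for the loss (1.5); minimizing it is the ML method
        # (up to the sign convention mentioned above).
        return -np.mean(np.log(density(X, alpha)))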

1.6 THE FOUR PARTS OF LEARNING THEORY

Learning theory has to address the following four questions:


(i) What are (necessary and sufficient) conditions for consistency of a
learning process based on the ERM principle?

(ii) How fast is the rate of convergence of the learning process?

(iii) How can one control the rate of convergence (the generalization abil-
ity) of the learning process?
(iv) How can one construct algorithms that can control the generalization
ability?
The answers to these questions form the four parts of learning theory:
(i) Theory of consistency of learning processes.


(ii) Nonasymptotic theory of the rate of convergence of learning pro-
cesses.
(iii) Theory of controlling the generalization ability of learning processes.
(iv) Theory of constructing learning algorithms.

Each of these four parts will be discussed in the following chapters.


Informal Reasoning and Comments - 1

The setting of learning problems given in Chapter 1 reflects two major
requirements:

(i) To estimate the desired function from a wide set of functions.

(ii) To estimate the desired function on the basis of a limited number of


examples.

The methods developed in the framework of the classical paradigm (created
in the 1920s-1930s) did not take into account these requirements. There-
fore in the 1960s considerable effort was put into both the generalization of
classical results for wider sets of functions and the improvement of existing
techniques of statistical inference for small sample sizes. In the following
we will describe some of these efforts.

1.7 THE CLASSICAL PARADIGM OF SOLVING LEARNING PROBLEMS

In the framework of the classical paradigm all models of function estimation
are based on the maximum likelihood method. It forms an inductive engine
in the classical paradigm.

1.7.1 Density Estimation Problem (ML Method)


Let p(x, α), α ∈ Λ, be a set of density functions where (in contrast to the
setting of the problem described in the chapter) the set Λ is necessarily
contained in R^n (α is an n-dimensional vector). Let the unknown density
p(x, α_0) belong to this class. The problem is to estimate this density using
i.i.d. data

x_1, ..., x_ℓ

(distributed according to this unknown density).


In the 1920s Fisher developed the ML method for estimating the unknown parameters of the density (Fisher, 1952). He suggested approximating the unknown parameters by the values that maximize the functional
$$ L(\alpha) = \sum_{i=1}^{\ell} \ln p(x_i, \alpha). $$
Under some conditions the ML method is consistent. In the next chapter
we use results on the Law of Large Numbers in functional space to describe
the necessary and sufficient conditions for consistency of the ML method.
In the following we show how by using the ML method one can estimate a
desired function.
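For the normal family the maximizer of L(α) has a closed form, and the consistency of the ML method is easy to observe numerically; the following minimal sketch (with assumed true parameters) illustrates it.

```python
import numpy as np

# Sketch: consistency of the ML method for a one-dimensional normal family,
# where the maximizer of L(a) is (sample mean, sample standard deviation);
# the true parameters (2.0, 0.5) are assumptions of the demo.
rng = np.random.default_rng(1)
for l in [10, 100, 1000, 100000]:
    x = rng.normal(loc=2.0, scale=0.5, size=l)
    mu_hat = x.mean()                 # argmax of L over the mean
    sigma_hat = x.std()               # argmax of L over the standard deviation
    print(l, mu_hat, sigma_hat)       # approaches (2.0, 0.5) as l grows
```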

1.7.2 Pattern Recognition (Discriminant Analysis) Problem


Using the ML technique, Fisher considered a problem of pattern recognition
(he called it discriminant analysis). He proposed the following model:
There exist two categories of data distributed according to two different statistical laws, p₁(x, α*) and p₂(x, β*) (densities belonging to parametric classes). Let the probability of occurrence of the first category of data be q₁ and the probability of the second category be 1 − q₁. The problem is to find a decision rule that minimizes the probability of error.

Knowing these two statistical laws and the value q₁, one can immediately construct such a rule: the smallest probability of error is achieved by the decision rule that considers the vector x as belonging to the first category if the probability that this vector belongs to the first category is not less than the probability that it belongs to the second category. This happens if the following inequality holds:
$$ q_1\, p_1(x, \alpha^*) \geq (1 - q_1)\, p_2(x, \beta^*). $$
One considers this rule in the equivalent form
$$ f(x) = \operatorname{sign}\left\{ \ln p_1(x, \alpha^*) - \ln p_2(x, \beta^*) + \ln\frac{q_1}{1 - q_1} \right\} \qquad (1.9) $$


1. 7. The Classical Paradigm of Solving Learning Problems 23

called the discriminant function (rule), which assigns the value 1 to representatives of the first category and the value −1 to representatives of the second category. To find the discriminant rule one has to estimate two densities, p₁(x, α) and p₂(x, β). In the classical paradigm one uses the ML method to estimate the parameters α* and β* of these densities.

1.7.3 Regression Estimation Model


Regression estimation in the classical paradigm is based on another model,
the so-called model of measuring a function with additive noise:

Suppose that an unknown function has the parametric form
$$ f_0(x) = f(x, \alpha_0), $$
where α₀ ∈ Λ is an unknown vector of parameters. Suppose also that at any point x_i one can measure the value of this function with additive noise,
$$ y_i = f(x_i, \alpha_0) + \xi_i, $$
where the noise ξ_i does not depend on x_i and is distributed according to a known density function p(ξ). The problem is to estimate the function f(x, α₀) from the set f(x, α), α ∈ Λ, using the data obtained by measurements of the function f(x, α₀) corrupted with additive noise.

In this model, using the observations of pairs
$$ (x_1, y_1), \ldots, (x_\ell, y_\ell), $$
one can estimate the parameters α₀ of the unknown function f(x, α₀) by the ML method, namely by maximizing the functional
$$ L(\alpha) = \sum_{i=1}^{\ell} \ln p(y_i - f(x_i, \alpha)). $$
(Recall that p(ξ) is a known function and that ξ = y − f(x, α₀).) Taking the normal law
$$ p(\xi) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{ -\frac{\xi^2}{2\sigma^2} \right\} $$
with zero mean and some fixed variance as a model of noise, one obtains the least-squares method:
$$ L^*(\alpha) = -\frac{1}{2\sigma^2} \sum_{i=1}^{\ell} (y_i - f(x_i, \alpha))^2 - \ell \ln(\sqrt{2\pi}\,\sigma). $$

Maximizing L*(α) over the parameters α is equivalent to minimizing the functional
$$ M(\alpha) = \sum_{i=1}^{\ell} (y_i - f(x_i, \alpha))^2 $$
(the so-called least-squares functional).


Choosing other laws p(ξ), one can obtain other methods for parameter
estimation.⁴

1.7.4 Narrowness of the ML Method


Thus, in the classical paradigm the solutions to all problems of dependency
estimation described in this chapter are based on the ML method. This
method, however, can fail in the simplest cases. Below we demonstrate
that using the ML method it is impossible to estimate the parameters of a
density which is a mixture of normal densities. To show this it is sufficient
to analyse the simplest case described in the following example.
Example. Using the ML method it is impossible to estimate a density that is the simplest mixture of two normal densities,
$$ p(x; a, \sigma) = \frac{1}{2\sigma\sqrt{2\pi}} \exp\left\{ -\frac{(x - a)^2}{2\sigma^2} \right\} + \frac{1}{2\sqrt{2\pi}} \exp\left\{ -\frac{x^2}{2} \right\}, $$
where only the parameters (a, σ) of one density are unknown.


Indeed, for any data x₁, ..., x_ℓ and for any given constant A, there exists such a small σ = σ₀ that for a = x₁ the likelihood will exceed A:
$$ L(a = x_1, \sigma_0) = \sum_{i=1}^{\ell} \ln p(x_i;\, a = x_1, \sigma_0) $$
$$ > \ln\frac{1}{2\sigma_0\sqrt{2\pi}} + \sum_{i=2}^{\ell} \ln\left( \frac{1}{2\sqrt{2\pi}} \exp\left\{ -\frac{x_i^2}{2} \right\} \right) $$
$$ = -\ln\sigma_0 - \sum_{i=2}^{\ell} \frac{x_i^2}{2} - \ell \ln\left( 2\sqrt{2\pi} \right) > A. $$

⁴In 1964 P. Huber extended the classical model of regression estimation by introducing the so-called robust regression estimation model. According to this model, instead of an exact model of the noise p(ξ), one is given a set of density functions (satisfying quite general conditions) to which this function belongs. The problem is to construct, for the given parametric set of functions and for the given set of density functions, an estimator that possesses minimax properties (provides the best approximation for the worst density from the set). The solution to this problem actually has the following form: choose an appropriate density function and then estimate the parameters using the ML method (Huber, 1964).

From this inequality one concludes that the maximum of the likelihood
does not exist, and therefore the ML method does not provide a solution
for estimating the parameters a and σ.
Thus, the ML method can only be applied to a very restrictive set of
densities.
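The divergence in this example is easy to observe numerically. The following sketch evaluates the mixture log-likelihood at a = x₁ for decreasing σ; the simulated sample and the grid of σ values are assumptions of the demo, and any data set exhibits the same behavior.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=50)

def mixture_log_likelihood(a, sigma):
    # the two-component mixture density defined above, evaluated at each x_i
    comp1 = np.exp(-(x - a) ** 2 / (2.0 * sigma ** 2)) / (2.0 * sigma * np.sqrt(2.0 * np.pi))
    comp2 = np.exp(-x ** 2 / 2.0) / (2.0 * np.sqrt(2.0 * np.pi))
    return np.sum(np.log(comp1 + comp2))

for sigma in [1.0, 1e-2, 1e-4, 1e-8]:
    print(sigma, mixture_log_likelihood(a=x[0], sigma=sigma))
# L(a = x_1, sigma) exceeds any constant A once sigma is small enough.
```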

1.8 NONPARAMETRIC METHODS OF DENSITY ESTIMATION

In the beginning of the 1960s several authors suggested various new meth-
ods, so-called nonparametric methods, for density estimation. The goal of
these methods was to estimate a density from a rather wide set of functions
that is not restricted to be a parametric set of functions (M. Rosenblatt,
1957), (Parzen, 1962), and (Chentsov, 1963).

1.8.1 Parzen's Windows


Among these methods the Parzen windows method is probably the most popular. According to this method, one first has to determine the so-called kernel function. For simplicity we consider a simple kernel function,
$$ K_{\gamma_\ell}(x, x_i) = \frac{1}{\gamma_\ell} K\left( \frac{x - x_i}{\gamma_\ell} \right), $$
where K(u) is a symmetric unimodal density function.

Using this function one determines the estimator
$$ p_\ell(x) = \frac{1}{\ell} \sum_{i=1}^{\ell} \frac{1}{\gamma_\ell} K\left( \frac{x - x_i}{\gamma_\ell} \right). $$
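A minimal sketch of such an estimator follows; the Gaussian kernel and the fixed bandwidth γ are assumptions of the illustration, not prescriptions of the theory.

```python
import numpy as np

def parzen_estimate(t, sample, gamma):
    # p_l(t) = (1 / (l * gamma)) * sum_i K((t - x_i) / gamma),
    # here with the Gaussian kernel K(u) = exp(-u^2 / 2) / sqrt(2 pi)
    u = (np.asarray(t)[:, None] - sample[None, :]) / gamma
    K = np.exp(-u ** 2 / 2.0) / np.sqrt(2.0 * np.pi)
    return K.mean(axis=1) / gamma

rng = np.random.default_rng(3)
sample = rng.normal(size=1000)                    # i.i.d. data from N(0, 1)
grid = np.linspace(-3.0, 3.0, 7)
print(parzen_estimate(grid, sample, gamma=0.3))   # roughly the N(0, 1) density
```

In practice the bandwidth γ_ℓ is decreased with the sample size; the theory cited below concerns exactly that asymptotic regime.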

In the 1970s a comprehensive asymptotic theory for Parzen-type nonparametric density estimation was developed (Devroye, 1985). It includes the following two important assertions:

(i) Parzen's estimator is consistent (in various metrics) for estimating a density from a very wide class of densities.

(ii) The asymptotic rate of convergence for Parzen's estimator is optimal


for "smooth" densities.

The same results were obtained for other types of estimators.


Therefore, for both classical models (discriminant analysis and regression
estimation) using nonparametric methods instead of parametric methods,

one can obtain a good approximation to the desired dependency if the


number of observations is sufficiently large.
Experiments with nonparametric estimators, however, did not demonstrate great advantages over the old techniques. This indicates that nonparametric methods, when applied to a limited number of observations, do not possess their remarkable asymptotic properties.

1.8.2 The Problem of Density Estimation is Ill-Posed


Nonparametric statistics were developed as a number of recipes for density estimation and regression estimation. To make the theory comprehensive it was necessary to find a general principle for constructing and analysing various nonparametric algorithms. In 1978 this principle was found (Vapnik and Stefanyuk, 1978).
By definition, a density p(x) (if it exists) is the solution of the integral equation
$$ \int_{-\infty}^{x} p(t)\, dt = F(x), \qquad (1.10) $$

where F(x) is a probability distribution function. (Recall that in the theory


of probabilities one first determines the probability distribution function,
and then only if the distribution function is absolutely continuous, can one
define the density function.)
The general formulation of the density estimation problem can be described as follows: in the given set of functions {p(t)}, find one that is a solution to the integral equation (1.10) for the case where the probability distribution function F(x) is unknown, but we are given the i.i.d. data x₁, ..., x_ℓ, ... obtained according to the unknown distribution function.
Using these data one can construct a function that is very important in statistics: the so-called empirical distribution function (Fig. 1.2)
$$ F_\ell(x) = \frac{1}{\ell} \sum_{i=1}^{\ell} \theta(x - x_i), $$
where θ(u) is the step function: it takes the value 1 if u ≥ 0 and 0 otherwise.


The uniform convergence
$$ \sup_x |F(x) - F_\ell(x)| \xrightarrow[\ell \to \infty]{P} 0 $$

of the empirical distribution function F_ℓ(x) to the desired function F(x)
constitutes one of the most fundamental facts of theoretical statistics. We
will discuss this fact several times, in the comments on Chapter 2 and in
the comments on Chapter 3.
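This convergence is easy to observe numerically. The following minimal sketch assumes a uniform law on [0, 1] (so that F(x) = x) and computes sup_x |F(x) − F_ℓ(x)| for growing sample sizes; the distribution and the sample sizes are assumptions of the demo.

```python
import numpy as np

rng = np.random.default_rng(4)

def sup_deviation(l):
    # sup_x |F(x) - F_l(x)| for F(x) = x on [0, 1]; for a sorted sample the
    # supremum is attained at the jump points of the empirical distribution
    x = np.sort(rng.uniform(size=l))
    i = np.arange(1, l + 1)
    return max(np.max(i / l - x), np.max(x - (i - 1) / l))

for l in [10, 100, 1000, 10000]:
    print(l, sup_deviation(l))    # the uniform deviation shrinks as l grows
```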
Thus, the general setting of the density estimation problem (coming from
the definition of a density) is the following:

FIGURE 1.2. The empirical distribution function F_ℓ(x), constructed from the
data x₁, ..., x_ℓ, approximates the probability distribution function F(x).

Solve the integral equation (1.10) in the case where the probability distribution function is unknown, but the i.i.d. data x₁, ..., x_ℓ, ... obtained in accordance with this function are given.
Using these data one can construct the empirical distribution function
F_ℓ(x). Therefore one has to solve the integral equation (1.10) for the case
where instead of the exact right-hand side, one knows an approximation
that converges uniformly to the unknown function as the number of obser-
vations increases.
Note that the problem of solving this integral equation in a wide class of functions {p(t)} is ill-posed. This brings us to two conclusions:

(i) Generally speaking, the estimation of a density is a hard (ill-posed) computational problem.

(ii) To solve this problem well one has to use regularization (i.e., not "self-evident") techniques.

It was shown that all proposed nonparametric algorithms can be obtained


using standard regularization techniques (with different types of regulariz-
ers) and using the empirical distribution function instead of the unknown
one (Vapnik, 1979, 1988).

1.9 MAIN PRINCIPLE FOR SOLVING PROBLEMS USING A RESTRICTED AMOUNT OF INFORMATION

We now formulate the main principle for solving problems using a restricted
amount of information:
When solving a given problem, try to avoid solving a more general problem
as an intermediate step.
Although this principle is obvious, it is not easy to follow it. For our
problems of dependency estimation this principle means that to solve the
problem of pattern recognition or regression estimation, one must try to
find the desired function "directly" (in the next section we will specify what
this means) rather than first estimating the densities and then using the
estimated densities to construct the desired function.
Note that estimation of densities is a universal problem of statistics
(knowing the densities one can solve various problems). Estimation of den-
sities in general is an ill-posed problem, therefore it requires a lot of ob-
servations in order to be solved well. In contrast, the problems which we
really need to solve (decision rule estimation or regression estimation) are
quite particular ones; often they can be solved on the basis of a reasonable
number of observations.
To illustrate this idea let us consider the following situation. Suppose one
wants to construct a decision rule separating two sets of vectors described
by two normal laws, N(μ₁, Σ₁) and N(μ₂, Σ₂). In order to construct the discriminant rule (1.9) one has to estimate from the data two n-dimensional vectors, the means μ₁ and μ₂, and two n × n covariance matrices Σ₁ and Σ₂. As a result one obtains a separating polynomial of degree two,
$$ f(x) = \operatorname{sign}\Big\{ -\tfrac{1}{2}(x - \mu_1)^T \Sigma_1^{-1}(x - \mu_1) + \tfrac{1}{2}(x - \mu_2)^T \Sigma_2^{-1}(x - \mu_2) + \tfrac{1}{2}\ln\frac{|\Sigma_2|}{|\Sigma_1|} + \ln\frac{q_1}{1 - q_1} \Big\}, $$
containing n(n + 3)/2 coefficients. To construct a good discriminant rule


from the parameters of the unknown normal densities one needs to estimate
the parameters of the covariance matrices with high accuracy, since the
discriminant function uses inverse covariance matrices (in general the esti-
mation of a density is an ill-posed problem; for our parametric case it can
give ill-conditioned covariance matrices). To estimate the high-dimensional
covariance matrices well one needs an unpredictably large (depending on
the properties of the actual covariance matrices) number of observations.
Therefore, in high-dimensional spaces the general normal discriminant func-
tion (constructed from two different normal densities) seldom succeeds in

practice. In practice one uses the linear discriminant function that occurs when the two covariance matrices coincide, Σ₁ = Σ₂ = Σ:
$$ f(x) = \operatorname{sign}\left\{ (\mu_1 - \mu_2)^T \Sigma^{-1} x - \tfrac{1}{2}(\mu_1^T \Sigma^{-1} \mu_1) + \tfrac{1}{2}(\mu_2^T \Sigma^{-1} \mu_2) + \ln\frac{q_1}{1 - q_1} \right\} $$
(in this case one has to estimate only n parameters of the discriminant
function).
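The plug-in construction of this linear discriminant function is a few lines of Python; the simulated Gaussian data, the pooled covariance estimate, and the value q₁ = 1/2 are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5
X1 = rng.normal(loc=+1.0, size=(200, n))      # data of the first category
X2 = rng.normal(loc=-1.0, size=(200, n))      # data of the second category
q1 = 0.5

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)   # estimated means
S = (np.cov(X1.T) + np.cov(X2.T)) / 2.0       # pooled estimate of the common Sigma
S_inv = np.linalg.inv(S)

def f(x):
    # the linear discriminant function written out above
    g = ((mu1 - mu2) @ S_inv @ x
         - 0.5 * (mu1 @ S_inv @ mu1)
         + 0.5 * (mu2 @ S_inv @ mu2)
         + np.log(q1 / (1.0 - q1)))
    return np.sign(g)

print(f(np.ones(n)), f(-np.ones(n)))          # expected output: 1.0 -1.0
```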
It is remarkable that Fisher suggested using the linear discriminant function even if the two covariance matrices were different, and proposed a heuristic method for constructing such functions⁵ (Fisher, 1952).
In Chapter 5 we solve a specific pattern recognition problem by con-
structing separating polynomials (up to degree 7) in high-dimensional (256)
space. This is accomplished only by avoiding the solution of unnecessarily
general problems.

1.10 MODEL MINIMIZATION OF THE RISK BASED ON EMPIRICAL DATA

In what follows we argue that the setting of learning problems given in this chapter not only allows us to consider estimation problems in any given set of functions, but also allows us to implement the main principle for using small samples: avoiding the solution of unnecessarily general problems.

1.10.1 Pattern Recognition


For the pattern recognition problem, the functional (1.2) evaluates the
probability of error for any function of the admissible set of functions. The
problem is to use the sample to find the function from the set of admissible
functions that minimizes the probability of error. This is exactly what we
want to obtain.

1.10.2 Regression Estimation


In regression estimation we minimize functional (1.2) with loss-function
(1.4). This functional can be rewritten in the equivalent form

$$ R(\alpha) = \int (y - f(x, \alpha))^2\, dF(x, y) $$

⁵In the 1960s the problem of constructing the best linear discriminant function (in the case where a quadratic function is optimal) was solved (Andersen and Bahadur, 1966). For solving real-life problems, linear discriminant functions are usually used even if it is known that the optimal solution belongs to the quadratic discriminant functions.

$$ = \int (f(x, \alpha) - f_0(x))^2\, dF(x) + \int (y - f_0(x))^2\, dF(x, y), \qquad (1.11) $$

where f₀(x) is the regression function. Note that the second term in Eq. (1.11) does not depend on the chosen function. Therefore, minimizing this functional is equivalent to minimizing the functional
$$ R^*(\alpha) = \int (f(x, \alpha) - f_0(x))^2\, dF(x). $$
The last functional equals the squared L₂(F) distance between a function of the set of admissible functions and the regression. Therefore, we consider the following problem: using the sample, find in the admissible set of functions the one closest to the regression (in the L₂(F) metric).
If one accepts the L₂(F) metric, then the formulation of the regression estimation problem (minimizing R(α)) is direct. (It does not require solving a more general problem, for example, finding F(x, y).)

1.10.3 Density Estimation


Finally, consider the functional
$$ R(\alpha) = -\int \ln p(t, \alpha)\, dF(t) = -\int p_0(t) \ln p(t, \alpha)\, dt. $$
Let us add to this functional a constant (a functional that does not depend on the approximating functions),
$$ C = \int \ln p_0(t)\, dF(t), $$
where p₀(t) and F(t) are the desired density and its probability distribution function. We obtain
$$ R^*(\alpha) = -\int \ln p(t, \alpha)\, dF(t) + \int \ln p_0(t)\, dF(t) = -\int \ln\frac{p(t, \alpha)}{p_0(t)}\, p_0(t)\, dt. $$

The expression on the right-hand side is the so-called Kullback-Leibler distance, which is used in statistics for measuring the distance between an approximation of a density and the actual density. Therefore, we consider the following problem: in the set of admissible densities, find the density closest to the desired one in the Kullback-Leibler distance, using the given sample. If one accepts the Kullback-Leibler distance, then the formulation of the problem is direct.
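For intuition, the Kullback-Leibler distance can be approximated by Monte Carlo using a sample from the true density. In the following sketch both densities are normal by assumption, so the exact value (μ − μ₀)²/2 is available for comparison.

```python
import numpy as np

rng = np.random.default_rng(6)
t = rng.normal(size=200000)                  # sample from p_0 = N(0, 1)

def log_normal(t, mu, sigma):
    return -0.5 * ((t - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

def kl_from_truth(mu, sigma):
    # KL distance = integral of ln(p_0(t) / p(t, a)) p_0(t) dt, by Monte Carlo
    return np.mean(log_normal(t, 0.0, 1.0) - log_normal(t, mu, sigma))

print(kl_from_truth(0.0, 1.0))               # ~ 0.0 (distance to p_0 itself)
print(kl_from_truth(1.0, 1.0))               # ~ 0.5 = (1 - 0)^2 / 2
```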
The short form of the setting of all these problems is the general model
of minimizing the risk functional on the basis of empirical data.

1.11 STOCHASTIC APPROXIMATION INFERENCE


To minimize the risk functional on the basis of empirical data, we considered in Chapter 1 the empirical risk minimization inductive principle. Here we discuss another general inductive principle, the so-called stochastic approximation method suggested in the 1950s by Robbins and Monro (Robbins and Monro, 1951).
According to this principle, to minimize the functional
$$ R(\alpha) = \int Q(z, \alpha)\, dF(z) $$
with respect to the parameters α using the i.i.d. data
$$ z_1, \ldots, z_\ell, $$
one uses the following iterative procedure:
$$ \alpha(k + 1) = \alpha(k) - \gamma_k\, \operatorname{grad}_\alpha Q(z_k, \alpha(k)), \quad k = 1, 2, \ldots, \ell, \qquad (1.12) $$

where the number of steps is equal to the number of observations. It was


proven that this method is consistent under very general conditions on the
gradient grad_α Q(z, α) and the values γ_k.
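A minimal sketch of the procedure (1.12) for the squared loss Q(z, α) = (y − αᵀx)² and a linear model; the noise level, the target parameters, and the choice γ_k = 1/k are assumptions of the illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
a_true = np.array([1.0, -2.0])               # target parameters (assumed)
a = np.zeros(2)

for k in range(1, 50001):
    x = rng.normal(size=2)
    y = a_true @ x + 0.1 * rng.normal()      # one new observation z_k = (x, y)
    grad = -2.0 * (y - a @ x) * x            # grad_a Q(z_k, a(k))
    a = a - (1.0 / k) * grad                 # a(k+1) = a(k) - gamma_k * grad
print(a)                                     # close to a_true
```

Note that each observation is used exactly once, which is the "wasteful" regime discussed below.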

Inspired by Novikoff's theorem, M.A. Aizerman and Ya.Z. Tsypkin started


the discussions on consistency of learning processes in 1963 at the seminars
of the Moscow Institute of Control Science. Two general inductive princi-
ples that ensure consistency of learning processes were under investigation:
(i) Principle of stochastic approximation, and

(ii) principle of empirical risk minimization.


Both inductive principles were applied to the general problem of mini-
mizing the risk functional (1.6) using empirical data. As a result, by 1971
two different types of general learning theories were created:

(i) The general asymptotic learning theory for stochastic approximation inductive inference⁶ (Aizerman, Braverman, and Rozonoer, 1964, 1965, 1970), (Tsypkin, 1971, 1973).

(ii) The general nonasymptotic theory of pattern recognition for ERM


inductive inference (Vapnik and Chervonenkis, 1968, 1971, 1974). (By
1979 this theory was generalized for any problem of minimization of
the risk on the basis of empirical data (Vapnik, 1979).)

⁶In 1967 this theory was also suggested by S. Amari (Amari, 1967).

The stochastic approximation principle is, however, too wasteful: it uses


one element of the training data per step (see Eq. (1.12)). To make it more
economical, one uses the training data many times (using many epochs).
In this case the following question arises immediately:
When does one have to stop the training process?
Two ideas are possible:

(i) When for any element of the training data the gradient is so small
that the learning process cannot be continued.

(ii) When the learning process is not saturated but satisfies some stopping
criterion.
It is easy to see that in the first case the stochastic approximation method is just a special way of minimizing the empirical risk. The second case constitutes a regularization method of minimizing the risk functional.⁷ Therefore, in the "non-wasteful regimes" the stochastic approximation method can be explained either by the inductive properties of the ERM method or by the inductive properties of the regularization method.
To complete the discussion on classical inductive inferences it is neces-
sary to consider Bayesian inference. In order to use this inference one must
possess additional a priori information complementary to the set of para-
metric functions containing the desired one. Namely, one must know the
distribution function that describes the probability for any function from
the admissible set of functions to be the desired one. Therefore, Bayesian
inference is based on using strong a priori information (it requires that the
desired function belongs to the set of functions of the learning machine).
In this sense it does not define a general way for inference. We will discuss
this inference later in the comments on Chapter 4.

Thus, along with the ERM inductive principle one can use other inductive
principles. However, the ERM principle (compared to other ones) looks
more robust (it uses empirical data better, it does not depend on a priori
information, and there are clear ways to implement it).
Therefore, in the analysis of learning processes, the key problem became
exploring the ERM principle.

⁷The regularizing property of the stopping criterion in iterative procedures for solving ill-posed problems was observed in the 1950s, even before the regularization theory for solving ill-posed problems was developed.
Chapter 2
Consistency of Learning Processes

The goal of this part of the theory is to describe the conceptual model
for learning processes that are based on the Empirical Risk Minimization
inductive principle. This part of the theory has to explain when a learning
machine that minimizes empirical risk can achieve a small value of actual
risk (can generalize) and when it can not. In other words, the goal of this
part is to describe the necessary and sufficient conditions for the consistency
of learning processes that minimizes the empirical risk.
The following question arises:

Why do we need an asymptotic theory (consistency is an asymptotic con-


cept) if the goal is to construct algorithms for learning from a limited num-
ber of observations?

The answer is:


To construct any theory one has to use some concepts in terms of which
the theory is developed. It is extremely important to use concepts that
describe necessary and sufficient conditions for consistency. This guarantees
that the constructed theory is general and cannot be improved from the
conceptual point of view.
The most important issue in this chapter is the concept of the VC entropy
of a set of functions in terms of which the necessary and sufficient conditions
for consistency of learning processes are described.
Using this concept we will obtain in the next chapter the quantitative
characteristics on the rate of the learning process that we will use later for
constructing learning algorithms.

FIGURE 2.1. The learning process is consistent if both the expected risks R(α_ℓ)
and the empirical risks R_emp(α_ℓ) converge to the minimal possible value of the
risk, inf_{α∈Λ} R(α).

2.1 THE CLASSICAL DEFINITION OF CONSISTENCY AND THE CONCEPT OF NONTRIVIAL CONSISTENCY

Let Q(z, α_ℓ) be a function that minimizes the empirical risk functional
$$ R_{\mathrm{emp}}(\alpha) = \frac{1}{\ell} \sum_{i=1}^{\ell} Q(z_i, \alpha) $$
for a given set of i.i.d. observations z₁, ..., z_ℓ.

Definition. We say that the principle (method) of ERM is consistent for the set of functions Q(z, α), α ∈ Λ, and for the probability distribution function F(z) if the following two sequences converge in probability to the same limit (see the schematic Fig. 2.1):
$$ R(\alpha_\ell) \xrightarrow[\ell \to \infty]{P} \inf_{\alpha \in \Lambda} R(\alpha), \qquad (2.1) $$
$$ R_{\mathrm{emp}}(\alpha_\ell) \xrightarrow[\ell \to \infty]{P} \inf_{\alpha \in \Lambda} R(\alpha). \qquad (2.2) $$
In other words, the ERM method is consistent if it provides a sequence of functions Q(z, α_ℓ), ℓ = 1, 2, ..., for which both the expected risk and the empirical risk converge to the minimal possible value of the risk. Equation (2.1) asserts that the values of the achieved risks converge to the best possible value, while Eq. (2.2) asserts that one can estimate the minimal possible value of the risk on the basis of the values of the empirical risk.
The goal of this chapter is to describe conditions of consistency for the
ERM method. We would like to obtain these conditions in terms of general
characteristics of the set of functions and the probability measure.

FIGURE 2.2. A case of trivial consistency. The ERM method is inconsistent
on the set of functions Q(z, α), α ∈ Λ, and consistent on the set of functions
{φ(z)} ∪ Q(z, α), α ∈ Λ.

Unfortunately, for the classical definition of consistency given above, ob-


taining such conditions is impossible since this definition includes cases of
trivial consistency.
What is a trivial case of consistency?
Suppose we have established that for some set of functions Q(z, α), α ∈ Λ, the ERM method is not consistent. Consider an extended set of functions that includes this set of functions and one additional function, φ(z). Suppose that the additional function satisfies the inequality
$$ \inf_{\alpha \in \Lambda} Q(z, \alpha) > \phi(z), \quad \forall z. $$
It is clear (Fig. 2.2) that for the extended set of functions (containing φ(z)) the ERM method will be consistent. Indeed, for any distribution function and for any number of observations, the minimum of the empirical risk will be attained on the function φ(z), which also gives the minimum of the expected risk.
This example shows that there exist trivial cases of consistency which
depend on whether the given set of functions contains a minorizing function.
Therefore, any theory of consistency that uses the classical definition
must determine whether a case of trivial consistency is possible. That means
that the theory should take into account the specific functions in the given
set.

In order to create a theory of consistency of the ERM method which


would not depend on the properties of the elements of the set of functions,
but would depend only on the general properties (capacity) of this set of
functions, we need to adjust the definition of consistency to exclude the
trivial consistency cases.
Definition. We say that the ERM method is nontrivially consistent for the set of functions Q(z, α), α ∈ Λ, and the probability distribution function F(z) if for any nonempty subset Λ(c), c ∈ (−∞, ∞), of this set of functions, defined as
$$ \Lambda(c) = \left\{ \alpha : \int Q(z, \alpha)\, dF(z) > c,\ \alpha \in \Lambda \right\}, $$
the convergence
$$ \inf_{\alpha \in \Lambda(c)} R_{\mathrm{emp}}(\alpha) \xrightarrow[\ell \to \infty]{P} \inf_{\alpha \in \Lambda(c)} R(\alpha) \qquad (2.3) $$
is valid.
In other words, the ERM is nontrivially consistent if it provides conver-
gence (2.3) for the subset of functions that remain after the functions with
the smallest values of the risks are excluded from this set.
Note that in the classical definition of consistency described in the previous section one uses two conditions, (2.1) and (2.2). In the definition of nontrivial consistency one uses only one condition, (2.3). It can be shown that condition (2.1) will be satisfied automatically under the condition of nontrivial consistency.
In this chapter we will study conditions for nontrivial consistency which
for simplicity we will call consistency.

2.2 THE KEY THEOREM OF LEARNING THEORY

The Key Theorem of learning theory is the following (Vapnik and Chervonenkis, 1989):

Theorem 2.1. Let Q(z, α), α ∈ Λ, be a set of functions that satisfy the condition
$$ A \leq \int Q(z, \alpha)\, dF(z) \leq B \qquad (A \leq R(\alpha) \leq B). $$

Then for the ERM principle to be consistent, it is necessary and sufficient that the empirical risk R_emp(α) converge uniformly to the actual risk R(α) over the set Q(z, α), α ∈ Λ, in the following sense:
$$ \lim_{\ell \to \infty} P\left\{ \sup_{\alpha \in \Lambda} \left( R(\alpha) - R_{\mathrm{emp}}(\alpha) \right) > \varepsilon \right\} = 0, \quad \forall \varepsilon > 0. \qquad (2.4) $$

We call this type of uniform convergence uniform one-sided convergence.¹
In other words, according to the Key Theorem, consistency of the ERM principle is equivalent to the existence of uniform one-sided convergence (2.4).
From the conceptual point of view this theorem is extremely important because it asserts that the conditions for consistency of the ERM principle are necessarily (and sufficiently) determined by the "worst" (in the sense of (2.4)) function of the set of functions Q(z, α), α ∈ Λ. In other words, according to this theorem, any analysis of the ERM principle must be a "worst-case analysis."²

2.2.1 Remark on the ML Method


As has been shown in Chapter 1, the ERM principle encompasses the ML method. However, for the ML method we define another concept of nontrivial consistency.

Definition. We say that the ML method is nontrivially consistent if for any density p(x, α₀) from the given set of densities p(x, α), α ∈ Λ, the convergence in probability
$$ \inf_{\alpha \in \Lambda} \frac{1}{\ell} \sum_{i=1}^{\ell} (-\ln p(x_i, \alpha)) \xrightarrow[\ell \to \infty]{P} \inf_{\alpha \in \Lambda} \int (-\ln p(x, \alpha))\, p(x, \alpha_0)\, dx $$
is valid, where x₁, ..., x_ℓ is an i.i.d. sample obtained according to the density p(x, α₀).
In other words we define the ML method to be nontrivially consistent if it
is consistent for estimating any density from the admissible set of densities.
For the ML method the following Key Theorem is true (Vapnik and
Chervonenkis, 1989):
Theorem 2.2. For the ML method to be nontrivially consistent on the set of densities
$$ 0 < a \leq p(x, \alpha) \leq A < \infty, \quad \alpha \in \Lambda, $$

¹In contrast to the so-called uniform two-sided convergence defined by the equation
$$ \lim_{\ell \to \infty} P\left\{ \sup_{\alpha \in \Lambda} |R(\alpha) - R_{\mathrm{emp}}(\alpha)| > \varepsilon \right\} = 0, \quad \forall \varepsilon > 0. $$

²The following fact confirms the importance of this theorem. Toward the end of the 1980s and the beginning of the 1990s several alternative approaches to learning theory were attempted, based on the idea that statistical learning theory is a theory of "worst-case analysis." In these approaches the authors expressed a hope of developing a learning theory for "real-case analysis." According to the Key Theorem, this type of theory for the ERM principle is impossible.

it is necessary and sufficient that uniform one-sided convergence take place for the set of risk functions
$$ Q(x, \alpha) = -\ln p(x, \alpha), \quad \alpha \in \Lambda, $$
with respect to some (any) probability density p(x, α₀), α₀ ∈ Λ.

2.3 NECESSARY AND SUFFICIENT CONDITIONS FOR UNIFORM TWO-SIDED CONVERGENCE

The Key Theorem of learning theory replaced the problem of the consistency of the ERM method with the problem of uniform convergence (Eq. (2.4)). To investigate the necessary and sufficient conditions for uniform convergence, one considers two stochastic processes, which are called empirical processes.
Consider the sequence of random variables
$$ \xi_\ell = \sup_{\alpha \in \Lambda} \left| \int Q(z, \alpha)\, dF(z) - \frac{1}{\ell} \sum_{i=1}^{\ell} Q(z_i, \alpha) \right|, \quad \ell = 1, 2, \ldots \qquad (2.5) $$
We call this sequence of random variables, which depends both on the probability measure F(z) and on the set of functions Q(z, α), α ∈ Λ, a two-sided empirical process. The problem is to describe conditions under which this empirical process converges in probability to zero. Convergence in probability of the process (2.5) means that the equality
$$ \lim_{\ell \to \infty} P\left\{ \sup_{\alpha \in \Lambda} \left| \int Q(z, \alpha)\, dF(z) - \frac{1}{\ell} \sum_{i=1}^{\ell} Q(z_i, \alpha) \right| > \varepsilon \right\} = 0, \quad \forall \varepsilon > 0 \qquad (2.6) $$

holds true.
Along with the empirical process ξ_ℓ, we consider the one-sided empirical process given by the sequence of random variables
$$ \xi_\ell^* = \sup_{\alpha \in \Lambda} \left( \int Q(z, \alpha)\, dF(z) - \frac{1}{\ell} \sum_{i=1}^{\ell} Q(z_i, \alpha) \right)_{+}, \quad \ell = 1, 2, \ldots, \qquad (2.7) $$
where we denote
$$ (u)_+ = \begin{cases} u & \text{if } u > 0, \\ 0 & \text{otherwise.} \end{cases} $$
The problem is to describe conditions under which the sequence of random variables ξ_ℓ* converges in probability to zero. Convergence in probability of the process (2.7) means that the equality
$$ \lim_{\ell \to \infty} P\{ \xi_\ell^* > \varepsilon \} = 0, \quad \forall \varepsilon > 0, \qquad (2.8) $$
holds true. According to the Key Theorem, the uniform one-sided convergence (2.8) is a necessary and sufficient condition for consistency of the ERM method.
We will see that the conditions for uniform two-sided convergence play an important role in constructing the conditions for uniform one-sided convergence.

2.3.1 Remark on the Law of Large Numbers and its Generalization


Note that if the set of functions Q(z, α), α ∈ Λ, contains only one element, then the sequence of random variables ξ_ℓ defined in Eq. (2.5) always converges in probability to zero. This fact constitutes the main law of statistics, the Law of Large Numbers:

The sequence of means of a random variable converges to its expectation as the number ℓ of observations increases.

It is easy to generalize the Law of Large Numbers to the case where the set of functions has a finite number of elements:

The sequence of random variables ξ_ℓ converges in probability to zero if the set of functions Q(z, α), α ∈ Λ, contains a finite number N of elements.

This case can be interpreted as the Law of Large Numbers in N-dimensional vector space (to each function in the set there corresponds one coordinate; the Law of Large Numbers in a vector space asserts convergence in probability simultaneously for all coordinates).
The problem arises when the set of functions Q(z, α), α ∈ Λ, has an infinite number of elements. In contrast to the case of a finite number of elements, the sequence of random variables ξ_ℓ for a set with an infinite number of elements does not necessarily converge to zero. The problem is:

To describe the properties of the set of functions Q(z, α), α ∈ Λ, and of the probability measure F(z) under which the sequence of random variables ξ_ℓ converges in probability to zero.

In this case one says that the Law of Large Numbers in the functional space (the space of functions Q(z, α), α ∈ Λ) takes place, or that there exists uniform (two-sided) convergence of the means to their expectations over the given set of functions.
Thus, the problem of the existence of the Law of Large Numbers in functional space (uniform two-sided convergence of the means to their expectations) can be considered as a generalization of the classical Law of Large Numbers.
Note that in classical statistics the problem of the existence of uniform one-sided convergence was not considered; it became important due to the

Key Theorem, which points the way for the analysis of the problem of consistency of the ERM inductive principle.
The necessary and sufficient conditions for both uniform one-sided convergence and uniform two-sided convergence are obtained on the basis of a concept called the entropy of the set of functions Q(z, α), α ∈ Λ, on a sample of size ℓ.
For simplicity we will introduce this concept in two steps: first for the set of indicator functions (which take only the two values 0 and 1) and then for the set of real bounded functions.

2.3.2 Entropy of the Set of Indicator Functions


Let Q(z, α), α ∈ Λ, be a set of indicator functions. Consider a sample
$$ z_1, \ldots, z_\ell. $$
Let us characterize the diversity of the set of functions Q(z, α), α ∈ Λ, on the given set of data by the quantity N^Λ(z₁, ..., z_ℓ) that evaluates how many different separations of the given sample can be obtained using functions from the set of indicator functions.
Let us write this in a more formal way. Consider the set of ℓ-dimensional binary vectors
$$ q(\alpha) = (Q(z_1, \alpha), \ldots, Q(z_\ell, \alpha)), \quad \alpha \in \Lambda, $$
that one obtains when α takes various values from Λ. Then, geometrically speaking, N^Λ(z₁, ..., z_ℓ) is the number of different vertices of the ℓ-dimensional cube that can be obtained on the basis of the sample z₁, ..., z_ℓ and the set of functions Q(z, α), α ∈ Λ (Fig. 2.3).
Let us call the value
$$ H^\Lambda(z_1, \ldots, z_\ell) = \ln N^\Lambda(z_1, \ldots, z_\ell) $$
the random entropy. The random entropy describes the diversity of the set of functions on the given data. H^Λ(z₁, ..., z_ℓ) is a random variable since it was constructed using the i.i.d. data. Now we consider the expectation of the random entropy over the joint distribution function F(z₁, ..., z_ℓ):
$$ H^\Lambda(\ell) = E \ln N^\Lambda(z_1, \ldots, z_\ell). $$
We call this quantity the entropy of the set of indicator functions Q(z, α), α ∈ Λ, on samples of size ℓ. It depends on the set of functions Q(z, α), α ∈ Λ, the probability measure, and the number of observations ℓ, and it describes the expected diversity of the given set of indicator functions on
the sample of size ℓ.
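The quantity N^Λ(z₁, ..., z_ℓ) and the random entropy are easy to compute for very simple sets of functions. The following sketch does so for one-dimensional threshold indicator functions Q(z, α) = 1{z ≤ α}; this choice of function set is an assumption made only for the illustration.

```python
import numpy as np

def random_entropy(z):
    # ln N(z_1, ..., z_l) for Q(z, a) = 1{z <= a}: one threshold below the
    # sample, one between each pair of neighbors, and one above the sample
    # realize all achievable separations of the sample.
    z = np.sort(z)
    thresholds = np.concatenate(([z[0] - 1.0], (z[:-1] + z[1:]) / 2.0, [z[-1] + 1.0]))
    dichotomies = {tuple((z <= a).astype(int)) for a in thresholds}
    return np.log(len(dichotomies))

z = np.random.default_rng(8).uniform(size=20)
print(random_entropy(z), np.log(len(z) + 1.0))   # both equal ln(l + 1)
```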

FIGURE 2.3. The set of ℓ-dimensional binary vectors q(α), α ∈ Λ, is a subset of
the set of vertices of the ℓ-dimensional unit cube.

2.3.3 Entropy of the Set of Real Functions


Now we generalize the definition of the entropy from the set of indicator functions to a set of real functions on samples of size ℓ.

Definition. Let A ≤ Q(z, α) ≤ B, α ∈ Λ, be a set of bounded loss functions. Using this set of functions and the training set z₁, ..., z_ℓ, one can construct the following set of ℓ-dimensional vectors:
$$ q(\alpha) = (Q(z_1, \alpha), \ldots, Q(z_\ell, \alpha)), \quad \alpha \in \Lambda. \qquad (2.9) $$
This set of vectors belongs to the ℓ-dimensional cube (Fig. 2.4) and has a finite minimal ε-net in the metric³ C (or in the metric L_p). Let N = N^Λ(ε; z₁, ..., z_ℓ)

³The set of vectors q(α), α ∈ Λ, has a minimal ε-net q(α₁), ..., q(α_N) if:

(i) There exist N = N^Λ(ε; z₁, ..., z_ℓ) vectors q(α₁), ..., q(α_N) such that for any vector q(α*), α* ∈ Λ, one can find among these N vectors one, q(α_r), that is ε-close to q(α*) (in a given metric). For the metric C this means
$$ \rho(q(\alpha^*), q(\alpha_r)) = \max_{1 \leq i \leq \ell} |Q(z_i, \alpha^*) - Q(z_i, \alpha_r)| \leq \varepsilon. $$

(ii) N is the minimal number of vectors that possess this property.



FIGURE 2.4. The set of ℓ-dimensional vectors q(α), α ∈ Λ, belongs to an
ℓ-dimensional cube.

be the number of elements of the minimal ε-net of this set of vectors q(α), α ∈ Λ.
Note that N^Λ(ε; z₁, ..., z_ℓ) is a random variable since it was constructed using the random vectors z₁, ..., z_ℓ. The logarithm of the random value N^Λ(ε; z₁, ..., z_ℓ),
$$ H^\Lambda(\varepsilon; z_1, \ldots, z_\ell) = \ln N^\Lambda(\varepsilon; z_1, \ldots, z_\ell), $$
is called the random VC entropy of the set of functions A ≤ Q(z, α) ≤ B on the sample z₁, ..., z_ℓ. The expectation of the random VC entropy,
$$ H^\Lambda(\varepsilon; \ell) = E H^\Lambda(\varepsilon; z_1, \ldots, z_\ell), $$
is called the VC entropy⁴ of the set of functions A ≤ Q(z, α) ≤ B, α ∈ Λ, on a sample of size ℓ. Here the expectation is taken with respect to the product measure F(z₁, ..., z_ℓ).

⁴The VC entropy differs from the classical metrical ε-entropy
$$ H^\Lambda(\varepsilon) = \ln N^\Lambda(\varepsilon) $$
in the following respect: N^Λ(ε) is the cardinality of the minimal ε-net of the set of functions Q(z, α), α ∈ Λ, while the VC entropy is the expectation of the diversity of the set of functions on a sample of size ℓ.

Note that the given definition of the entropy of a set of real functions is a generalization of the definition of the entropy given for a set of indicator functions. Indeed, for a set of indicator functions the minimal ε-net for ε < 1 does not depend on ε and is a subset of the vertices of the unit cube. Therefore, for ε < 1,
$$ N^\Lambda(\varepsilon; z_1, \ldots, z_\ell) = N^\Lambda(z_1, \ldots, z_\ell), \qquad H^\Lambda(\varepsilon; \ell) = H^\Lambda(\ell). $$
Below we formulate the theory for the set of bounded real functions. The general results obtained are, of course, valid for the set of indicator functions.

2.3.4 Conditions for Uniform Two-Sided Convergence


Under some (technical) conditions of measurability on the set of functions Q(z, α), α ∈ Λ, the following theorem is true.

Theorem 2.3. For uniform two-sided convergence (Eq. (2.6)) it is necessary and sufficient that the equality
$$ \lim_{\ell \to \infty} \frac{H^\Lambda(\varepsilon; \ell)}{\ell} = 0, \quad \forall \varepsilon > 0, \qquad (2.10) $$
be valid.

In other words, the ratio of the VC entropy to the number of observations should decrease to zero as the number of observations increases.

Corollary. Under some conditions of measurability on the set of indicator functions Q(z, α), α ∈ Λ, the necessary and sufficient condition for uniform two-sided convergence is
$$ \lim_{\ell \to \infty} \frac{H^\Lambda(\ell)}{\ell} = 0, $$
which is a particular case of equality (2.10).
This condition for uniform two-sided convergence was obtained in 1968 (Vapnik and Chervonenkis, 1968, 1971). The generalization of this result to bounded sets of functions (Theorem 2.3) was found in 1981 (Vapnik and Chervonenkis, 1981).

2.4 NECESSARY AND SUFFICIENT CONDITIONS FOR UNIFORM ONE-SIDED CONVERGENCE

Uniform two-sided convergence can be described as follows:
$$ \lim_{\ell \to \infty} P\Big\{ \Big[ \sup_{\alpha \in \Lambda} (R(\alpha) - R_{\mathrm{emp}}(\alpha)) > \varepsilon \Big] \text{ or } \Big[ \sup_{\alpha \in \Lambda} (R_{\mathrm{emp}}(\alpha) - R(\alpha)) > \varepsilon \Big] \Big\} = 0. \qquad (2.11) $$
Condition (2.11) includes uniform one-sided convergence and therefore forms a sufficient condition for consistency of the ERM method. Note, however, that when solving learning problems we face a nonsymmetrical situation: we require consistency in minimizing the empirical risk, but we do not care about consistency with respect to maximizing the empirical risk. So for consistency of the ERM method the second condition on the left-hand side of Eq. (2.11) can be violated.
The next theorem describes a condition under which there is consistency in minimizing the empirical risk but not necessarily in maximizing the empirical risk (Vapnik and Chervonenkis, 1989).
Consider the set of bounded real functions Q(z, α), α ∈ Λ, together with a new set of functions Q*(z, α*), α* ∈ Λ*, satisfying some conditions of measurability as well as the following conditions: for any function from Q(z, α), α ∈ Λ, there exists a function in Q*(z, α*), α* ∈ Λ*, such that (Fig. 2.5)
$$ Q(z, \alpha) - Q^*(z, \alpha^*) \geq 0, \quad \forall z, $$
$$ \int (Q(z, \alpha) - Q^*(z, \alpha^*))\, dF(z) \leq \delta. \qquad (2.12) $$

FIGURE 2.5. For any function Q(z, α), α ∈ Λ, one considers a function
Q*(z, α*), α* ∈ Λ*, such that Q*(z, α*) does not exceed Q(z, α) and is close
to it.

Theorem 2.4. In order for uniform one-sided convergence of empirical means to their expectations to hold for the set of totally bounded functions Q(z, α), α ∈ Λ (Eq. (2.8)), it is necessary and sufficient that for any positive δ, η, and ε there exist a set of functions Q*(z, α*), α* ∈ Λ*, satisfying Eq. (2.12) such that the following holds for the ε-entropy of the set Q*(z, α*), α* ∈ Λ*, on samples of size ℓ:
$$ \lim_{\ell \to \infty} \frac{H^{\Lambda^*}(\varepsilon; \ell)}{\ell} < \eta. \qquad (2.13) $$

In other words, for uniform one-sided convergence on the set of bounded functions Q(z, α), α ∈ Λ, it is necessary and sufficient that there exist another set of functions Q*(z, α*), α* ∈ Λ*, that is close (in the sense of (2.12)) to Q(z, α), α ∈ Λ, such that for this new set of functions condition (2.13) is valid. Note that condition (2.13) is weaker than condition (2.10) in Theorem 2.3.
According to the Key Theorem, this is necessary and sufficient for consistency of the ERM method.

2.5 THEORY OF NONFALSIFIABILITY


From the formal point of view, Theorems 2.1, 2.3, and 2.4 give the conceptual model of learning based on the ERM inductive principle. However, both to prove Theorem 2.4 and to understand the nature of the ERM principle more deeply, we have to answer the following questions:
What happens if the condition of Theorem 2.4 is not valid?
Why is the ERM method not consistent in this case?
Below we show that if there exists an ε₀ such that
$$ \lim_{\ell \to \infty} \frac{H^\Lambda(\varepsilon_0; \ell)}{\ell} = c(\varepsilon_0) > 0, $$
then the learning machine with functions Q(z, α), α ∈ Λ, is faced with a situation that in the philosophy of science corresponds to a so-called nonfalsifiable theory.
Before we describe the formal part of the theory, let us remind the reader
what the idea of nonfalsifiability is.

2.5.1 Kant's Problem of Demarcation and Popper's Theory of Nonfalsifiability
Since the era of ancient philosophy, two models of reasoning have been
accepted:

(i) Deductive, which means moving from general to particular, and


(ii) inductive, which means moving from particular to general.
A model in which a system of axioms and inference rules are defined, by
means of which various corollaries (consequences) are obtained, is ideal for
the deductive approach. The deductive approach should guarantee that we
obtain true consequences from true premises.
The inductive approach to reasoning consists of the formation of gen-
eral judgments from particular assertions. However, the general judgments
obtained from true particular assertions are not always true. Nevertheless,
it is assumed that there exist such cases of inductive inference for which
generalization assertions are justified.
The demarcation problem, originally proposed by I. Kant, is a central
question of inductive theory:
What is the difference between the cases with a justified inductive step,
and those for which the inductive step is not justified?
The demarcation problem is usually discussed in terms of the philosophy of natural science. All theories in the natural sciences are the result of generalizations of observed real facts, and therefore theories are built using inductive inference. In the history of the natural sciences there have been both true theories that reflect reality (say, chemistry) and false ones (say, alchemy) that do not reflect reality. Sometimes it takes many years of experiments to prove that a theory is false.
The question is the following:
Is there a formal way to distinguish true theories from false theories?
Let us assume that meteorology is a true theory and astrology is a false
one. What is the formal difference between them?
(i) Is it in the complexity of their models?
(ii) Is it in the predictive ability of their models?
(iii) Is it in their use of mathematics?
(iv) Is it in the level of formality of inference?
None of the above gives a clear advantage to either of these two theories.
(i) The complexity of astrological models is no less than the complexity
of the meteorological models.

(ii) Both theories fail in some of their predictions.


(iii) Astrologers solve differential equations for restoration of the posi-
tions of the planets, which are no simpler than the basic equations in
meteorology.

(iv) Finally, in both theories, inference has the same level of formaliza-
tion. It contains two parts: the formal description of reality and the
informal interpretation of it.

In the 1930s, K. Popper suggested his famous criterion for demarcation


between true and false theories (Popper, 1968). According to Popper, a
necessary condition for justifiability of a theory is the feasibility of its fal-
sification. By the falsification of a theory, Popper means the existence of a
collection of particular assertions which cannot be explained by the given
theory although they fall into its domain. If the given theory can be falsified
it satisfies the necessary conditions of a scientific theory.

Let us come back to our example. Both meteorology and astrology make
weather forecasts. Consider the following assertion:

Once in New Jersey, in July there was a tropical rain storm and then
snowfall.

Suppose that according to the theory of meteorology this is impossible.

Then this assertion falsifies the theory, because if such a situation really happens (note that nobody can guarantee with probability one that this is impossible⁵) the theory will not be able to explain it. In this case the theory of meteorology satisfies the necessary conditions to be viewed as a scientific theory.
Suppose that this assertion can be explained by the theory of astrology.
(There are many elements in the starry sky, and they can be used to create
an explanation.) In this case, this assertion does not falsify the theory. If
there is no example that can falsify the theory of astrology, then astrology,
according to Popper, should be considered a non-scientific theory.
In the next section, we describe the theorems of nonfalsifiability. We show
that if for some set of functions conditions of uniform convergence do not
hold, the situation of nonfalsifiability will arise.

2.6 THEOREMS ABOUT NONFALSIFIABILITY

In the following, we show that if uniform two-sided convergence does not


take place, then the method of minimizing the empirical risk is nonfalsifi-
able.

⁵Recall Laplace's calculation of the conditional probability that the sun will rise tomorrow given that it rose every day up to this day. It will rise for sure according to the models that we use and in which we believe. However, with probability one we can assert only that the sun rose every day during the thousands of years of recorded history.

2.6.1 Case of Complete (Popper's) Nonfalsifiability


To give a clear explanation of why this happens, let us start with the simplest case. Recall that according to the definition of the VC entropy the following expressions are valid for a set of indicator functions:
$$ H^\Lambda(\varepsilon; \ell) = H^\Lambda(\ell) = E \ln N^\Lambda(z_1, \ldots, z_\ell), \qquad N^\Lambda(z_1, \ldots, z_\ell) \leq 2^\ell. $$
Suppose now that for the VC entropy of the set of indicator functions Q(z, α), α ∈ Λ, the following equality is true:
$$ \lim_{\ell \to \infty} \frac{H^\Lambda(\ell)}{\ell} = \ln 2. $$
It can be shown that the ratio of the entropy to the number of observations, H^Λ(ℓ)/ℓ, monotonically decreases as the number of observations ℓ increases.⁶ Therefore, if the limit of the ratio of the entropy to the number of observations tends to ln 2, then for any finite number ℓ the equality
$$ \frac{H^\Lambda(\ell)}{\ell} = \ln 2 $$
holds true.
This means that for almost all samples z₁, ..., z_ℓ (i.e., all but a set of measure zero) the equality
$$ N^\Lambda(z_1, \ldots, z_\ell) = 2^\ell $$
is valid.
In other words, the set of functions of the learning machine is such that almost any sample z₁, ..., z_ℓ (of arbitrary size ℓ) can be separated in all possible ways by functions of this set. This implies that the minimum of the empirical risk for this machine equals zero. We call this learning machine nonfalsifiable because it can give a general explanation (function) for almost any data (Fig. 2.6).
Note that the minimum value of the empirical risk is equal to zero independent of the value of the expected risk.

2.6.2 Theorem about Partial Nonfalsifiability


In the case where the entropy of the set of indicator functions over the
number of observations tends to a nonzero limit, the following theorem
shows that there exists some subspace of the original space Z E Z where
the learning machine is nonfalsifiable (Vapnik and Chervonenkis, 1989).

⁶This assertion is analogous to the assertion that the value of relative (with respect to the number of observations) information cannot increase with the number of observations.

FIGURE 2.6. A learning machine with the set of functions Q(z, α), α ∈ Λ, is
nonfalsifiable if for almost all samples z₁, ..., z_ℓ given by the generator of
examples, and for any possible labels δ₁, ..., δ_ℓ for these z's, the machine contains
a function Q(z, α*) that provides the equalities δ_i = Q(z_i, α*), i = 1, ..., ℓ.

Theorem 2.5. For the set of indicator functions Q(z, α), α ∈ Λ, let the convergence
$$ \lim_{\ell \to \infty} \frac{H^\Lambda(\ell)}{\ell} = c > 0 $$
be valid.
Then there exists a subset Z* of the set Z whose probability measure is
$$ P(Z^*) = a(c) \neq 0, $$
such that for the intersection
$$ z'_1, \ldots, z'_k = (z_1, \ldots, z_\ell) \cap Z^* $$
of almost any training set z₁, ..., z_ℓ with the set Z*, and for any given sequence of binary values
$$ \delta_1, \ldots, \delta_k, \quad \delta_i \in \{0, 1\}, $$
there exists a function Q(z, α*) for which the equalities
$$ \delta_i = Q(z'_i, \alpha^*), \quad i = 1, 2, \ldots, k, $$
hold true.

Thus, if the conditions for uniform two-sided convergence fail, then there
exists some subspace of the input space where the learning machine is
nonfalsifiable (Fig. 2.7).

FIGURE 2.7. A learning machine with the set of functions Q(z, α), α ∈ Λ, is
partially nonfalsifiable if there exists a region Z* ⊂ Z with nonzero measure such
that for almost all samples z₁, ..., z_ℓ given by the generator of examples, and for
any labels δ₁, ..., δ_ℓ for these z's, the machine contains a function Q(z, α*) that
provides the equalities δ_i = Q(z_i, α*) for all z_i belonging to the region Z*.

2.6.3 Theorem about Potential Nonfalsifiability


Now let us consider a set of uniformly bounded real functions,
$$ |Q(z, \alpha)| \leq C, \quad \alpha \in \Lambda. $$
For this set of functions a more sophisticated model of nonfalsifiability is valid. We give the following definition of nonfalsifiability:

Definition. We say that a learning machine that has an admissible set of real functions Q(z, α), α ∈ Λ, is potentially nonfalsifiable for a generator of inputs with a distribution F(z) if there exist two functions⁷
$$ \psi_1(z) \geq \psi_0(z) $$
such that:

(i) There exists a positive constant c for which the equality
$$ \int (\psi_1(z) - \psi_0(z))\, dF(z) = c > 0 $$
holds true (this equality shows that the two functions ψ₀(z) and ψ₁(z) are essentially different).

⁷These two functions do not necessarily belong to the set Q(z, α), α ∈ Λ.



(ii) For almost any sample
$$ z_1, \ldots, z_\ell, $$
any sequence of binary values
$$ \delta(1), \ldots, \delta(\ell), \quad \delta(i) \in \{0, 1\}, $$
and any ε > 0, one can find a function Q(z, α*) in the set of functions Q(z, α), α ∈ Λ, for which the inequalities
$$ |\psi_{\delta(i)}(z_i) - Q(z_i, \alpha^*)| \leq \varepsilon, \quad i = 1, \ldots, \ell, $$
hold true.

In this definition of nonfalsifiability we use two essentially different functions, ψ₁(z) and ψ₀(z), to generate the values y_i of the function for the given vectors z_i. To make these values arbitrary, one can switch between these two functions using an arbitrary rule δ(i). The set of functions Q(z, α), α ∈ Λ, forms a potentially nonfalsifiable machine for input vectors generated according to the distribution function F(z) if for almost any sequence of pairs (ψ_{δ(i)}(z_i), z_i) obtained on the basis of random vectors z_i and this switching rule δ(i), one can find in this set a function Q(z, α*) that describes these pairs with high accuracy (Fig. 2.8).
Note that this definition of nonfalsifiability generalizes Popper's concept:

(i) In the simplest example considered in Section 2.6.1, for the set of indicator functions Q(z, α), α ∈ Λ, we use this concept of nonfalsifiability with ψ₁(z) = 1 and ψ₀(z) = 0;

(ii) in Theorem 2.5 we can use the functions
$$ \psi_1(z) = \begin{cases} 1 & \text{if } z \in Z^*, \\ Q(z) & \text{if } z \notin Z^*, \end{cases} \qquad \psi_0(z) = \begin{cases} 0 & \text{if } z \in Z^*, \\ Q(z) & \text{if } z \notin Z^*, \end{cases} $$
where Q(z) is some indicator function.

On the basis of this concept of potential nonfalsifiability we formulate the following general theorem, which holds for an arbitrary set of uniformly bounded functions (including sets of indicator functions) (Vapnik and Chervonenkis, 1989).

Theorem 2.6. Suppose that for the set of uniformly bounded real functions Q(z, α), α ∈ Λ, there exists an ε₀ such that the convergence
$$ \lim_{\ell \to \infty} \frac{H^\Lambda(\varepsilon_0; \ell)}{\ell} = c(\varepsilon_0) > 0 $$
is valid.

FIGURE 2.8. A learning machine with the set of functions Q(z, α), α ∈ Λ, is
potentially nonfalsifiable if for any ε > 0 there exist two essentially different
functions ψ₁(z) and ψ₀(z) such that for almost all samples z₁, ..., z_ℓ given by the
generator of examples, and for any values u₁, ..., u_ℓ constructed on the basis of
these curves using the rule u_i = ψ_{δ(z_i)}(z_i), where δ(z) ∈ {0, 1} is an arbitrary
binary function, the machine contains a function Q(z, α*) that satisfies the
inequalities |ψ_{δ(z_i)}(z_i) − Q(z_i, α*)| ≤ ε, i = 1, ..., ℓ.

Then the learning machine with this set of functions is potentially non-
falsifiable.
Thus, if the conditions of Theorem 2.4 fail (in this case, of course, the conditions of Theorem 2.3 will also fail), then the learning machine is nonfalsifiable. This is the main reason why the ERM principle may be inconsistent.

Before continuing with the description of statistical learning theory, let


me remark how amazing Popper's idea was. In the 1930s Popper suggested
a general concept determining the generalization ability (in a very wide
philosophical sense) that in the 1990s turned out to be one of the most
crucial concepts for the analysis of consistency of the ERM inductive prin-
ciple.

2.7 THREE MILESTONES IN LEARNING THEORY

Below we again consider the set of indicator functions Q(z, α), α ∈ Λ (i.e., we consider the problem of pattern recognition). As mentioned above, in the case of indicator functions Q(z, α), α ∈ Λ, the minimal ε-net of the vectors q(α), α ∈ Λ (see Section 2.3.3), does not depend on ε if ε < 1. The number of elements in the minimal ε-net,
$$ N^\Lambda(z_1, \ldots, z_\ell), $$
is equal to the number of different separations of the data z₁, ..., z_ℓ by functions of the set Q(z, α), α ∈ Λ.
For this set of functions the VC entropy also does not depend on ε:
$$ H^\Lambda(\ell) = E \ln N^\Lambda(z_1, \ldots, z_\ell), $$
where the expectation is taken over (z₁, ..., z_ℓ).


Consider two new concepts that are constructed on the basis of the values
of NA(Zl'" ., Zl):
(i) The Annealed VC entropy

H~nn(£) = InENA(zl,"" Zt)j

(ii) The Growth function

G A (£) = In sup NA(zl, ... , Zl).


ZI,···,Zl

These concepts are defined in such a way that for any ℓ the inequalities
$$ H^\Lambda(\ell) \leq H_{\mathrm{ann}}^{\Lambda}(\ell) \leq G^\Lambda(\ell) $$
are valid.
On the basis of these functions the main milestones of learning theory are constructed.
In Section 2.3.4 we introduced the equation
$$ \lim_{\ell \to \infty} \frac{H^\Lambda(\ell)}{\ell} = 0, $$
describing a sufficient condition for consistency of the ERM principle (the necessary and sufficient conditions are given by a slightly different construction (2.13)). This equation is the first milestone in learning theory: we require that any machine minimizing empirical risk satisfy it.
However, this equation says nothing about the rate of convergence of the obtained risks R(α_ℓ) to the minimal one, R(α₀). It is possible to construct examples where the ERM principle is consistent but where the risks have an arbitrarily slow asymptotic rate of convergence.
The question is:
Under what conditions is the asymptotic rate of convergence fast?

We say that the asymptotic rate of convergence is fast if for any ℓ > ℓ₀ the exponential bound
$$ P\{ R(\alpha_\ell) - R(\alpha_0) > \varepsilon \} < e^{-c\varepsilon^2 \ell} $$
holds true, where c > 0 is some constant.
As it turns out, the equation
$$ \lim_{\ell \to \infty} \frac{H_{\mathrm{ann}}^{\Lambda}(\ell)}{\ell} = 0 $$
is a sufficient condition for a fast rate of convergence.⁸ This equation is the second milestone of learning theory: it guarantees a fast asymptotic rate of convergence.
Thus far, we have considered two equations: one that describes a necessary and sufficient condition for the consistency of the ERM method, and one that describes a sufficient condition for a fast rate of convergence of the ERM method. Both equations are valid for a given probability measure F(z) on the observations (both the VC entropy H^Λ(ℓ) and the Annealed VC entropy H_ann^Λ(ℓ) are constructed using this measure). However, our goal is to construct a learning machine capable of solving many different problems (for many different probability measures).
The question is:
Under what conditions is the ERM principle consistent and, simultaneously, does it provide a fast rate of convergence, independent of the probability measure?
The following equation describes the necessary and sufficient condition for consistency of ERM for any probability measure:

$$\lim_{\ell\to\infty}\frac{G^{\Lambda}(\ell)}{\ell}=0.$$

It is also the case that if this condition holds true, then the rate of convergence is fast.
This equation is the third milestone in learning theory. It describes a necessary and sufficient condition under which a learning machine that implements the ERM principle has a high asymptotic rate of convergence independent of the probability measure (i.e., independent of the problem that has to be solved).
These milestones form the foundation for constructing both distribution-independent bounds for the rate of convergence of learning machines, and rigorous distribution-dependent bounds, which we will consider in Chapter 3.

⁸The necessity of this condition for a fast rate of convergence is an open question.
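To make these capacity concepts concrete, consider the simple set of threshold indicator functions Q(z, α) = θ(z − α) on the line: any ℓ distinct points admit exactly ℓ + 1 different separations by such thresholds, so ln N^Λ grows only logarithmically, far below the worst case ℓ ln 2, and all three milestone conditions hold. The following sketch (an illustration added here, not part of the original text) computes these quantities numerically.

```python
import math
import random

def n_separations(points):
    """Number of distinct binary vectors (Q(z_1, a), ..., Q(z_l, a))
    produced by the threshold indicators Q(z, a) = theta(z - a)."""
    zs = sorted(points)
    # One candidate threshold below, between, and above the sample points.
    candidates = ([zs[0] - 1.0]
                  + [(u + v) / 2 for u, v in zip(zs, zs[1:])]
                  + [zs[-1] + 1.0])
    return len({tuple(int(z >= a) for z in points) for a in candidates})

random.seed(0)
for l in (1, 2, 5, 10, 100):
    sample = [random.random() for _ in range(l)]
    n = n_separations(sample)            # equals l + 1 for distinct points
    print(l, n, round(math.log(n), 3),   # ln N grows like ln l ...
          round(l * math.log(2), 3))     # ... far below l ln 2
```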
Informal Reasoning and Comments - 2

In the Introduction as well as in Chapter 1 we discussed the empirical risk minimization method and the methods of density estimation; however, we will not use them for constructing learning algorithms. In Chapter 4 we introduce another inductive principle, which we will use in Chapter 5 for constructing learning algorithms. On the other hand, in Section 1.11 we introduced the stochastic approximation inductive principle, which we did not consider as very important in spite of the fact that some learning procedures (e.g., in neural networks) are based on this principle.
The questions arise:
Why are the ERM principle and the methods of density estimation so important?
Why did we expend such effort describing the necessary and sufficient conditions for consistency of the ERM principle?
In these comments we will try to show that in some sense these two approaches to the problem of function estimation, one based on density estimation methods and the other based on the ERM method, reflect the two most general ideas of statistical inference.
To show this we formulate the general problem of statistics as a problem of estimating the unknown probability measure using the data. We will distinguish between two modes of estimation of probability measures, the so-called strong mode estimation and the so-called weak mode estimation. We show that methods providing strong mode estimations are based on the density estimation approach, while the methods providing weak mode estimation are based on the ERM approach.

The weak mode estimation of probability measures forms one of the most
important problems in the foundations of statistics, the so-called General
Glivenko-Cantelli problem. The results described in Chapter 2 provide a
complete solution to this problem.

2.8 THE BASIC PROBLEMS OF PROBABILITY THEORY AND STATISTICS

In the 1930s Kolmogorov introduced an axiomatization of probability theory (Kolmogorov, 1933), and since that time probability theory has become a purely mathematical (i.e., deductive) discipline: any analysis in this theory can be done on the basis of formal inference from the given axioms. This has allowed the development of a deep analysis of both probability theory and statistics.

2.8.1 Axioms of Probability Theory


According to Kolmogorov's axiomatization of probability theory, to every random experiment there corresponds a set Z of elementary events z which defines all possible outcomes of the experiment (the elementary events). On the set Z of elementary events, a system {A} of subsets A ⊂ Z, which are called events, is defined. Considered as an event, the set Z determines a situation corresponding to a sure event (an event that always occurs). It is assumed that the set {A} contains the empty set ∅, the event that never occurs.
For the elements of {A} the operations union, complement, and intersection are defined. On the set Z a σ-algebra F of events {A} is defined.⁹ The set F of subsets of Z is called a σ-algebra of events A ∈ F if

(i) Z ∈ F;
(ii) if A ∈ F, then Ā ∈ F;
(iii) if A_i ∈ F, then ⋃_{i=1}^∞ A_i ∈ F.
Example. Let us describe a model of the random experiments that are relevant to the following situation: somebody throws two dice, say red and black, and observes the result of the experiment. The space of elementary events Z of this experiment can be described by pairs of

⁹About σ-algebras one can read in any advanced textbook on probability theory. (See, for example, A. N. Schiryaev, Probability, Springer, New York, p. 577.) This concept makes it possible to use the formal tools developed in measure theory for constructing the foundations of probability theory.

FIGURE 2.9. The space of elementary events for a two-dice throw. The events A₁₀ and A_{r>b} are indicated.

integers, where the first number describes the points on the red die and the second number describes the points on the black one. An event in this experiment can be any subset of this set of elementary events. For example, it can be the subset A₁₀ of elementary events for which the sum of points on the two dice is equal to 10, or it can be the subset of elementary events A_{r>b} where the red die has a larger number of points than the black one, etc. (Fig. 2.9).

The pair (Z, F) consisting of the set Z and the σ-algebra F of events A ∈ F is an idealization of the qualitative aspect of random experiments.
The quantitative aspect of experiments is determined by a probability measure P(A) defined on the elements A of the set F. The function P(A) defined on the elements A ∈ F is called a countably additive probability measure on F or, for simplicity, a probability measure, provided

(i) P(A) ≥ 0;

(ii) P(Z) = 1;

(iii) P(⋃_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i) if A_i, A_j ∈ F and A_i ∩ A_j = ∅ for all i ≠ j.


We say that a probabilistic model of an experiment is determined if the
probability space defined by the triple (Z, F, P) is determined.
58 2. Informal Reasoning and Comments - 2

Example. In our experiment let us consider symmetrical dice, where all elementary events are equally possible (have the probability 1/36). Then the probabilities of all events are defined. (The event A₁₀ has probability 3/36; the event A_{r>b} has probability 15/36.)
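These probabilities are easy to verify by direct enumeration of the 36 elementary events; the following sketch (not part of the original text) does exactly that.

```python
from fractions import Fraction

# Space of elementary events: all pairs (red, black).
Z = [(r, b) for r in range(1, 7) for b in range(1, 7)]

A10 = [z for z in Z if z[0] + z[1] == 10]     # sum of points equals 10
Ar_gt_b = [z for z in Z if z[0] > z[1]]       # red shows more points than black

print(Fraction(len(A10), len(Z)))       # 3/36 = 1/12
print(Fraction(len(Ar_gt_b), len(Z)))   # 15/36 = 5/12
```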
In probability theory and in the theory of statistics the concept of independent trials¹⁰ plays a crucial role.
Consider the experiment containing ℓ distinct trials with probability space (Z, F, P) and let

z₁, …, z_ℓ (2.14)

be the results of these trials. For an experiment with ℓ trials the model (Z^ℓ, F^ℓ, P^ℓ) can be considered, where Z^ℓ is the space of all possible outcomes (2.14), F^ℓ is a σ-algebra on Z^ℓ that contains the sets A_{k₁} × ⋯ × A_{k_ℓ}, and P^ℓ is a probability measure defined on the elements of the σ-algebra F^ℓ.
We say that the sequence (2.14) is a sequence of ℓ independent trials if for any A_{k₁}, …, A_{k_ℓ} ∈ F the equality

$$P^\ell\{z_1\in A_{k_1};\ldots;z_\ell\in A_{k_\ell}\}=\prod_{i=1}^{\ell}P\{z_i\in A_{k_i}\}$$

is valid.
Let (2.14) be the result of ℓ independent trials with the model (Z, F, P). Consider the random variable ν(z₁, …, z_ℓ; A) defined for a fixed event A ∈ F by the value

$$\nu_\ell(A)=\nu(z_1,\ldots,z_\ell;A)=\frac{n_A}{\ell},$$

where n_A is the number of elements of the set z₁, …, z_ℓ belonging to event A. The random variable ν_ℓ(A) is called the frequency of occurrence of an event A in a series of ℓ independent, random trials.
In terms of these concepts we can formulate the basic problems of probability theory and the theory of statistics.
The basic problem of probability theory
Given a model (Z, F, P) and an event A*, estimate the distribution (or some of its characteristics) of the frequency of occurrence of the event A* in a series of ℓ independent random trials. Formally, this amounts to finding the distribution function

$$F(\xi;A^*,\ell)=P\{\nu_\ell(A^*)<\xi\} \qquad (2.15)$$

(or some functionals depending on this function).

¹⁰The concept of independent trials actually is the one which makes probability theory different from measure theory. Without the concept of independent trials the axioms of probability theory define a model from the theory of measure.

Example. In our example with two dice it can be the following problem: What is the probability that the frequency of event A₁₀ (sum of points equals 10) will be less than ξ if one throws the dice ℓ times?
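The distribution (2.15) can also be approximated by simulation. The sketch below (an illustration, not from the book) estimates P{ν_ℓ(A₁₀) < ξ} for the symmetrical dice by repeating a series of ℓ independent throws many times; the values of ℓ, ξ, and the number of repetitions are arbitrary choices made for this example.

```python
import random

random.seed(1)

def frequency_A10(l):
    """Frequency of the event 'sum of two dice equals 10' in l trials."""
    hits = sum(1 for _ in range(l)
               if random.randint(1, 6) + random.randint(1, 6) == 10)
    return hits / l

l, xi, repeats = 100, 0.1, 20000
estimate = sum(frequency_A10(l) < xi for _ in range(repeats)) / repeats
print(estimate)   # Monte Carlo estimate of F(xi; A10, l) in Eq. (2.15)
```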

In the theory of statistics one faces the inverse problem.
The basic problem of the theory of statistics
Given a qualitative model of random experiments (Z, F) and given the i.i.d. data

z₁, …, z_ℓ, …,

which occurred according to an unknown probability measure P, estimate the probability measure P defined on all subsets A ∈ F (or some functionals depending on this function).
Example. Let our two dice now be nonsymmetrical and somehow connected to each other (say connected by a thread). The problem is, given the results of ℓ trials (ℓ pairs), to estimate the probability measure for all events (subsets) A ∈ F.

In this book we consider a set of elementary events Z ⊂ Rⁿ where the σ-algebra F is defined to contain all Borel sets¹¹ on Z.

2.9 TWO MODES OF ESTIMATING A PROBABILITY MEASURE

One can define two modes of estimating a probability measure: a strong mode and a weak mode.
Definition:
(i) We say that the estimator

ℰ_ℓ(A) = ℰ_ℓ(z₁, …, z_ℓ; A), A ∈ F,

estimates the probability measure P in the strong mode if

$$\sup_{A\in F}|P(A)-\mathcal{E}_\ell(A)|\xrightarrow[\ell\to\infty]{}0. \qquad (2.16)$$

(ii) We say that the estimator ℰ_ℓ(A) estimates the probability measure P in the weak mode determined by some subset F* ⊂ F if

$$\sup_{A\in F^*}|P(A)-\mathcal{E}_\ell(A)|\xrightarrow[\ell\to\infty]{P}0, \qquad (2.17)$$

where the subset F* (of the set F) does not necessarily form a σ-algebra.

¹¹We consider the minimal σ-algebra that contains all open parallelepipeds.


FIGURE 2.10. The Lebesgue integral defined in (2.18) is the limit of a sum of products, where the factor P{Q(z, α) > iB/m} is the (probability) measure of the set {z : Q(z, α) > iB/m} and the factor B/m is the height of a slice.


For our reasoning it is important that if one can estimate the probability measure in one of these modes (with respect to a special set F* described below for the weak mode), then one can minimize the risk functional in a given set of functions.
Indeed, consider the case of bounded risk functions 0 ≤ Q(z, α) ≤ B. Let us rewrite the risk functional in an equivalent form, using the definition of the Lebesgue integral (Fig. 2.10):

$$R(\alpha)=\int Q(z,\alpha)\,dP(z)=\lim_{m\to\infty}\sum_{i=1}^{m}\frac{B}{m}\,P\left\{Q(z,\alpha)>\frac{iB}{m}\right\}. \qquad (2.18)$$

If the estimator ℰ_ℓ(A) approximates P(A) well in the strong mode, i.e., approximates uniformly well the probability of any event A (including the events A_{α,i} = {Q(z, α) > iB/m}), then the functional

$$R^*(\alpha)=\lim_{m\to\infty}\sum_{i=1}^{m}\frac{B}{m}\,\mathcal{E}_\ell\left\{Q(z,\alpha)>\frac{iB}{m}\right\}, \qquad (2.19)$$

constructed on the basis of the probability measure ℰ_ℓ(A) estimated from the data, approximates uniformly well (for any α) the risk functional R(α). Therefore it can be used for choosing the function that minimizes the risk. The empirical risk functional R_emp(α) considered in Chapters 1 and 2 corresponds to the case where the estimator ℰ_ℓ(A) in Eq. (2.19) evaluates the frequency of event A from the given data.
Note, however, that to approximate Eq. (2.18) by Eq. (2.19) on the given set of functions Q(z, α), α ∈ Λ, one does not need uniform approximation of P on all events A of the σ-algebra; one only needs uniform approximation on the events

A_{α,i} = {Q(z, α) > iB/m}

(only these events come into the evaluation of the risk (2.18)). Therefore, to find the function providing the minimum of the risk functional, the weak mode approximation of the probability measure with respect to the set of events

F* = {A_{α,i} : α ∈ Λ, i = 1, …, m}

is sufficient.
Thus, in order to find the function that minimizes the risk (Eq. (2.18)) with unknown probability measure P{A}, one can minimize the functional (2.19), where instead of P{A} an approximation ℰ_ℓ{A} that converges to P{A} in either mode (with respect to the events A_{α,i}, α ∈ Λ, i = 1, …, m, for the weak mode) is used.
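The following sketch (an illustration with a toy loss and distribution chosen only for this example) shows the mechanism: computing the Lebesgue sum (2.19) with empirical frequencies in place of the probabilities P{Q(z, α) > iB/m} gives nearly the same number as the ordinary empirical mean of the losses.

```python
import random

random.seed(2)
B, m, l = 1.0, 200, 2000

def Q(z):
    """A toy bounded loss 0 <= Q(z) <= B for one fixed alpha (assumption)."""
    return z * z

sample = [random.random() for _ in range(l)]   # i.i.d. z ~ U[0, 1]

# Weak-mode estimate: empirical frequency of each event {z : Q(z) > iB/m}.
lebesgue_sum = sum(
    (B / m) * (sum(Q(z) > i * B / m for z in sample) / l)
    for i in range(1, m + 1))

empirical_risk = sum(Q(z) for z in sample) / l
print(lebesgue_sum, empirical_risk)   # the two estimates nearly coincide
```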

2.10 STRONG MODE ESTIMATION OF PROBABILITY MEASURES AND THE DENSITY ESTIMATION PROBLEM

Unfortunately, there is no estimator that can estimate an arbitrary probability measure in the strong mode. One can estimate a probability measure if for this measure there exists a density function (Radon-Nikodym derivative). Let us assume that a density function p(z) exists, and let p_ℓ(z) be an approximation to this density function. Consider the estimator

$$\mathcal{E}_\ell(A)=\int_A p_\ell(z)\,dz.$$

According to Scheffé's theorem, for this estimator the bound

$$\sup_{A\in F}|P(A)-\mathcal{E}_\ell(A)|\le\frac{1}{2}\int|p(z)-p_\ell(z)|\,dz$$

is valid, i.e., the strong mode distance between the approximation of the probability measure and the actual measure is bounded by the L₁ distance between the approximation of the density and the actual density.
Thus, to estimate the probability measure in the strong mode, it is suffi-
cient to estimate a density function. In Section 1.8 we stressed that estimat-
ing a density function from the data forms an ill-posed problem. Therefore,
generally speaking, one can not guarantee a good approximation using a
fixed number of observations.
Fortunately, as we saw above, to estimate the function that minimizes the
risk-functional one does not necessarily need to approximate the density.
It is sufficient to approximate the probability measure in the weak mode
where the set of events F* depends on the admissible set of functions
Q(z, a), a E A: it must contain the events

{Q(z,a) > ~}, a E A, i = 1, ... ,m.


The "smaller" the set of admissible events considered, the "smaller" the
set of events :F* which must be taken into account for the weak approx-
imation, and therefore (as we will see) minimizing the risk on a smaller
set of functions requires fewer observations. In Chapter 3 we will describe
bounds on the rate of uniform convergence that depend on the capacity of
the set of admissible events.

2.11 THE GLIVENKO-CANTELLI THEOREM AND ITS GENERALIZATION

In the 1930s Glivenko and Cantelli proved a theorem that can be considered as the most important result in the foundation of statistics. They proved that any probability distribution function of one random variable ξ,

F(z) = P{ξ < z},

can be approximated arbitrarily well by the empirical distribution function

$$F_\ell(z)=\frac{1}{\ell}\sum_{i=1}^{\ell}\theta(z-z_i),$$

where z₁, …, z_ℓ is i.i.d. data obtained according to the unknown distribution¹² (Fig. 1.2). More precisely, the Glivenko-Cantelli theorem asserts that for any ε > 0 the equality

$$\lim_{\ell\to\infty}P\{\sup_z|F(z)-F_\ell(z)|>\varepsilon\}=0$$

(convergence in probability¹³) holds true.

¹²The generalization for n > 1 variables was obtained later.
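The theorem is easy to observe numerically. The sketch below (an illustration, not from the book) draws uniform samples, for which F(z) = z on [0, 1], and computes the sup deviation of the empirical distribution function; the deviation shrinks as ℓ grows.

```python
import random

random.seed(3)

def sup_deviation(l):
    """sup_z |F(z) - F_l(z)| for U[0,1] data, where F(z) = z.
    The supremum is attained just before or at a sample point."""
    zs = sorted(random.random() for _ in range(l))
    return max(max(z - i / l, (i + 1) / l - z) for i, z in enumerate(zs))

for l in (10, 100, 1000, 10000):
    print(l, sup_deviation(l))   # decreases roughly like 1/sqrt(l)
```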



Let us formulate the Glivenko-Cantelli theorem in a different form. Consider the set of events

A_{z̄} = {z : z < z̄}, z̄ ∈ (−∞, ∞) (2.20)

(the set of rays on the line pointing to −∞). For any event A_{z̄} of this set of events one can evaluate its probability

P(A_{z̄}) = F(z̄). (2.21)

Using an i.i.d. sample of size ℓ one can also estimate the frequency of occurrence of the event A_{z̄} in independent trials:

ν_ℓ(A_{z̄}) = F_ℓ(z̄). (2.22)

In these terms, the Glivenko-Cantelli theorem asserts weak mode convergence of estimator (2.22) to probability measure (2.21) with respect to the set of events (2.20) (weak, because only a subset of all events is considered).
To justify the ERM inductive principle for various sets of indicator functions (for the pattern recognition problem), we constructed in this chapter a general theory of uniform convergence of frequencies to probabilities on arbitrary sets of events. This theory completed the analysis of the weak mode approximation of probability measures that was started by the Glivenko-Cantelli theory for a particular set of events (2.20).
The generalization of these results to the uniform convergence of means
to their mathematical expectations over sets of functions which was ob-
tained in 1981 actually started research on the general type of empirical
processes.

2.12 MATHEMATICAL THEORY OF INDUCTION


In spite of significant results obtained in the foundations of theoretical statistics, the main conceptual problem of learning theory remained unsolved for more than 20 years (from 1968 up to 1989):
Does the uniform convergence of means to their expectations form a necessary and sufficient condition for consistency of the ERM inductive principle, or is this condition only sufficient? In the latter case, might there exist another, less restrictive, sufficient condition?

¹³Actually a stronger mode of convergence holds true, the so-called convergence "almost surely".

The answer was not obvious. Indeed, uniform convergence constitutes a global property of the set of functions, while one could have expected that consistency of the ERM principle is determined by local properties of a subset of the set of functions, close to the desired one.
Using the concept of nontrivial consistency, we showed in 1989 that consistency is a global property of the admissible set of functions, determined by one-sided uniform convergence (Vapnik and Chervonenkis, 1989). We found the necessary and sufficient conditions for one-sided convergence.
The proof of these conditions is based on a new circle of ideas: ideas on nonfalsifiability which appear in philosophical discussions on inductive inference. In these discussions, however, induction was not considered as a part of statistical inference. Induction was considered as a tool for inference in more general frameworks than the framework of statistical models.
Chapter 3
Bounds on the Rate of Convergence of Learning Processes

In this chapter we consider bounds on the rate of uniform convergence. We consider upper bounds (there exist lower bounds as well (Vapnik and Chervonenkis, 1974); however, they are not as important for controlling the learning processes as the upper bounds).
Using two different capacity concepts described in Chapter 2 (the Annealed entropy function and the Growth function) we describe two types of bounds on the rate of convergence:
(i) Distribution-dependent bounds (based on the Annealed entropy function), and
(ii) distribution-independent bounds (based on the Growth function).
These bounds, however, are nonconstructive, since the theory does not give explicit methods to evaluate the Annealed entropy function or the Growth function.
Therefore, we introduce a new characteristic of the capacity of a set of functions (the VC dimension of a set of functions), which is a scalar value that can be evaluated for any set of functions accessible to a learning machine.
On the basis of the VC dimension concept we obtain
(iii) Constructive distribution-independent bounds.
Writing these bounds in equivalent form we find the bounds on the risk
achieved by a learning machine (i.e., we estimate the generalization ability
of a learning machine). In Chapter 4 we will use these bounds to control
the generalization ability of learning machines.

3.1 THE BASIC INEQUALITIES


We start the description of the results of the theory of bounds with the case where Q(z, α), α ∈ Λ is a set of indicator functions, and then generalize the results for sets of real functions.
Let Q(z, α), α ∈ Λ be a set of indicator functions, H^Λ(ℓ) be the corresponding VC entropy, H_ann^Λ(ℓ) be the Annealed entropy, and G^Λ(ℓ) be the Growth function (see Section 2.7).
The following two bounds on the rate of uniform convergence form the basic inequalities in the theory of bounds (Vapnik and Chervonenkis, 1968, 1971), (Vapnik, 1979, 1996).
Theorem 3.1. The following inequality holds true:

$$P\left\{\sup_{\alpha\in\Lambda}\left|\int Q(z,\alpha)\,dF(z)-\frac{1}{\ell}\sum_{i=1}^{\ell}Q(z_i,\alpha)\right|>\varepsilon\right\}\le 4\exp\left\{\left(\frac{H_{\rm ann}^{\Lambda}(2\ell)}{\ell}-\varepsilon^2\right)\ell\right\}. \qquad (3.1)$$

Theorem 3.2. The following inequality holds true:

$$P\left\{\sup_{\alpha\in\Lambda}\frac{\int Q(z,\alpha)\,dF(z)-\frac{1}{\ell}\sum_{i=1}^{\ell}Q(z_i,\alpha)}{\sqrt{\int Q(z,\alpha)\,dF(z)}}>\varepsilon\right\}\le 4\exp\left\{\left(\frac{H_{\rm ann}^{\Lambda}(2\ell)}{\ell}-\frac{\varepsilon^2}{4}\right)\ell\right\}. \qquad (3.2)$$

The bounds are nontrivial (i.e., for any ε > 0 the right-hand side tends to zero when the number of observations ℓ goes to infinity) if

$$\lim_{\ell\to\infty}\frac{H_{\rm ann}^{\Lambda}(\ell)}{\ell}=0.$$

(Recall that in Section 2.7 we called this condition the second milestone of learning theory.)

To discuss the difference between these two bounds, let us recall that for any indicator function Q(z, α) the risk functional

$$R(\alpha)=\int Q(z,\alpha)\,dF(z)$$

describes the probability of the event {z : Q(z, α) = 1}, while the empirical functional R_emp(α) describes the frequency of this event.

Theorem 3.1 estimates the rate of uniform convergence with respect to the norm of the deviation between probability and frequency. It is clear that the maximal difference more likely occurs for the events with the maximal variance. For this Bernoulli case the variance is equal to

$$\sigma=\sqrt{R(\alpha)(1-R(\alpha))},$$

and therefore the maximum of the variance is achieved for the events with probability R(α*) ≈ 1/2. In other words, the largest deviations are associated with functions that possess large risk.
In Section 3.3, using the bound on the rate of convergence, we will obtain a bound on the risk where the confidence interval is determined by the rate of uniform convergence, i.e., by the function with risk R(α*) ≈ 1/2 (the "worst" function in the set).
To obtain a smaller confidence interval one can try to construct the bound on the risk using a bound for another type of uniform convergence, namely, the uniform relative convergence

$$P\left\{\sup_{\alpha\in\Lambda}\frac{|R(\alpha)-R_{\rm emp}(\alpha)|}{\sqrt{R(\alpha)(1-R(\alpha))}}>\varepsilon\right\}<\Phi(\varepsilon,\ell),$$

where the deviation is normalized by the variance. The supremum in the uniform relative convergence can be achieved on any function Q(z, α), including one that has a small risk.
Technically, however, it is difficult to estimate well the right-hand side of this bound. One can obtain a good bound for simpler cases, where instead of normalization by the variance one considers normalization by the function √R(α). This function is close to the variance when R(α) is reasonably small (this is exactly the case that we are interested in). To obtain better coefficients for the bound one considers the difference rather than the modulus of the difference in the numerator. This case of relative uniform convergence is considered in Theorem 3.2.
In Section 3.4 we will demonstrate that the upper bound on the risk obtained using Theorem 3.2 is much better than the upper bound on the risk obtained on the basis of Theorem 3.1.
The bounds obtained in Theorems 3.1 and 3.2 are distribution dependent: they are valid for a given distribution function F(z) on the observations (the distribution was used in constructing the Annealed entropy function H_ann^Λ(ℓ)).
To construct distribution-independent bounds it is sufficient to note that for any distribution function F(z) the Growth function is not less than the Annealed entropy:

$$G^{\Lambda}(\ell)\ge H_{\rm ann}^{\Lambda}(\ell).$$

Therefore, for any distribution function F(z), the following inequalities hold true:

$$P\left\{\sup_{\alpha\in\Lambda}\left|\int Q(z,\alpha)\,dF(z)-\frac{1}{\ell}\sum_{i=1}^{\ell}Q(z_i,\alpha)\right|>\varepsilon\right\}\le 4\exp\left\{\left(\frac{G^{\Lambda}(2\ell)}{\ell}-\varepsilon^2\right)\ell\right\}, \qquad (3.3)$$

$$P\left\{\sup_{\alpha\in\Lambda}\frac{\int Q(z,\alpha)\,dF(z)-\frac{1}{\ell}\sum_{i=1}^{\ell}Q(z_i,\alpha)}{\sqrt{\int Q(z,\alpha)\,dF(z)}}>\varepsilon\right\}\le 4\exp\left\{\left(\frac{G^{\Lambda}(2\ell)}{\ell}-\frac{\varepsilon^2}{4}\right)\ell\right\}. \qquad (3.4)$$

These inequalities are nontrivial if

$$\lim_{\ell\to\infty}\frac{G^{\Lambda}(\ell)}{\ell}=0. \qquad (3.5)$$

(Recall that in Section 2.7 we called this equation the third milestone in learning theory.)
It is important to note that condition (3.5) is necessary and sufficient for distribution-free uniform convergence (3.3). In particular, if condition (3.5) is violated, then there exist probability measures F(z) on Z for which uniform convergence

$$\sup_{\alpha\in\Lambda}\left|\int Q(z,\alpha)\,dF(z)-\frac{1}{\ell}\sum_{i=1}^{\ell}Q(z_i,\alpha)\right|\xrightarrow[\ell\to\infty]{P}0$$

does not take place.

3.2 GENERALIZATION FOR THE SET OF REAL FUNCTIONS

There are several ways to generalize the results obtained for the set of indicator functions to the set of real functions. Below we consider the simplest and most effective (it gives better bounds and is valid for sets of unbounded real functions) (Vapnik, 1979, 1996).
Let Q(z, α), α ∈ Λ now be a set of real functions, with

$$A=\inf_{\alpha,z}Q(z,\alpha)\le Q(z,\alpha)\le\sup_{\alpha,z}Q(z,\alpha)=B$$

FIGURE 3.1. The indicator of level β for the function Q(z, α) shows for which z the function Q(z, α) exceeds β and for which it does not. The function Q(z, α) can be described by the set of all its indicators.

(here A can be −∞ and/or B can be +∞). We denote the open interval (A, B) by ℬ. Let us construct a set of indicators (Fig. 3.1) of the set of real functions Q(z, α), α ∈ Λ:

I(z, α, β) = θ{Q(z, α) − β}, α ∈ Λ, β ∈ ℬ.

For a given function Q(z, α*) and for a given β* the indicator I(z, α*, β*) indicates by 1 the region z ∈ Z where Q(z, α*) ≥ β* and indicates by 0 the region z ∈ Z where Q(z, α*) < β*.
In the case where Q(z, α), α ∈ Λ are indicator functions, the set of indicators I(z, α, β), α ∈ Λ, β ∈ (0, 1) coincides with the set Q(z, α), α ∈ Λ.
For any given set of real functions Q(z, α), α ∈ Λ we will extend the results of the previous section by considering the corresponding set of indicators I(z, α, β), α ∈ Λ, β ∈ ℬ.
Let H^{Λ,ℬ}(ℓ) be the VC entropy for the set of indicators, H_ann^{Λ,ℬ}(ℓ) be the Annealed entropy for the set, and G^{Λ,ℬ}(ℓ) be the Growth function.
Using these concepts we obtain the basic inequalities for the set of real functions as generalizations of inequalities (3.1) and (3.2). In our generalization we distinguish between three cases:

(i) Totally bounded functions Q(z, α), α ∈ Λ.

(ii) Totally bounded non-negative functions Q(z, α), α ∈ Λ.

(iii) Non-negative (not necessarily bounded) functions Q(z, α), α ∈ Λ.

Below we consider the bounds for all three cases.

(i) Let A ≤ Q(z, α) ≤ B, α ∈ Λ be a set of totally bounded functions. Then the following inequality holds true:

$$P\left\{\sup_{\alpha\in\Lambda}\left|\int Q(z,\alpha)\,dF(z)-\frac{1}{\ell}\sum_{i=1}^{\ell}Q(z_i,\alpha)\right|>\varepsilon\right\}\le 4\exp\left\{\left(\frac{H_{\rm ann}^{\Lambda,\mathcal{B}}(2\ell)}{\ell}-\frac{\varepsilon^2}{(B-A)^2}\right)\ell\right\}. \qquad (3.6)$$

(ii) Let 0 ≤ Q(z, α) ≤ B, α ∈ Λ be a set of totally bounded non-negative functions. Then the following inequality holds true:

$$P\left\{\sup_{\alpha\in\Lambda}\frac{\int Q(z,\alpha)\,dF(z)-\frac{1}{\ell}\sum_{i=1}^{\ell}Q(z_i,\alpha)}{\sqrt{\int Q(z,\alpha)\,dF(z)}}>\varepsilon\right\}\le 4\exp\left\{\left(\frac{H_{\rm ann}^{\Lambda,\mathcal{B}}(2\ell)}{\ell}-\frac{\varepsilon^2}{4B}\right)\ell\right\}. \qquad (3.7)$$

These inequalities are direct generalizations of the inequalities obtained in Theorems 3.1 and 3.2 for the set of indicator functions. They coincide with inequalities (3.1) and (3.2) when Q(z, α) ∈ {0, 1}.
(iii) Let 0 ≤ Q(z, α), α ∈ Λ be a set of functions such that for some p > 2 the pth normalized moments¹ of the random variables ξ_α = Q(z, α) exist:

$$m_p(\alpha)=\sqrt[p]{\int Q^p(z,\alpha)\,dF(z)}.$$

Then the following bound holds true:

$$P\left\{\sup_{\alpha\in\Lambda}\frac{\int Q(z,\alpha)\,dF(z)-\frac{1}{\ell}\sum_{i=1}^{\ell}Q(z_i,\alpha)}{m_p(\alpha)}>a(p)\,\varepsilon\right\}\le 4\exp\left\{\left(\frac{H_{\rm ann}^{\Lambda,\mathcal{B}}(2\ell)}{\ell}-\frac{\varepsilon^2}{4}\right)\ell\right\}, \qquad (3.8)$$

where

$$a(p)=\sqrt[p]{\frac{1}{2}\left(\frac{p-1}{p-2}\right)^{p-1}}. \qquad (3.9)$$

¹We consider p > 2 only to simplify the formulas. Analogous results hold true for p > 1 (Vapnik, 1979, 1996).

The bounds (3.6), (3.7), and (3.8) are nontrivial if

$$\lim_{\ell\to\infty}\frac{H_{\rm ann}^{\Lambda,\mathcal{B}}(\ell)}{\ell}=0.$$

3.3 THE MAIN DISTRIBUTION-INDEPENDENT BOUNDS

The bounds (3.6), (3.7), and (3.8) were distribution dependent: the right-hand sides of the bounds use the Annealed entropy H_ann^{Λ,ℬ}(ℓ), which is constructed on the basis of the distribution function F(z). To obtain distribution-independent bounds one replaces the Annealed entropy H_ann^{Λ,ℬ}(ℓ) on the right-hand sides of bounds (3.6), (3.7), (3.8) with the Growth function G^{Λ,ℬ}(ℓ). Since for any distribution function the Growth function G^{Λ,ℬ}(ℓ) is not smaller than the Annealed entropy H_ann^{Λ,ℬ}(ℓ), the new bounds will be truly independent of the distribution function F(z).
Therefore, one can obtain the following distribution-independent bounds on the rate of various types of uniform convergence:
(i) For the set of totally bounded functions −∞ < A ≤ Q(z, α) ≤ B < ∞:

$$P\left\{\sup_{\alpha\in\Lambda}\left|\int Q(z,\alpha)\,dF(z)-\frac{1}{\ell}\sum_{i=1}^{\ell}Q(z_i,\alpha)\right|>\varepsilon\right\}\le 4\exp\left\{\left(\frac{G^{\Lambda,\mathcal{B}}(2\ell)}{\ell}-\frac{\varepsilon^2}{(B-A)^2}\right)\ell\right\}. \qquad (3.10)$$

(ii) For the set of non-negative totally bounded functions 0 ≤ Q(z, α) ≤ B < ∞:

$$P\left\{\sup_{\alpha\in\Lambda}\frac{\int Q(z,\alpha)\,dF(z)-\frac{1}{\ell}\sum_{i=1}^{\ell}Q(z_i,\alpha)}{\sqrt{\int Q(z,\alpha)\,dF(z)}}>\varepsilon\right\}\le 4\exp\left\{\left(\frac{G^{\Lambda,\mathcal{B}}(2\ell)}{\ell}-\frac{\varepsilon^2}{4B}\right)\ell\right\}. \qquad (3.11)$$

(iii) For the set of non-negative real functions 0 ≤ Q(z, α) whose pth normalized moment exists for some p > 2:

$$P\left\{\sup_{\alpha\in\Lambda}\frac{\int Q(z,\alpha)\,dF(z)-\frac{1}{\ell}\sum_{i=1}^{\ell}Q(z_i,\alpha)}{\sqrt[p]{\int Q^p(z,\alpha)\,dF(z)}}>a(p)\,\varepsilon\right\}\le 4\exp\left\{\left(\frac{G^{\Lambda,\mathcal{B}}(2\ell)}{\ell}-\frac{\varepsilon^2}{4}\right)\ell\right\}. \qquad (3.12)$$

These inequalities are nontrivial if

$$\lim_{\ell\to\infty}\frac{G^{\Lambda,\mathcal{B}}(\ell)}{\ell}=0. \qquad (3.13)$$

Using these inequalities one can establish bounds on the generalization ability of different learning machines.

3.4 BOUNDS ON THE GENERALIZATION ABILITY OF LEARNING MACHINES

To describe the generalization ability of learning machines that implement the ERM principle one has to answer two questions:

(A) What actual risk R(α_ℓ) is provided by the function Q(z, α_ℓ) that achieves the minimal empirical risk R_emp(α_ℓ)?

(B) How close is this risk to the minimal possible inf_{α∈Λ} R(α) for the given set of functions?

Answers to both questions can be obtained using the bounds described above. Below we describe distribution-independent bounds on the generalization ability of learning machines that implement sets of totally bounded functions, totally bounded non-negative functions, and arbitrary sets of non-negative functions. These bounds are another form of writing the bounds given in the previous section.
To describe these bounds we use the notation

$$\mathcal{E}=4\,\frac{G^{\Lambda,\mathcal{B}}(2\ell)-\ln(\eta/4)}{\ell}. \qquad (3.14)$$

Note that the bounds are nontrivial when 𝓔 < 1.


Case 1. The set of totally bounded functions.
Let A ≤ Q(z, α) ≤ B, α ∈ Λ, be a set of totally bounded functions. Then

(A) The following inequalities hold with probability of at least 1 − η simultaneously for all functions Q(z, α), α ∈ Λ (including the function that minimizes the empirical risk):

$$R(\alpha)\le R_{\rm emp}(\alpha)+\frac{B-A}{2}\sqrt{\mathcal{E}}, \qquad (3.15)$$

$$R_{\rm emp}(\alpha)-\frac{B-A}{2}\sqrt{\mathcal{E}}\le R(\alpha).$$

(These bounds are equivalent to the bound on the rate of uniform convergence (3.10).)

(B) The following inequality holds with probability of at least 1 − 2η for the function Q(z, α_ℓ) that minimizes the empirical risk:

$$R(\alpha_\ell)-\inf_{\alpha\in\Lambda}R(\alpha)\le(B-A)\sqrt{\frac{-\ln\eta}{2\ell}}+\frac{B-A}{2}\sqrt{\mathcal{E}}. \qquad (3.16)$$

Case 2. The set of totally bounded non-negative functions.
Let 0 ≤ Q(z, α) ≤ B, α ∈ Λ be a set of non-negative bounded functions. Then

(A) The following inequality holds with probability of at least 1 − η simultaneously for all functions Q(z, α) ≤ B, α ∈ Λ (including the function that minimizes the empirical risk):

$$R(\alpha)\le R_{\rm emp}(\alpha)+\frac{B\mathcal{E}}{2}\left(1+\sqrt{1+\frac{4R_{\rm emp}(\alpha)}{B\mathcal{E}}}\right). \qquad (3.17)$$

(This bound is equivalent to the bound on the rate of uniform convergence (3.11).)

(B) The following inequality holds with probability of at least 1 − 2η for the function Q(z, α_ℓ) that minimizes the empirical risk:

(3.18)

Case 3. The set of unbounded non-negative functions.
Finally, consider the set of unbounded non-negative functions 0 ≤ Q(z, α), α ∈ Λ.
It is easy to show (by constructing examples) that without additional information about the set of unbounded functions and/or probability measures it is impossible to obtain any inequalities describing the generalization ability of learning machines. Below we assume the following information: we are given a pair (p, τ) such that the inequality

$$\sup_{\alpha\in\Lambda}\frac{\left(\int Q^p(z,\alpha)\,dF(z)\right)^{1/p}}{\int Q(z,\alpha)\,dF(z)}\le\tau<\infty \qquad (3.19)$$

holds true², where p > 1.
The main result of the theory of learning machines with unbounded sets of functions is the following assertion, which for simplicity we will describe for the case p > 2 (the results for the case p > 1 can be found in (Vapnik, 1979, 1996)):

(A) With probability of at least 1 − η the inequality

$$R(\alpha)\le\frac{R_{\rm emp}(\alpha)}{\left(1-a(p)\tau\sqrt{\mathcal{E}}\right)_{+}}, \qquad (3.20)$$

where

$$a(p)=\sqrt[p]{\frac{1}{2}\left(\frac{p-1}{p-2}\right)^{p-1}},$$

holds true simultaneously for all functions satisfying (3.19), where (u)₊ = max(u, 0). (This bound is a corollary of the bound on the rate of uniform convergence (3.12) and constraint (3.19).)

(B) With probability of at least 1 − 2η the inequality

$$\frac{R(\alpha_\ell)-\inf_{\alpha\in\Lambda}R(\alpha)}{\inf_{\alpha\in\Lambda}R(\alpha)}<\frac{\tau a(p)\sqrt{\mathcal{E}}}{\left(1-\tau a(p)\sqrt{\mathcal{E}}\right)_{+}}+O\left(\frac{1}{\ell}\right) \qquad (3.21)$$

holds for the function Q(z, α_ℓ) that minimizes the empirical risk.

The inequalities (3.15), (3.17), and (3.20) bound the risks for all functions in the set Q(z, α), α ∈ Λ, including the function Q(z, α_ℓ) that minimizes the empirical risk. The inequalities (3.16), (3.18), and (3.21) evaluate how close the risk obtained by using the ERM principle is to the smallest possible risk.
Note that if 𝓔 < 1, then bound (3.17), obtained from the rate of uniform relative deviation, is much better than bound (3.15), obtained from the rate of uniform convergence: for a small value of empirical risk the bound (3.17) has a confidence interval whose order of magnitude is 𝓔, and not √𝓔 as in bound (3.15).

²This inequality describes some general properties of the distribution functions of the random variables ξ_α = Q(z, α) generated by F(z). It describes the "tails of the distributions" (the probability of large values for the random variables ξ_α). If the inequality (3.19) with p ≥ 2 holds, then the distributions have so-called "light tails" (large values of ξ_α do not occur very often). In this case a fast rate of convergence is possible. If, however, the inequality (3.19) holds only for p < 2 (large values ξ_α occur rather often), then the rate of convergence will be slow (it will be arbitrarily slow if p is sufficiently close to one).

3.5 THE STRUCTURE OF THE GROWTH FUNCTION

The bounds on the generalization ability of learning machines presented above are to be thought of as conceptual rather than constructive. To make them constructive one has to find a way to evaluate the Annealed entropy H_ann^Λ(ℓ) and/or the Growth function G^Λ(ℓ) for a given set of functions Q(z, α), α ∈ Λ.
We will find constructive bounds by using the concept of the VC dimension of the set of functions Q(z, α), α ∈ Λ (an abbreviation for Vapnik-Chervonenkis dimension).
The remarkable connection between the concept of VC dimension and the Growth function was discovered in 1968 (Vapnik and Chervonenkis, 1968, 1971).
Theorem 3.3. Any Growth function either satisfies the equality

$$G^{\Lambda}(\ell)=\ell\ln 2$$

or is bounded by the inequality

$$G^{\Lambda}(\ell)\le h\left(\ln\frac{\ell}{h}+1\right),$$

where h is an integer such that when ℓ = h

$$G^{\Lambda}(h)=h\ln 2,$$

$$G^{\Lambda}(h+1)<(h+1)\ln 2.$$

In other words, the Growth function is either linear or is bounded by a logarithmic function. (The Growth function cannot, for example, be of the form G^Λ(ℓ) = c√ℓ (Fig. 3.2).)
Definition. We will say that the VC dimension of the set of indicator functions Q(z, α), α ∈ Λ is infinite if the Growth function for this set of functions is linear.
We will say that the VC dimension of the set of indicator functions Q(z, α), α ∈ Λ is finite and equals h if the corresponding Growth function is bounded by a logarithmic function with coefficient h.
Since the inequalities

$$H^{\Lambda}(\ell)\le H_{\rm ann}^{\Lambda}(\ell)\le G^{\Lambda}(\ell)$$

are valid, the finiteness of the VC dimension of the set of indicator functions implemented by a learning machine is a sufficient condition for consistency of the ERM method independent of the probability measure. Moreover, a finite VC dimension implies a fast rate of convergence.
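Behind Theorem 3.3 stands the combinatorial fact that a set of VC dimension h induces at most Σ_{i=0}^{h} (ℓ choose i) separations of ℓ points. The sketch below (an illustration, not from the book) checks numerically that the logarithm of this count never exceeds the bound h(ln(ℓ/h) + 1) for ℓ > h.

```python
import math

def ln_count(l, h):
    """ln of the maximal number of dichotomies of l points allowed
    by a set of VC dimension h: at most sum_{i=0}^{h} C(l, i)."""
    return math.log(sum(math.comb(l, i) for i in range(h + 1)))

h = 5
for l in (10, 50, 200, 1000):
    bound = h * (math.log(l / h) + 1)
    print(l, round(ln_count(l, h), 2), round(bound, 2))  # ln-count <= bound
```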

FIGURE 3.2. The Growth function is either linear or bounded by the logarithmic function h(ln(ℓ/h) + 1). It cannot, for example, behave like the dashed line.

Finiteness of the VC dimension is also a necessary and sufficient condition for distribution-independent consistency of ERM learning machines. The following assertion holds true (Vapnik and Chervonenkis, 1974):
If uniform convergence of the frequencies to their probabilities over some set of events (set of indicator functions) is valid for any distribution function F(z), then the VC dimension of the set of functions is finite.

3.6 THE VC DIMENSION OF A SET OF FUNCTIONS

Below we give an equivalent definition of the VC dimension for sets of indicator functions and then generalize this definition for sets of real functions. These definitions stress the method of evaluating the VC dimension.
The VC dimension of a set of indicator functions (Vapnik and Chervonenkis, 1968, 1971)
The VC dimension of a set of indicator functions Q(z, α), α ∈ Λ, is the maximum number h of vectors z₁, …, z_h that can be separated into two classes in all 2^h possible ways using functions of the set³ (i.e., the maximum number of vectors that can be shattered by the set of functions). If for any n there exists a set of n vectors which can be shattered by the set Q(z, α), α ∈ Λ, then the VC dimension is equal to infinity.
The VC dimension of a set of real functions (Vapnik, 1979)

³Any indicator function separates a given set of vectors into two subsets: the subset of vectors for which this indicator function takes the value 0 and the subset of vectors for which this indicator function takes the value 1.

Let A ≤ Q(z, α) ≤ B, α ∈ Λ, be a set of real functions bounded by constants A and B (A can be −∞ and B can be ∞).
Let us consider, along with the set of real functions Q(z, α), α ∈ Λ, the set of indicators (Fig. 3.1)

I(z, α, β) = θ{Q(z, α) − β}, α ∈ Λ, β ∈ (A, B), (3.22)

where θ(z) is the step function

$$\theta(z)=\begin{cases}0 & \text{if } z<0,\\ 1 & \text{if } z\ge 0.\end{cases}$$

The VC dimension of a set of real functions Q(z, α), α ∈ Λ, is defined to be the VC dimension of the set of corresponding indicators (3.22) with parameters α ∈ Λ and β ∈ (A, B).
Example 1.
(i) The VC dimension of the set of linear indicator functions

$$Q(z,\alpha)=\theta\left(\sum_{p=1}^{n}\alpha_p z_p+\alpha_0\right)$$

in n-dimensional coordinate space Z = (z₁, …, z_n) is equal to h = n + 1, since by using functions of this set one can shatter at most n + 1 vectors (Fig. 3.3).

FIGURE 3.3. The VC dimension of the lines in the plane is equal to three, since they can shatter three vectors, but not four: the vectors z₂, z₄ cannot be separated by a line from the vectors z₁, z₃.

(ii) The VC dimension of the set of linear functions

$$Q(z,\alpha)=\sum_{p=1}^{n}\alpha_p z_p+\alpha_0,\qquad \alpha_0,\ldots,\alpha_n\in(-\infty,\infty),$$

in n-dimensional coordinate space Z = (z₁, …, z_n) is equal to h = n + 1, because the VC dimension of the corresponding linear indicator functions is equal to n + 1 (note: using α₀ − β instead of α₀ does not change the set of indicator functions).
Note that for the set of linear functions the VC dimension equals the number of free parameters α₀, α₁, …, α_n. In the general case this is not true.
Example 2.
(i) The VC dimension of the following set of functions:

$$f(z,\alpha)=\theta(\sin\alpha z),\qquad \alpha\in R^1,$$

is infinite: the following points on the line

$$z_1=10^{-1},\ldots,z_\ell=10^{-\ell}$$

can be shattered by functions from this set.
Indeed, to separate these data into two classes determined by the sequence

$$\delta_1,\ldots,\delta_\ell,\qquad \delta_i\in\{0,1\},$$

it is sufficient to choose the value of the parameter

$$\alpha=\pi\left(\sum_{i=1}^{\ell}(1-\delta_i)10^i+1\right).$$

This example reflects the fact that by choosing an appropriate coefficient α one can, for any number of appropriately chosen points, approximate the values of any function bounded by (−1, +1) using sin αz (Fig. 3.4).
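This construction can be checked directly. The sketch below (an illustration, not from the book) verifies that for ℓ = 4 the stated choice of α reproduces every one of the 2^ℓ labelings of the points z_i = 10^{-i}.

```python
import itertools
import math

l = 4
zs = [10.0 ** (-i) for i in range(1, l + 1)]

ok = True
for delta in itertools.product((0, 1), repeat=l):
    # The alpha prescribed in Example 2 for the target labeling delta.
    alpha = math.pi * (sum((1 - d) * 10 ** i for i, d in enumerate(delta, 1)) + 1)
    labels = tuple(int(math.sin(alpha * z) >= 0) for z in zs)
    ok &= labels == delta
print(ok)   # True: all 2^l labelings are realized, so the points are shattered
```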
In Chapter 5 we will consider a set of functions for which the VC dimension is much less than the number of parameters.
Thus, generally speaking, the VC dimension of a set of functions does not coincide with the number of parameters. It can be both larger than the number of parameters (as in Example 2) and smaller than the number of parameters (we will use sets of functions of this type in Chapter 5 for constructing a new type of learning machine).
In the next section we will see that the VC dimension of the set of functions (rather than the number of parameters) is responsible for the generalization ability of learning machines. This opens remarkable opportunities to overcome the "curse of dimensionality": to generalize well on the basis of a set of functions containing a huge number of parameters but possessing a small VC dimension.

FIGURE 3.4. Using a high frequency function sin(αz), one can approximate well the value of any function −1 ≤ f(z) ≤ 1 at any ℓ points.

3.7 CONSTRUCTIVE DISTRIBUTION-INDEPENDENT BOUNDS

In this section we present the bounds on the risk functional that in Chapter 4 we use for constructing the methods for controlling the generalization ability of learning machines.
Consider sets of functions which possess a finite VC dimension h. In this case Theorem 3.3 states that the bound

$$G^{\Lambda,\mathcal{B}}(\ell)\le h\left(\ln\frac{\ell}{h}+1\right) \qquad (3.23)$$

holds. Therefore, in all inequalities of Section 3.3 the following constructive expression can be used:

$$\mathcal{E}=4\,\frac{h\left(\ln\frac{2\ell}{h}+1\right)-\ln(\eta/4)}{\ell}. \qquad (3.24)$$

We will also consider the case where the set of loss functions Q(z, α), α ∈ Λ contains a finite number N of elements. For this case one can use the expression

$$\mathcal{E}=\frac{2\ln N-\ln\eta}{\ell}. \qquad (3.25)$$

Thus, the following constructive bounds hold true, where in the case of finite VC dimension one uses the expression for 𝓔 given in Eq. (3.24), and in the case of a finite number of functions in the set one uses the expression for 𝓔 given in Eq. (3.25).
Case 1. The set of totally bounded functions.
Let A ≤ Q(z, α) ≤ B, α ∈ Λ be a set of totally bounded functions. Then

(A) The following inequalities hold with probability of at least 1 − η simultaneously for all functions Q(z, α), α ∈ Λ (including the function that minimizes the empirical risk):

$$R(\alpha)\le R_{\rm emp}(\alpha)+\frac{B-A}{2}\sqrt{\mathcal{E}}, \qquad (3.26)$$

$$R(\alpha)\ge R_{\rm emp}(\alpha)-\frac{B-A}{2}\sqrt{\mathcal{E}}.$$

(B) The following inequality holds with probability of at least 1 − 2η for the function Q(z, α_ℓ) that minimizes the empirical risk:

$$R(\alpha_\ell)-\inf_{\alpha\in\Lambda}R(\alpha)\le(B-A)\sqrt{\frac{-\ln\eta}{2\ell}}+\frac{B-A}{2}\sqrt{\mathcal{E}}. \qquad (3.27)$$

Case 2. The set of totally bounded non-negative functions.
Let 0 ≤ Q(z, α) ≤ B, α ∈ Λ be a set of non-negative bounded functions. Then

(A) The following inequality holds with probability of at least 1 − η simultaneously for all functions Q(z, α) ≤ B, α ∈ Λ (including the function that minimizes the empirical risk):

$$R(\alpha)\le R_{\rm emp}(\alpha)+\frac{B\mathcal{E}}{2}\left(1+\sqrt{1+\frac{4R_{\rm emp}(\alpha)}{B\mathcal{E}}}\right). \qquad (3.28)$$

(B) The following inequality holds with probability of at least 1 − 2η for the function Q(z, α_ℓ) that minimizes the empirical risk:

(3.29)
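To get a feeling for the magnitudes involved, one can evaluate the constructive expression (3.24) and the right-hand side of bound (3.28) numerically. The sketch below (an illustration with assumed values of ℓ, h, η, B, and the empirical risk) does this.

```python
import math

def epsilon(l, h, eta):
    """Constructive expression (3.24) for a set with VC dimension h."""
    return 4.0 * (h * (math.log(2.0 * l / h) + 1) - math.log(eta / 4.0)) / l

def risk_bound(remp, l, h, eta, B=1.0):
    """Right-hand side of bound (3.28) for bounded non-negative losses."""
    e = epsilon(l, h, eta)
    return remp + (B * e / 2.0) * (1 + math.sqrt(1 + 4 * remp / (B * e)))

for l in (1000, 10000, 100000):
    print(l, round(risk_bound(remp=0.05, l=l, h=10, eta=0.05), 4))
```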

Case 3. The set of unbounded non-negative functions.
Finally, consider the set of unbounded non-negative functions 0 ≤ Q(z, α), α ∈ Λ.

(A) With probability of at least 1 − η the inequality

$$R(\alpha)\le\frac{R_{\rm emp}(\alpha)}{\left(1-a(p)\tau\sqrt{\mathcal{E}}\right)_{+}}, \qquad (3.30)$$

$$a(p)=\sqrt[p]{\frac{1}{2}\left(\frac{p-1}{p-2}\right)^{p-1}},$$

holds true simultaneously for all functions satisfying Eq. (3.19), where (u)₊ = max(u, 0).

(B) With probability of at least 1 − 2η the inequality

(3.31)

holds for the function Q(z, α_ℓ) that minimizes the empirical risk.

These bounds cannot be significantly improved.⁴

3.8 THE PROBLEM OF CONSTRUCTING RIGOROUS (DISTRIBUTION-DEPENDENT) BOUNDS

To construct rigorous bounds on the risk one has to take into account information about the probability measure. Let P₀ be the set of all probability measures on Z and let P ⊂ P₀ be a subset of the set P₀. We say that one has a priori information about the unknown probability measure F(z) if one knows a set of measures P that contains F(z).
Consider the following generalization of the Growth function:

$$\mathcal{G}_{\mathcal{P}}^{\Lambda}(\ell)=\ln\sup_{P\in\mathcal{P}}E_P\,N^{\Lambda}(z_1,\ldots,z_\ell).$$

For the extreme case where P = P₀, the Generalized Growth function 𝒢_P^Λ(ℓ) coincides with the Growth function G^Λ(ℓ), because the measure that assigns probability one to z₁, …, z_ℓ is contained in P. For the other extreme case, where P contains only one function F(z), the Generalized Growth function coincides with the Annealed VC entropy.

⁴There exist lower bounds on the rate of uniform convergence where the order of magnitude is close to the order of magnitude obtained for the upper bounds (√(h/ℓ) in the lower bounds instead of √((h/ℓ) ln(ℓ/h)) in the upper bounds; see (Vapnik and Chervonenkis, 1974) for lower bounds).

The rigorous bounds for the risk can be derived in terms of the Generalized Growth function.
They have the same functional form as the distribution-independent bounds (3.15), (3.17), and (3.21), but a different expression for 𝓔. The new expression for 𝓔 is

$$\mathcal{E}=4\,\frac{\mathcal{G}_{\mathcal{P}}^{\Lambda}(2\ell)-\ln(\eta/4)}{\ell}.$$

However, these bounds are nonconstructive because no general methods have yet been found to evaluate the Generalized Growth function (in contrast to the original Growth function, where constructive bounds were obtained on the basis of the VC dimension of the set of functions).
To find rigorous constructive bounds one has to find a way of evaluating the Generalized Growth function for different sets P of probability measures. The main problem here is to find a subset P different from P₀ for which the Generalized Growth function can be evaluated on the basis of some constructive concepts (much as the Growth function was evaluated using the VC dimension of the set of functions).
Informal Reasoning and Comments - 3

A particular case of the bounds obtained in this chapter was already under investigation in classical statistics. These are known as the Kolmogorov-Smirnov distributions, widely used in both applied and theoretical statistics.
The bounds obtained in learning theory are different from the classical ones in two respects:
(i) They are more general (they are valid for any set of indicator functions with finite VC dimension).
(ii) They are valid for a finite number of observations (the classical bounds are asymptotic).

3.9 KOLMOGOROV-SMIRNOV DISTRIBUTIONS

As soon as the Glivenko-Cantelli theorem became known, Kolmogorov obtained asymptotically exact estimates on the rate of uniform convergence of the empirical distribution function to the actual one (Kolmogorov, 1933). He proved that if the distribution function F(z) of a scalar random variable is continuous and if ℓ is sufficiently large, then for any ε > 0 the following equality holds:

$$P\left\{\sqrt{\ell}\,\sup_z|F(z)-F_\ell(z)|>\varepsilon\right\}=2\sum_{k=1}^{\infty}(-1)^{k-1}\exp\{-2\varepsilon^2k^2\}. \qquad (3.32)$$

This equality describes one of the main statistical laws, according to which the distribution of the random variable

$$\xi_\ell=\sup_z|F(z)-F_\ell(z)|$$

does not depend on the distribution function F(z) and has the form of Eq. (3.32).
Simultaneously, Smirnov found the distribution function for one-sided deviations of the empirical distribution function from the actual one (Smirnov, 1933). He proved that for continuous F(z) and sufficiently large ℓ the following equalities asymptotically hold:

$$P\left\{\sup_z(F(z)-F_\ell(z))>\varepsilon\right\}=\exp\{-2\varepsilon^2\ell\},$$

$$P\left\{\sup_z(F_\ell(z)-F(z))>\varepsilon\right\}=\exp\{-2\varepsilon^2\ell\}.$$

The random variables

$$\xi_1=\sqrt{\ell}\,\sup_x|F(x)-F_\ell(x)|,$$

$$\xi_2=\sqrt{\ell}\,\sup_x(F(x)-F_\ell(x))$$

are called the Kolmogorov-Smirnov statistics.
When the Glivenko-Cantelli theorem was generalized for multidimensional distribution functions,⁵ it was proved that for any ε > 0 there exists a sufficiently large ℓ₀ such that for ℓ > ℓ₀ the inequality

$$P\left\{\sup_z|F(z)-F_\ell(z)|>\varepsilon\right\}<e^{-a\varepsilon^2\ell}$$

holds true, where a is any constant smaller than 2.


The results obtained in learning theory generalize the results of Kol-
mogorov and Smirnov in two directions:
(i) The obtained bounds are valid for any set of events (not only for sets
ofrays, as in the Glivenko-Cantelli case).
(ii) The obtained bounds are valid for any £ (not only asymptotically for
sufficiently large £).

⁵For an n-dimensional vector space Z the distribution function of the random vectors z = (z¹, …, zⁿ) is determined as follows:

F(z) = P{z¹ < z₁, …, zⁿ < z_n}.

The empirical distribution function F_ℓ(z) estimates the frequency of (occurrence of) the event A_z = {z¹ < z₁, …, zⁿ < z_n}.
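A quick simulation (an illustration, not from the text) compares the empirical tail of the Kolmogorov-Smirnov statistic with the series of Eq. (3.32) as reconstructed above; the sample size, threshold, and number of repetitions are arbitrary choices made for this example.

```python
import math
import random

random.seed(5)

def ks_statistic(l):
    """sqrt(l) * sup_z |F(z) - F_l(z)| for U[0,1] data, where F(z) = z."""
    zs = sorted(random.random() for _ in range(l))
    d = max(max(z - i / l, (i + 1) / l - z) for i, z in enumerate(zs))
    return math.sqrt(l) * d

def kolmogorov_tail(eps, terms=100):
    """2 * sum_{k>=1} (-1)^(k-1) exp(-2 eps^2 k^2), as in Eq. (3.32)."""
    return 2 * sum((-1) ** (k - 1) * math.exp(-2 * eps * eps * k * k)
                   for k in range(1, terms + 1))

l, eps, repeats = 1000, 1.0, 2000
empirical = sum(ks_statistic(l) > eps for _ in range(repeats)) / repeats
print(empirical, kolmogorov_tail(eps))   # both close to about 0.27
```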

3.10 RACING FOR THE CONSTANT

Note that the results obtained in learning theory have the form of inequalities, rather than equalities as obtained for a particular case by Kolmogorov and Smirnov. For this particular case it is possible to evaluate how close to the exact values the obtained general bounds are.
Let Q(z, α), α ∈ Λ be the set of indicator functions with VC dimension h. Let us rewrite the bound (3.3) in the form

$$P\left\{\sup_{\alpha\in\Lambda}\left|\int Q(z,\alpha)\,dF(z)-\frac{1}{\ell}\sum_{i=1}^{\ell}Q(z_i,\alpha)\right|>\varepsilon\right\}<4\exp\left\{-\left(a\varepsilon^2-\frac{h\left(\ln\frac{2\ell}{h}+1\right)}{\ell}\right)\ell\right\}, \qquad (3.33)$$

where the coefficient a equals one. In the Glivenko-Cantelli case (for which the Kolmogorov-Smirnov bounds are valid) we actually consider a set of indicator functions Q(z, α) = θ(z − α). (For these indicator functions

$$F(\alpha)=\int\theta(z-\alpha)\,dF(z),$$

where z₁, …, z_ℓ is i.i.d. data.) Note that for this set of indicator functions the VC dimension is equal to one: using indicators of rays (with one direction) one can shatter only one point. Therefore, for a sufficiently large ℓ, the second term in the parentheses of the exponent on the right-hand side of Eq. (3.33) is arbitrarily small and the bound is determined by the first term in the exponent. This term in the general formula coincides with the (main) term in the Kolmogorov-Smirnov formulas up to a constant: instead of a = 1 the Kolmogorov-Smirnov bounds have the constant⁶ a = 2.
In 1988 Devroye found a way to obtain a nonasymptotic bound with the constant a = 2 (Devroye, 1988). However, in the exponent of the right-hand side of this bound the second term is

instead of

$$\frac{h\left(\ln\frac{2\ell}{h}+1\right)}{\ell}. \qquad (3.34)$$

For the case that is important in practice, namely, where

−ln η < h(ln h − 1),

the bound with coefficient a = 1 and term (3.34) described in this chapter is better.

⁶In the first result obtained in 1968 the constant was a = 1/8 (Vapnik and Chervonenkis, 1968, 1971); then in 1979 it was improved to a = 1/4 (Vapnik, 1979); in 1991 L. Bottou showed me a proof with a = 1. This bound also was obtained by J. M. Parrondo and C. Van den Broeck (Parrondo and Van den Broeck, 1993).

3.11 BOUNDS ON EMPIRICAL PROCESSES

The bounds obtained for sets of real functions are generalizations of the bounds obtained for sets of indicator functions. These generalizations were obtained on the basis of a generalized concept of VC dimension that was constructed for sets of real functions.
There exist, however, several ways to construct a generalization of the VC dimension concept for sets of real functions that allow us to derive the corresponding bounds.
One of these generalizations is based on the concept of a VC subgraph introduced by Dudley (Dudley, 1978) (in the AI literature, this concept was renamed pseudo-dimension). Using the VC subgraph concept, Dudley obtained a bound on the metric ε-entropy for sets of bounded real functions. On the basis of this bound, Pollard derived a bound for the rate of uniform convergence of the means to their expectations (Pollard, 1984). This bound was used by Haussler for learning machines.⁷
Note that the VC dimension concept for sets of real functions described in this chapter forms a slightly stronger requirement on the capacity of the set of functions than Dudley's VC subgraph. On the other hand, using the VC dimension concept one obtains more attractive bounds:

(i) They have a form that has a clear physical sense (they depend on the ratio ℓ/h).

(ii) More importantly, using this concept one can obtain bounds on uniform relative convergence for sets of bounded functions as well as for sets of unbounded functions. The rate of uniform convergence (or uniform relative convergence) of the empirical risks to actual risks for an unbounded set of loss functions is the basis for the analysis of the regression problem.

⁷D. Haussler (1992), "Decision theoretic generalizations of the PAC model for neural net and other learning applications," Inform. Comput. 100 (1), pp. 78-150.

The bounds for uniform relative convergence have no analogy in classical


statistics. They were derived for the first time in learning theory to obtain
rigorous bounds on the risk.
Chapter 4
Controlling the Generalization Ability of Learning Processes

The theory for controlling the generalization ability of learning machines is devoted to constructing an inductive principle for minimizing the risk functional using a small sample of training instances.
The sample size ℓ is considered to be small if the ratio ℓ/h (the ratio of the number of training patterns to the VC dimension of the functions of a learning machine) is small, say ℓ/h < 20.
To construct small sample size methods we use both the bound for the generalization ability of learning machines with sets of totally bounded non-negative functions,

$$R(\alpha_\ell)\le R_{\rm emp}(\alpha_\ell)+\frac{B\mathcal{E}}{2}\left(1+\sqrt{1+\frac{4R_{\rm emp}(\alpha_\ell)}{B\mathcal{E}}}\right), \qquad (4.1)$$

and the bound for the generalization ability of learning machines with sets of unbounded functions,

$$R(\alpha_\ell)\le\frac{R_{\rm emp}(\alpha_\ell)}{\left(1-a(p)\tau\sqrt{\mathcal{E}}\right)_{+}},\qquad a(p)=\sqrt[p]{\frac{1}{2}\left(\frac{p-1}{p-2}\right)^{p-1}}, \qquad (4.2)$$

where

$$\mathcal{E}=\frac{2\ln N-\ln\eta}{\ell}$$

if the set of functions Q(z, α_i), i = 1, …, N, contains N elements, and

$$\mathcal{E}=4\,\frac{h\left(\ln\frac{2\ell}{h}+1\right)-\ln(\eta/4)}{\ell}$$

if the set of functions Q(z, α), α ∈ Λ contains an infinite number of elements and has a finite VC dimension h. Each bound is valid with probability of at least 1 − η.

4.1 STRUCTURAL RISK MINIMIZATION (SRM) INDUCTIVE PRINCIPLE

The ERM principle is intended for dealing with large sample sizes. It can be justified by considering the inequalities (4.1) or (4.2).
When ℓ/h is large, 𝓔 is small. Therefore, the second summand on the right-hand side of inequality (4.1) (the second summand in the denominator of (4.2)) becomes small. The actual risk is then close to the value of the empirical risk. In this case, a small value of the empirical risk guarantees a small value of the (expected) risk.
However, if ℓ/h is small, a small R_emp(α_ℓ) does not guarantee a small value of the actual risk. In this case, to minimize the actual risk R(α) one has to minimize the right-hand side of inequality (4.1) (or (4.2)) simultaneously over both terms. Note, however, that the first term in inequality (4.1) depends on a specific function of the set of functions, while the second term depends on the VC dimension of the whole set of functions. To minimize the right-hand side of the bound on the risk, (4.1) (or (4.2)), simultaneously over both terms, one has to make the VC dimension a controlling variable.
The following general principle, which is called the Structural Risk Minimization (SRM) inductive principle, is intended to minimize the risk functional with respect to both terms, the empirical risk and the confidence interval (Vapnik and Chervonenkis, 1974).
Let the set S of functions Q(z, α), α ∈ Λ, be provided with a structure consisting of nested subsets of functions S_k = {Q(z, α), α ∈ Λ_k}, such that (Fig. 4.1)

S₁ ⊂ S₂ ⊂ ⋯ ⊂ S_n ⊂ ⋯, (4.3)

where the elements of the structure satisfy the following two properties:

(i) The VC dimension h_k of each set S_k of functions is finite.¹ Therefore

h₁ ≤ h₂ ≤ ⋯ ≤ h_n ≤ ⋯.

¹However, the VC dimension of the set S can be infinite.



FIGURE 4.1. A structure on the set of functions is determined by the nested subsets of functions.

(ii) Any element S_k of the structure contains either

a set of totally bounded functions,

0 ≤ Q(z, α) ≤ B_k, α ∈ Λ_k,

or a set of functions satisfying the inequality

$$\sup_{\alpha\in\Lambda_k}\frac{\left(\int Q^p(z,\alpha)\,dF(z)\right)^{1/p}}{\int Q(z,\alpha)\,dF(z)}\le\tau_k,\qquad p>2, \qquad (4.4)$$

for some pair (p, τ_k).

We call this structure an admissible structure.
For a given set of observations z₁, …, z_ℓ the SRM principle chooses the function Q(z, α_ℓ^k) minimizing the empirical risk in the subset S_k for which the guaranteed risk (determined by the right-hand side of inequality (4.1) or by the right-hand side of inequality (4.2), depending on circumstances) is minimal.
The SRM principle defines a trade-off between the quality of the approximation of the given data and the complexity of the approximating function. As the subset index n increases, the minima of the empirical risks decrease; however, the term responsible for the confidence interval (the second summand in inequality (4.1) or the multiplier in inequality (4.2) (Fig. 4.2)) increases. The SRM principle takes both factors into account by choosing the subset S_n for which minimizing the empirical risk yields the best bound on the actual risk.

FIGURE 4.2. The bound on the risk is the sum of the empirical risk and the confidence interval. The empirical risk decreases with the index of the element of the structure, while the confidence interval increases. The smallest bound on the risk is achieved on some appropriate element of the structure.

4.2 ASYMPTOTIC ANALYSIS OF THE RATE OF


CONVERGENCE

Denote by S* the set of functions

S* = ⋃_{k=1}^∞ Sk.

Suppose that the set of functions S* is everywhere dense² in S (recall
S = {Q(z, α), α ∈ Λ}) with respect to the metric

ρ(Q(z, α1), Q(z, α2)) = ∫ |Q(z, α1) − Q(z, α2)| dF(z).

For an asymptotic analysis of the SRM principle one considers a law determining,
for any given ℓ, the number

n = n(ℓ) (4.5)

of the element Sn of the structure (4.3) in which we will minimize the
empirical risk. The following theorem holds true.

Theorem 4.1. The SRM method provides approximations Q(z, α_ℓ^{n(ℓ)})
for which the sequence of risks R(α_ℓ^{n(ℓ)}) converges to the smallest risk

R(α0) = inf_{α∈Λ} ∫ Q(z, α) dF(z)

with asymptotic rate of convergence³

V(ℓ) = r_{n(ℓ)} + T_{n(ℓ)} √( h_{n(ℓ)} ln ℓ / ℓ ) (4.6)

if the law n = n(ℓ) is such that

lim_{ℓ→∞} T²_{n(ℓ)} h_{n(ℓ)} ln ℓ / ℓ = 0, (4.7)

where

where

(i) Tn = Bn if one considers a structure with totally bounded functions


Q(z, 0:) ~ Bn in subsets Sn, and

²The set of functions R(z, β), β ∈ B, is everywhere dense in the set
Q(z, α), α ∈ Λ, in the metric ρ(Q, R) if for any ε > 0 and for any Q(z, α*)
one can find a function R(z, β*) such that the inequality

ρ(Q(z, α*), R(z, β*)) ≤ ε

holds true.
³We say that the random variables ξℓ, ℓ = 1, 2, … converge to the value ξ0
with asymptotic rate V(ℓ) if there exists a constant C such that

P{ |ξℓ − ξ0| > C V(ℓ) } → 0 as ℓ → ∞.

(ii) Tn = τn if one considers a structure with elements satisfying the
inequality (4.4);

r_{n(ℓ)} is the rate of approximation

rn = inf_{α∈Λn} ∫ Q(z, α) dF(z) − inf_{α∈Λ} ∫ Q(z, α) dF(z). (4.8)

To provide the best rate of convergence one has to know the rate of
approximation rn for the chosen structure. The problem of estimating rn
for different structures on sets of functions is the subject of classical function
approximation theory. We will discuss this problem in the next section. If one
knows the rate of approximation rn one can a priori find the law n = n(ℓ)
that provides the best asymptotic rate of convergence by minimizing the
right-hand side of equality (4.6).
Example. Let Q(z, α), α ∈ Λ, be a set of functions satisfying the inequality
(4.4) for p > 2 with τk ≤ τ* < ∞. Consider a structure for which
n = hn. Let the asymptotic rate of approximation be described by the law

rn = (1/n)^c.

(This law describes the main classical results in approximation theory;
see the next section.) Then the asymptotic rate of convergence reaches its
maximum value if

n(ℓ) = [ℓ / ln ℓ]^{1/(2c+1)},

where [a] is the integer part of a. The asymptotic rate of convergence is

V(ℓ) = (ln ℓ / ℓ)^{c/(2c+1)}. (4.9)
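A small numeric illustration of this law in Python, under the reconstruction rn = n^{−c} used above; the function names are illustrative.

    import math

    def optimal_element(ell, c):
        """Element index n(l) = [(l / ln l)^(1/(2c+1))], balancing the
        approximation term n^(-c) against the stochastic term."""
        return int((ell / math.log(ell)) ** (1.0 / (2.0 * c + 1.0)))

    def asymptotic_rate(ell, c):
        """Asymptotic rate V(l) = (ln l / l)^(c/(2c+1)) of Eq. (4.9)."""
        return (math.log(ell) / ell) ** (c / (2.0 * c + 1.0))

    # Smoother target (larger c) allows a smaller element and a faster rate.
    for ell in (10**3, 10**4, 10**5):
        print(ell, optimal_element(ell, c=1.0), asymptotic_rate(ell, c=1.0))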

4.3 THE PROBLEM OF FUNCTION APPROXIMATION


IN LEARNING THEORY

The attractive properties of the asymptotic theory of the rate of convergence
described in Theorem 4.1 are that one can a priori (before the learning
process begins) find the law n = n(ℓ) which provides the best (asymptotic)
rate of convergence, and that one can a priori estimate the value of the
asymptotic rate of convergence.⁴ The rate depends on the construction of

⁴Note, however, that a high asymptotic rate of convergence does not necessarily
reflect a high rate of convergence on a limited sample size.

the admissible structure (on the sequence of pairs (hn, τn), n = 1, 2, …)
and also depends on the rate of approximation rn, n = 1, 2, ….
On the basis of this information one can evaluate the rate of convergence
by minimizing Eq. (4.6). Note that in Eq. (4.6) the second term,
which is responsible for the stochastic behavior of the learning processes, is
determined by nonasymptotic bounds on the risk (see Eqs. (4.1) and (4.2)).
The first term (which describes the deterministic component of the learning
processes) usually has only an asymptotic bound, however.
Classical approximation theory studies connections between the smoothness
properties of functions and the rate of approximation of the function
by the structure with elements Sn containing polynomials (algebraic or
trigonometric) of degree n, or expansions in other series with n terms. Usually,
smoothness of an unknown function is characterized by the number s
of existing derivatives. Typical results on the asymptotic rate of approximation
have the form

rn ≤ c n^{−s/N}, (4.10)

where N is the dimensionality of the input space (Lorentz, 1966). Note that
this implies that a high asymptotic rate of convergence⁵ in high-dimensional
spaces can be guaranteed only for very smooth functions.
In learning theory, we would like to find the rate of approximation in the
following case:

(i) Q(z, α), α ∈ Λ, is a set of high-dimensional functions.

(ii) The elements Sk of the structure are not necessarily linear manifolds.
(They can be any set of functions with finite VC dimension.)

Furthermore, we are interested in the cases where the rate of approxi-


mation is high.
Therefore, in learning theory we face the problem of describing the cases
for which a high rate of approximation is possible. This requires describing
different sets of "smooth" functions and structures for these sets which
provide the bound O(1/√n) for rn (i.e., a fast rate of convergence).

In 1989 Cybenko proved that using a superposition of sigmoid functions


(neurons) one can approximate any smooth function (Cybenko, 1989).
In 1992-1993 Jones, Barron, and Breiman described structures on different
sets of functions that have a fast rate of approximation (Jones, 1992),
(Barron, 1993), and (Breiman, 1993).
They considered the following concept of smooth functions. Let {f(x)}
be a set of functions and let {f̄(ω)} be the set of their Fourier transforms.

⁵Let the rate of convergence be considered high if rn ≤ n^{−1/2}.



Let us characterize the smoothness of the function f(x) by the quantity

b_d = ∫ |ω|^d |f̄(ω)| dω,   d ≥ 0. (4.11)

In terms of this concept the following theorem on the rate of approximation
rn holds true:

Theorem 4.2. (Jones, Barron, and Breiman) Let the set of functions
f(x) satisfy Eq. (4.11). Then the rate of approximation of the desired functions
by the best function of the elements of the structure is bounded by
O(1/√n) if one of the following holds:

(i) The set of functions {f(x)} is determined by Eq. (4.11) with d = 0
and the elements Sn of the structure contain the functions

f(x, α, w, v) = Σ_{i=1}^n αi sin[(x · wi) + vi], (4.12)

where αi and vi are arbitrary values and wi are arbitrary vectors
(Jones, 1992).

(ii) The set of functions {f(x)} is determined by Eq. (4.11) with
d = 1 and the elements Sn of the structure contain the functions

f(x, α, w, v) = Σ_{i=1}^n αi S[(x · wi) + vi], (4.13)

where αi and vi are arbitrary values, wi are arbitrary vectors, and
S(u) is a sigmoid function (a monotonically increasing function such
that lim_{u→−∞} S(u) = −1, lim_{u→+∞} S(u) = 1)
(Barron, 1993).

(iii) The set of functions {f(x)} is determined by Eq. (4.11) with d = 2
and the elements Sn of the structure contain the functions

f(x, α, w, v) = Σ_{i=1}^n αi |(x · wi) + vi|₊,   |u|₊ = max(0, u), (4.14)

where αi and vi are arbitrary values and wi are arbitrary vectors
(Breiman, 1993).
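The following minimal Python sketch builds one element Sn of the structure in case (ii), a sum of n sigmoids as in (4.13). For simplicity only the outer coefficients αi are fitted, by least squares, while the inner parameters wi, vi are drawn at random; this is an illustration device, not part of the theorem.

    import numpy as np

    rng = np.random.default_rng(0)

    def fit_sigmoid_sum(X, y, n_terms=20):
        """Fit f(x) = sum_i alpha_i * S((x . w_i) + v_i), with S = tanh,
        an element S_n of the structure in case (ii) of Theorem 4.2.
        Only the outer coefficients alpha are fitted; w_i, v_i are random."""
        W = rng.normal(size=(n_terms, X.shape[1]))
        v = rng.normal(size=n_terms)
        Phi = np.tanh(X @ W.T + v)            # features S((x . w_i) + v_i)
        alpha, *_ = np.linalg.lstsq(Phi, y, rcond=None)
        return lambda Xnew: np.tanh(Xnew @ W.T + v) @ alpha

    # Toy usage: approximate a smooth one-dimensional function.
    X = np.linspace(-3, 3, 200).reshape(-1, 1)
    y = np.sin(X).ravel()
    f = fit_sigmoid_sum(X, y, n_terms=30)
    print(np.max(np.abs(f(X) - y)))           # small approximation error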
In spite of the fact that in this theorem the concept of smoothness is
different from the number of bounded derivatives, one can observe a similar
phenomenon here as in the classical case: to keep a high rate of convergence
for a space with increasing dimensionality, one has to increase the smoothness
of the functions simultaneously as the dimensionality of the space is
increased. Using constraint (4.11) one attains this automatically. Girosi and
Anzellotti (Girosi and Anzellotti, 1993) observed that the set of functions
satisfying Eq. (4.11) with d = 1 and d = 2 can be rewritten as

f(x) = (1/|x|^{n−1}) * λ(x),   f(x) = (1/|x|^{n−2}) * λ(x),

where λ(x) is any function whose Fourier transform is integrable, and *
stands for the convolution operator. In these forms it becomes more apparent
that, due to the more rapid fall-off of the terms 1/|x|^{n−1}, functions satisfying
Eq. (4.11) become more and more constrained as the dimensionality
increases.
The same phenomenon is also clear in the results of Mhaskar (Mhaskar,
1992), who proved that the rate of convergence of approximation of functions
with s continuous derivatives by the structure (4.13) is O(n^{−s/N}).
Therefore, if the desired function is not very smooth one cannot guarantee
a high asymptotic rate of convergence of the approximating functions to the
unknown function.
In Section 4.5 we describe a new model of learning that is based on the
idea of local approximation of the desired function (instead of global, as
considered above). We consider the approximation of the desired function
in some neighborhood of the point of interest, where the radius of the
neighborhood can decrease with increasing number of observations.
The rate of local approximation can be higher than the rate of global
approximation, and this effect provides a better generalization ability of
the learning machine.

4.4 EXAMPLES OF STRUCTURES FOR NEURAL


NETS

The general principle of SRM can be implemented in many different ways.


Here we consider three different examples of structures built for the set of
functions implemented by a neural network.
1. A structure given by the architecture of the neural network.
Consider an ensemble of fully connected feed-forward neural networks
in which the number of units in one of the hidden layers is monotonically
increased. The sets of implementable functions define a structure as the
number of hidden units is increased (Fig. 4.3).
2. A structure given by the learning procedure.
Consider the set of functions S = {f(x, w), w ∈ W}, implementable by a
neural net of fixed architecture. The parameters {w} are the weights of the

FIGURE 4.3. A structure determined by the number of hidden units.

neural network. A structure is introduced through Sp = {f(x, w), ‖w‖ ≤ Cp}
and C1 < C2 < ⋯ < Cn. Under very general conditions on the set
of loss functions, the minimization of the empirical risk within the element
Sp of the structure is achieved through the minimization of

E(w, γp) = (1/ℓ) Σ_{i=1}^ℓ L(yi, f(xi, w)) + γp ‖w‖²

with appropriately chosen Lagrange multipliers γ1 > γ2 > ⋯ > γn. The
well-known "weight decay" procedure refers to the minimization of this
functional (see the sketch at the end of this section).
3. A structure given by preprocessing.
Consider a neural net with fixed architecture. The input representation is
modified by a transformation z = K(x, β), where the parameter β controls
the degree of degeneracy introduced by this transformation (β could for
instance be the width of a smoothing kernel).
A structure is introduced in the set of functions S = {f(K(x, β), w), w ∈ W}
through β ≥ Cp, and C1 > C2 > ⋯ > Cn.
To implement the SRM principle using these structures, one has to know
(estimate) the VC dimension of any element Sk of the structure, and has
to be able, for any Sk, to find the function which minimizes the empirical
risk.
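As an illustration of the second structure, the sketch below performs gradient steps on the weight-decay functional E(w, γp), assuming squared loss and, for brevity, the linear model f(x, w) = (w · x); the names are illustrative.

    import numpy as np

    def weight_decay_step(w, X, y, gamma_p, lr=0.1):
        """One gradient-descent step on the weight-decay functional
        E(w, gamma_p) = (1/l) sum (y_i - f(x_i, w))^2 + gamma_p ||w||^2,
        here with the linear model f(x, w) = (w . x) for brevity."""
        ell = len(y)
        residual = X @ w - y
        grad = (2.0 / ell) * (X.T @ residual) + 2.0 * gamma_p * w
        return w - lr * grad

    # A larger gamma_p confines w to a smaller ball {w : ||w|| <= C_p},
    # i.e., to an earlier (lower-capacity) element of the structure.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=100)
    w = np.zeros(5)
    for _ in range(500):
        w = weight_decay_step(w, X, y, gamma_p=0.01)
    print(np.round(w, 2))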

4.5 THE PROBLEM OF LOCAL FUNCTION


ESTIMATION

Let us consider a model of local risk minimization (in the neighborhood of


a given point xo) on the basis of empirical data. Consider a non-negative

FIGURE 4.4. Examples of vicinity functions: (a) shows a hard-threshold vicinity
function and (b) shows a soft-threshold vicinity function.

function K(x, x0; β) which embodies the concept of neighborhood. This
function depends on the point x0 and a "locality" parameter β ∈ (0, ∞)
and satisfies two conditions:

0 ≤ K(x, x0; β) ≤ 1,
K(x0, x0; β) = 1. (4.15)

For example, both the "hard threshold" vicinity function (Fig. 4.4(a))

K1(x, x0; β) = 1 if ‖x − x0‖ < β, and 0 otherwise, (4.16)

and the "soft threshold" vicinity function (Fig. 4.4(b))

(4.17)

meet these conditions.
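The sketch below implements the hard-threshold vicinity (4.16) and one possible soft-threshold vicinity; the Gaussian-type form used for the latter is an assumption chosen only to satisfy conditions (4.15).

    import numpy as np

    def hard_vicinity(x, x0, beta):
        """Hard-threshold vicinity K1 of Eq. (4.16): 1 inside the ball of
        radius beta around x0, 0 outside."""
        return 1.0 if np.linalg.norm(x - x0) < beta else 0.0

    def soft_vicinity(x, x0, beta):
        """One possible soft-threshold vicinity (an assumed Gaussian-type
        kernel, for illustration): it equals 1 at x0 and decays smoothly,
        so it satisfies conditions (4.15)."""
        return float(np.exp(-np.sum((x - x0) ** 2) / beta ** 2))

    x0 = np.zeros(2)
    for r in (0.1, 0.5, 2.0):
        x = np.array([r, 0.0])
        print(r, hard_vicinity(x, x0, 1.0), round(soft_vicinity(x, x0, 1.0), 3))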


Let us define the value

K(x0, β) = ∫ K(x, x0; β) dF(x). (4.18)

For the set of functions f(x, α), α ∈ Λ, let us consider the set of loss
functions Q(z, α) = L(y, f(x, α)), α ∈ Λ. Our goal is to minimize the local
risk functional

R(α, β; x0) = (1/K(x0, β)) ∫ L(y, f(x, α)) K(x, x0; β) dF(x, y) (4.19)

over both the set of functions f(x, α), α ∈ Λ, and different vicinities of
the point x0 (defined by the parameter β) in situations where the probability

measure F(x, y) is unknown, but we are given the independent identically
distributed examples

(x1, y1), …, (xℓ, yℓ).

Note that the problem of local risk minimization on the basis of empirical
data is a generalization of the problem of global risk minimization. (In the
latter problem we have to minimize the functional (4.19) with K(x, x0; β) ≡ 1.)
For the problem of local risk minimization one can generalize the bound
obtained for the problem of global risk minimization: with probability 1 − η,
simultaneously for all bounded functions A ≤ L(y, f(x, α)) ≤ B, α ∈ Λ, and
all functions 0 ≤ K(x, x0; β) ≤ 1, β ∈ (0, ∞), the inequality

R(α, β; x0) ≤ [ (1/ℓ) Σ_{i=1}^ℓ L(yi, f(xi, α)) K(xi, x0; β) + (B − A) ℰ(ℓ, hΣ) ]
             / ( (1/ℓ) Σ_{i=1}^ℓ K(xi, x0; β) − ℰ(ℓ, hβ) )₊ ,

ℰ(ℓ, h) = √( (h(ln(2ℓ/h) + 1) − ln(η/2)) / ℓ ),

holds true, where hΣ is the VC dimension of the set of functions
L(y, f(x, α)) K(x, x0; β), α ∈ Λ, β ∈ (0, ∞), and hβ is the VC dimension of
the set of functions K(x, x0; β) (Vapnik and Bottou, 1993).
Now using the SRM principle one can minimize the right-hand side of the
inequality over three parameters: the value of the empirical risk, the VC
dimension hΣ, and the value of the vicinity β (VC dimension hβ).
The local risk minimization approach has an advantage when on the basis
of the given structure on the set of functions, it is impossible to approximate
well the desired function using a given number of observations. However, it
may be possible to provide a reasonable local approximation to the desired
function at any point of interest (Fig. 4.5).

4.6 THE MINIMUM DESCRIPTION LENGTH (MDL)


AND SRM PRINCIPLES

Along with the SRM inductive principle, which is based on the statistical
analysis of the rate of convergence of empirical processes, there exists an-
other principle of inductive inference for small sample sizes, the so-called
Minimum Description Length (MDL) principle, that is based on an infor-
mation theoretic analysis of the randomness concept. In this section we
consider the MDL principle and point out the connections between the
SRM and the MDL principles for the pattern recognition problem.
In 1965 Kolmogorov defined a random string using the concept of algo-
rithmic complexity.


FIGURE 4.5. Using linear functions one can estimate the unknown smooth func-
tion in the vicinity of any point of interest.

He defined the algorithmic complexity of an object to be the length of


the shortest binary computer program that describes this object, and he
proved that the value of the algorithmic complexity, up to an additive con-
stant, does not depend on the type of computer. Therefore it is a universal
characteristic of the object.
The main idea of Kolmogorov is:

Consider the string describing an object to be random if the algorithmic


complexity of the object is high - that is, if the string which describes the
object cannot be compressed significantly.

Ten years after the concept of algorithmic complexity was introduced,


Rissanen suggested using Kolmogorov's concept as the main tool of inductive
inference of learning machines; he suggested the so-called MDL
principle⁶ (Rissanen, 1978).

⁶The use of algorithmic complexity as a general inductive principle
was considered by Solomonoff even before Kolmogorov suggested his model
of randomness. Therefore, the principle of descriptive complexity is called the
Solomonoff-Kolmogorov principle. However, only starting with Rissanen's work
was this principle considered as a tool for inference in learning theory.

4.6.1 The MDL Principle


Suppose that we are given a training set of pairs

(ω1, x1), …, (ωℓ, xℓ)

(pairs drawn randomly and independently according to some unknown
probability measure). Consider two strings: the binary string

ω1, …, ωℓ (4.20)

and the string of vectors

x1, …, xℓ. (4.21)
The question is:
Given (4.21) is the string (4.20) a random object?
To answer this question let us analyze the algorithmic complexity of
the string (4.20) in the spirit of Solomonoff-Kolmogorov's ideas. Since
ω1, …, ωℓ are binary valued, the string (4.20) is described by ℓ bits.
To determine the complexity of this string let us try to compress its
description. Since the training pairs were drawn randomly and independently,
the value ωi may depend only on the vector xi but not on the vector xj, i ≠ j (of
course, only if the dependency exists).
Consider the following model: suppose that we are given some fixed code-book
Cb with N ≪ 2^ℓ different tables Ti, i = 1, …, N. Any table Ti
describes some function⁷ from x to ω.
Let us try to find the table T in the code-book Cb that describes the
string (4.20) in the best possible way, namely, the table that on the given
string (4.21) returns the binary string

ω1*, …, ωℓ* (4.22)

for which the Hamming distance between string (4.20) and string (4.22) is
minimal (i.e., the number of errors in decoding string (4.20) by this table
T is minimal).
Suppose we found a perfect table T0 for which the Hamming distance
between the generated string (4.22) and string (4.20) is zero. This table
decodes the string (4.20).
Since the code-book Cb is fixed, to describe the string (4.20) it is sufficient
to give the number o of the table T0 in the code-book. The minimal number of

⁷Formally speaking, to get tables of finite length in the code-book, the input
vector x has to be discrete. However, as we will see, the number of levels in
quantization will not affect the bounds on generalization ability. Therefore one can
consider any degree of quantization, even giving tables with an infinite number
of entries.

bits needed to describe the number of any one of the N tables is ⌈log₂ N⌉, where
⌈A⌉ is the minimal integer that is not smaller than A. Therefore in this case,
to describe string (4.20), we need ⌈log₂ N⌉ (rather than ℓ) bits. Thus, using a
code-book with a perfect decoding table, we can compress the description
length of string (4.20) by a factor

K(T0) = ⌈log₂ N⌉ / ℓ. (4.23)

Let us call K(T) the coefficient of compression for the string (4.20).
Consider now the general case: the code-book Cb does not contain the
perfect table. Let the smallest Hamming distance between the strings (generated
string (4.22) and desired string (4.20)) be d ≥ 0. Without loss of
generality we can assume that d ≤ ℓ/2. (Otherwise, instead of the smallest
distance one could look for the largest Hamming distance and during decoding
change ones to zeros and vice versa. This will cost one extra bit in the
coding scheme.) This means that to describe the string one has to make d
corrections to the results given by the chosen table in the code-book.
For fixed d there are C_ℓ^d different possible corrections to the string of
length ℓ. To specify one of them (i.e., to specify the number of one of the
C_ℓ^d variants) one needs ⌈log₂ C_ℓ^d⌉ bits.
Therefore, to describe the string (4.20) we need ⌈log₂ N⌉ bits to define
the number of the table, and ⌈log₂ C_ℓ^d⌉ bits to describe the corrections. We
also need ⌈log₂ d⌉ + Δd bits to specify the number of corrections d, where
Δd < 2 log₂ log₂ d, d > 2. Altogether, we need ⌈log₂ N⌉ + ⌈log₂ C_ℓ^d⌉ + ⌈log₂ d⌉ + Δd
bits to describe string (4.20). This number should be compared to ℓ,
the number of bits needed to describe an arbitrary binary string (4.20).
Therefore the coefficient of compression is

K(T) = ( ⌈log₂ N⌉ + ⌈log₂ C_ℓ^d⌉ + ⌈log₂ d⌉ + Δd ) / ℓ. (4.24)

If the coefficient of compression K(T) is small, then according to the
Solomonoff-Kolmogorov idea, the string is not random and somehow depends
on the input vectors x. In this case, the decoding table T somehow
approximates the unknown functional relation between x and ω.

4.6.2 Bounds for the MDL Principle


The important question is:
Does the compression coefficient K(T) determine the probability of test
error in classifying (decoding) vectors x by the table T?
The answer is yes.
To prove this, let us compare the result obtained for the MDL principle
to that obtained for the ERM principle in the simplest model (the learning
machine with a finite set of functions).

At the beginning of this section we considered the bound (4.1) for the
generalization ability of a learning machine for the pattern recognition
problem. For the particular case where the learning machine has a finite
number N of functions, we obtained that with probability of at least 1 − η
the inequality

R(Ti) ≤ Remp(Ti) + ((ln N − ln η)/ℓ) (1 + √(1 + 2Remp(Ti) ℓ / (ln N − ln η))) (4.25)

holds true simultaneously for all N functions in the given set of functions
(for all N tables in the given code-book). Let us transform the right-hand
side of this inequality using the concept of the compression coefficient and
the fact that

Remp(T) = d/ℓ.

Note that the inequality

d/ℓ + ((ln N − ln η)/ℓ) (1 + √(1 + 2d/(ln N − ln η)))
    ≤ 2 ( (⌈log₂ N⌉ + ⌈log₂ C_ℓ^d⌉ + ⌈log₂ d⌉ + Δd) ln 2 / ℓ − ln η / ℓ ) (4.26)

is valid (one can easily check it). Now let us rewrite the right-hand side of
inequality (4.26) in terms of the compression coefficient (4.24):

2 ( K(T) ln 2 − ln η / ℓ ).

Since inequality (4.25) holds true with probability of at least 1 − η and
inequality (4.26) holds with probability 1, the inequality

R(T) ≤ 2 ( K(T) ln 2 − ln η / ℓ ) (4.27)

holds with probability of at least 1 − η.

4.6.3 The SRM and the MDL Principles


Now suppose that we are given M code-books which have the following
structure: code-book 1 contains a small number of tables, code-book 2
contains these tables and some more tables, and so on.
In this case one can use a more sophisticated decoding scheme to describe
string (4.20): first, describe the number m of the code-book (this requires
⌈log₂ m⌉ + Δm, Δm < 2 log₂ log₂ m, bits) and then, using this code-book,
describe the string (which, as shown above, takes ⌈log₂ N⌉ + ⌈log₂ C_ℓ^d⌉ +
⌈log₂ d⌉ + Δd bits).
The total length of the description in this case is ⌈log₂ N⌉ + ⌈log₂ C_ℓ^d⌉ +
⌈log₂ d⌉ + Δd + ⌈log₂ m⌉ + Δm, and the compression coefficient is

K(T) = ( ⌈log₂ N⌉ + ⌈log₂ C_ℓ^d⌉ + ⌈log₂ d⌉ + Δd + ⌈log₂ m⌉ + Δm ) / ℓ.
For this case an inequality analogous to inequality (4.27) holds. Therefore,
the probability of error for the table which was used for compressing the
description of string (4.20), is bounded by inequality (4.27).
Thus we have proved the following theorem:
Theorem 4.3. If, on a given structure of code-books, one compresses by
a factor K(T) the description of string (4.20) using a table T, then with
probability of at least 1 − η one can assert that the probability of committing
an error by the table T is bounded by

R(T) ≤ 2 ( K(T) ln 2 − ln η / ℓ ). (4.28)

Note how powerful the concept of the compression coefficient is: to obtain
a bound on the probability of error, we actually need only information
about this coefficient.⁸ We do not need such details as
(i) How many examples we used,
(ii) how the structure of code-books was organized,
(iii) which code-book was used,
(iv) how many tables were in the code-book,
(v) how many training errors were made using this table.
Nevertheless, the bound (4.28) is not much worse than the bound on the
risk (4.25) obtained on the basis of the theory of uniform convergence.
The latter has a more sophisticated structure and uses information about
the number of functions (tables) in the sets, the number of errors on the
training set, and the number of elements of the training set.
Note also that the bound (4.28) cannot be improved by more than a factor
of 2: it is easy to show that in the case where there exists a perfect table in
the code-book, equality can be achieved with factor 1.
This theorem justifies the MDL principle: to minimize the probability of
error one has to minimize the coefficient of compression.
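A small sketch of this computation, with Δd taken as 2 log₂ log₂ d and the standard-library function math.comb used for C_ℓ^d; the numbers are purely illustrative.

    import math

    def compression_coefficient(N, d, ell):
        """Compression coefficient K(T) of Eq. (4.24): bits for the table
        index, the correction pattern, and the number of corrections,
        divided by ell (Delta_d is taken as 2 log2 log2 d here)."""
        bits = math.ceil(math.log2(N)) + math.ceil(math.log2(math.comb(ell, d)))
        if d > 2:
            bits += math.ceil(math.log2(d)) + 2 * math.log2(math.log2(d))
        return bits / ell

    def mdl_error_bound(K, ell, eta=0.05):
        """Bound (4.28) on the probability of error of the compressing table."""
        return 2.0 * (K * math.log(2) - math.log(eta) / ell)

    K = compression_coefficient(N=1024, d=5, ell=1000)
    print(round(K, 3), round(mdl_error_bound(K, ell=1000), 3))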

⁸The second term, −ln η/ℓ, on the right-hand side is actually fool-proof: for
reasonable η and ℓ it is negligible compared to the first term, but it prevents one
from considering too small η and/or too small ℓ.

4.6.4 A Weak Point of the MDL Principle


There exists, however, a weak point in the MDL principle.
Recall that the MDL principle uses a code-book with a finite number of
tables. Therefore, to deal with a set of functions determined by a continuous
range of parameters, one must make a finite number of tables.
This can be done in many ways. The problem is:
What is a "smart" code-book for the given set of functions?
In other words, how, for a given set of functions, can one construct a
code-book with a small number of tables, but with good approximation
ability?
A "smart" quantization could significantly reduce the number of tables
in the code-book. This affects the compression coefficient. Unfortunately,
finding a "smart" quantization is an extremely hard problem. This is the
weak point of the MDL principle.
In the next chapter we will consider a normalized set of linear functions
in a very high-dimensional space (in our experiments we use linear functions
in an N ∼ 10¹³-dimensional space). We will show that the VC dimension h
of the subset of functions with bounded norm depends on the value of the
bound. It can be a small value (in our experiments h ∼ 10²–10³). One can
guarantee that if a function from this set separates a training set of size ℓ
without error, then the probability of test error is proportional to (h ln ℓ)/ℓ.
The problem for the MDL approach to this set of indicator functions is
how to construct a code-book with ∼ ℓ^h tables (but not with ∼ ℓ^N tables)
that approximates this set of linear functions well.

The MDL principle works well when the problem of constructing reasonable
code-books has an obvious solution. But even in this case, it is
not better than the SRM principle. Recall that the bound for the MDL
principle (which cannot be improved using only the concept of the compression
coefficient) was obtained by roughening the bound for the SRM principle.
Informal Reasoning and Comments - 4

Attempts to improve performance in various areas of computational math-


ematics and statistics have essentially led to the same idea that we call the
Structural Risk Minimization inductive principle.
First this idea appeared in the methods for solving ill-posed problems:
(i) Methods of quasi-solutions (Ivanov, 1962),
(ii) methods of regularization (Tikhonov, 1963)).
It then appeared in the method for nonparametric density estimation:
(i) Parzen windows (Parzen, 1962),
(ii) projection methods (Chentsov, 1963),
(iii) conditional maximum likelihood method (the method of sieves (Grenander, 1981)),
(iv) maximum penalized likelihood method (Tapia and Thompson, 1978),
etc.
The idea then appeared in methods for regression estimation:
(i) Ridge regression (Hoerl and Kennard, 1970),
(ii) model selection (see review in (Miller, 1990)).
Finally, it appeared in regularization techniques for both pattern recogni-
tion and regression estimation algorithms (Poggio and Girosi, 1990).

Of course, there were a number of attempts to justify the idea of searching


for a solution using a structure on the admissible set of functions. However,
in the framework of the classical approach the justifications were obtained
only for specific problems and only for the asymptotic case.
In the model of risk minimization from empirical data, the SRM principle
provides capacity (VC dimension) control, and it can be justified for a finite
number of observations.

4.7 METHODS FOR SOLVING ILL-POSED PROBLEMS

In 1962 Ivanov suggested an idea for finding a quasi-solution of the linear
operator equation

Af = F,   f ∈ M, (4.29)

in order to solve ill-posed problems. (The linear operator A maps the
elements of the metric space M ⊂ E1 with metric ρ_E1 into elements of the
metric space N ⊂ E2 with metric ρ_E2.) He suggested considering a set of
nested convex compact subsets

M1 ⊂ M2 ⊂ ⋯ ⊂ Mn ⊂ ⋯ , (4.30)

⋃_{i=1}^∞ Mi = M, (4.31)

and for any subset Mi to find a function fi* ∈ Mi minimizing the distance

ρ_E2(Afi*, F) = inf_{f∈Mi} ρ_E2(Af, F).
Ivanov proved that under some general conditions the sequence of solutions

f1*, …, fn*, …

converges to the desired one.
The quasi-solution method was suggested at the same time that Tikhonov
proposed his regularization technique; in fact the two are equivalent. In the
regularization technique, one introduces a non-negative lower semicontinuous
functional Ω(f) that possesses the following properties:

(i) The domain of the functional coincides with M (the domain to which
the solution of Eq. (4.29) belongs).

(ii) The region for which the inequality

Ω(f) ≤ c

holds forms a compactum in the metric of the space E1.



(iii) The solution of Eq. (4.29) belongs to some Md*:

Ω(f) ≤ d* < ∞.

Tikhonov suggested finding a sequence of functions fγ minimizing the functionals

Φγ(f) = ρ²_E2(Af, F) + γΩ(f)

for different γ. He proved that fγ converges to the desired solution as γ
converges to 0.
Tikhonov also suggested using the regularization technique even in the
case where the right-hand side of the operator equation is given only within
some δ-accuracy:

ρ_E2(F, Fδ) ≤ δ.

In this case, minimizing the functionals

Φγ(f) = ρ²_E2(Af, Fδ) + γΩ(f), (4.32)

one obtains a sequence fδ of solutions converging (in the metric of E1) to
the desired one f0 as δ → 0 if

lim_{δ→0} γ(δ) = 0,

lim_{δ→0} δ²/γ(δ) = 0.
In both methods the formal convergence proofs do not explicitly contain
"capacity control." Essential, however, was the fact that any subset Mi in
Ivanov's scheme and any subset Mc = {f : Ω(f) ≤ c} in Tikhonov's scheme
is compact. That means it has a bounded capacity (a metric ε-entropy).
Therefore both schemes implement an SRM principle: first define a structure
on the set of admissible functions such that any element of the structure
has a finite capacity, increasing with the number of the element. Then,
on any element of the structure, the function providing the best approximation
of the right-hand side of the equation is found. The sequence of the
obtained solutions converges to the desired one.
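A minimal sketch of minimizing functional (4.32) when the operator A is a matrix, the metric is Euclidean, and the regularizer is the assumed choice Ω(f) = ‖f‖²; all names here are illustrative.

    import numpy as np

    def tikhonov_solution(A, F_delta, gamma):
        """Minimize the functional (4.32) with the Euclidean metric and the
        assumed regularizer Omega(f) = ||f||^2:
            ||A f - F_delta||^2 + gamma ||f||^2.
        The minimizer solves (A^T A + gamma I) f = A^T F_delta."""
        n = A.shape[1]
        return np.linalg.solve(A.T @ A + gamma * np.eye(n), A.T @ F_delta)

    # An ill-conditioned operator: without regularization the solution is
    # dominated by noise; with gamma > 0 it stays close to the true f.
    rng = np.random.default_rng(0)
    A = np.vander(np.linspace(0, 1, 20), 8, increasing=True)  # nearly singular
    f_true = rng.normal(size=8)
    F_delta = A @ f_true + 1e-3 * rng.normal(size=20)         # delta-accurate data
    for gamma in (1e-6, 1e-3):
        f = tikhonov_solution(A, F_delta, gamma)
        print(gamma, round(np.linalg.norm(f - f_true), 3))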

4.8 STOCHASTIC ILL-POSED PROBLEMS AND THE


PROBLEM OF DENSITY ESTIMATION

In 1978 we generalized the theory of regularization to stochastic ill-posed


problems (Vapnik and Stefanyuk, 1978). We considered the problem of solving
the operator equation (4.29) in the case where the right-hand side is
unknown, but we are given a sequence of approximations Fδ possessing
the following properties:

(i) Each of these approximations Fδ is a random function.⁹

(ii) The sequence of approximations converges in probability (in the metric
of the space E2) to the unknown function F as δ converges to zero.
In other words, the sequence of random functions Fδ has the property

P{ρ_E2(Fδ, F) > ε} → 0 as δ → 0, for any ε > 0.

Using Tikhonov's regularization technique one can obtain, on the basis of the
random functions Fδ, a sequence of approximations fδ to the solution of
Eq. (4.29).
We proved that for any ε > 0 there exists γ0 = γ0(ε) such that for any
γ(δ) ≤ γ0 the functions minimizing functional (4.32) satisfy the inequality

(4.33)

In other words, we connected the distribution of the random deviation of the
approximations from the exact right-hand side (in the E2 metric) with the
distribution of the deviations of the solutions obtained by the regularization
method from the desired one (in the E1 metric).
In particular, this theorem gave us an opportunity to find a general way
for constructing various density estimation methods.
As mentioned in Section 1.8, density estimation requires us to solve
the integral equation

∫_{−∞}^x p(t) dt = F(x),

where F(x) is an unknown probability distribution function, using i.i.d.
data x1, …, xℓ.
Let us construct the empirical distribution function

Fℓ(x) = (1/ℓ) Σ_{i=1}^ℓ θ(x − xi),

which is a random approximation to F(x), since it was constructed using
the random data x1, …, xℓ.
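A direct sketch of this construction for the one-dimensional case:

    import numpy as np

    def empirical_distribution(x_data):
        """Empirical distribution function F_l(x) = (1/l) sum_i theta(x - x_i),
        a random approximation to the unknown F(x)."""
        x_data = np.sort(np.asarray(x_data))
        def F_ell(x):
            # fraction of observations not exceeding x
            return np.searchsorted(x_data, x, side="right") / len(x_data)
        return F_ell

    sample = np.random.default_rng(0).normal(size=1000)
    F = empirical_distribution(sample)
    print(F(0.0))   # close to the true value 0.5 for the standard normal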
In Section 3.9 we found that the differences sup_x |F(x) − Fℓ(x)| are
described by the Kolmogorov-Smirnov bound. Using this bound we obtain

P{ sup_x |F(x) − Fℓ(x)| > ε } < 2 exp{−2ε²ℓ}.
⁹A random function is one that is defined by a realization of some random
event. For a definition of random functions see any advanced textbook in
probability theory, for example, A.N. Schiryaev, Probability, Springer, New York.

Therefore, if one minimizes the regularized functional

R(p) = ρ²_E2( ∫_{−∞}^x p(t) dt, Fℓ(x) ) + γℓ Ω(p), (4.34)

then according to inequality (4.33) one obtains estimates pℓ(t) whose
deviation from the desired solution is described by that inequality.
Therefore the conditions for consistency of the obtained estimators are

γℓ → 0,   ℓγℓ → ∞ as ℓ → ∞. (4.35)

Thus, minimizing functionals of type (4.34) under the constraint (4.35)
gives consistent estimators. Using various norms E2 and various functionals
Ω(p) one can obtain various types of density estimators (including all
classical estimators¹⁰). For our reasoning it is important that all nonparametric
density estimators implement the SRM principle. By choosing the
functional Ω(p), one defines a structure on the set of admissible solutions
(the nested set of functions Mc = {p : Ω(p) ≤ c} determined by the constant c);
using the law γℓ one determines the appropriate element of the structure.

4.9 THE PROBLEM OF POLYNOMIAL


APPROXIMATION OF THE REGRESSION

The problem of constructing a polynomial approximation of regression that


was very popular in the 1970s played an important role in understanding
the problems that arose in small sample size statistics.
Consider for simplicity the problem of estimating a one-dimensional re-
gression by polynomials. Let the regression f (x) be a smooth function.

¹⁰By the way, one can obtain all classical estimators if one approximates an
unknown distribution function F(x) by the empirical distribution function
Fℓ(x). The empirical distribution function, however, is not the best approximation
to the distribution function since, according to its definition, the distribution
function should be absolutely continuous, while the empirical distribution
function is discontinuous. Using absolutely continuous approximations (e.g., a
polygon in the one-dimensional case) one can obtain estimators that, in addition
to nice asymptotic properties (shared by the classical estimators), possess
some useful properties from the point of view of limited numbers of observations
(Vapnik, 1988).

Suppose that we are given a finite number of measurements of this function
corrupted by additive noise:

yi = f(xi) + ξi,   i = 1, …, ℓ

(in different settings of the problem, different types of information about
the unknown noise are used; in this model of measuring with noise, we
suppose that the value of the noise ξi does not depend on xi, and that the
point of measurement xi is chosen randomly according to an unknown
probability distribution F(x)).
The problem is to find the polynomial that is closest (say in the L2(F)
metric) to the unknown regression function f(x). In contrast to the classical
regression problem described in Section 1.7.3, the set of functions in
which one has to approximate the regression is now rather wide (polynomials
of any degree), and the number of observations is fixed.
Solving this problem taught statisticians a lesson in understanding the
nature of the small sample size problem. First the simplified version of this
problem was considered: the case where the regression itself is a polynomial
(but the degree of the polynomial is unknown) and the model of noise is
described by a normal density with zero mean. For this particular problem
the classical asymptotic approach was used: on the basis of the technique of
testing hypotheses, the degree of the regression polynomial was estimated
and then the coefficients of the polynomial were estimated. Experiments,
however, showed that for small sample sizes this idea was wrong: even if
one knows the actual degree of the regression polynomial, one often has to
choose a smaller degree for the approximation, depending on the available
number of observations.
Therefore several ideas for estimating the degree of the approximating
polynomial were suggested including (Akaike, 1970), and (Schwartz, 1978)
(see (Miller, 1990)). These methods, however, were justified only in asymp-
totic cases.

4.10 THE PROBLEM OF CAPACITY CONTROL

4.10.1 Choosing the Degree of the Polynomial


Choosing the appropriate degree p of the polynomial in the regression problem
can be considered on the basis of the SRM principle, where the set of
polynomials is provided with the simplest structure: the first element of the
structure contains polynomials of degree one,

f1(x, a) = a1x + a0,   a = (a1, a0) ∈ R²;

the second element contains polynomials of degree two,

f2(x, a) = a2x² + a1x + a0,   a = (a2, a1, a0) ∈ R³;

and so on.
To choose the polynomial of the best degree, one can minimize the following
functional (the right-hand side of bound (3.30)):

R(a, m) = (1/ℓ) Σ_{i=1}^ℓ (yi − fm(xi, a))² / (1 − c√ℰm)₊ , (4.36)

ℰm = 4 (hm(ln(2ℓ/hm) + 1) − ln(η/4)) / ℓ,

where hm is the VC dimension of the set of the loss functions

Q(z, a) = (y − fm(x, a))²,   a ∈ Λ,

and c is a constant determining the "tails of distributions" (see Sections
3.4 and 3.7).
One can show that the VC dimension h of the set of real functions

Q(z, a) = F(|g(z, a)|),   a ∈ Λ,

where F(u) is any fixed monotonic function, does not exceed 2h*, where
h* is the VC dimension of the set of indicators

I(z, a, β) = θ(g(z, a) − β),   a ∈ Λ, β ∈ R¹.

Therefore for our loss functions the VC dimension is bounded as follows:

hm ≤ 2m + 2.

To find the best approximating polynomial, one has to choose both the degree
m of the polynomial and the coefficients a that minimize functional¹¹ (4.36).
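A sketch of this procedure in Python, assuming the functional (4.36) with c = 1, η = 0.05, and hm = 2m + 2; the function names are illustrative.

    import numpy as np

    def srm_polynomial_degree(x, y, max_degree=10, c=1.0, eta=0.05):
        """Choose the polynomial degree by minimizing a functional of the
        form (4.36): empirical risk divided by (1 - c sqrt(E_m))_+, with
        h_m = 2m + 2 used as the VC dimension of the loss set."""
        ell = len(y)
        best_m, best_R = None, np.inf
        for m in range(1, max_degree + 1):
            coeffs = np.polyfit(x, y, deg=m)              # least-squares fit
            r_emp = np.mean((np.polyval(coeffs, x) - y) ** 2)
            h = 2 * m + 2
            E = 4.0 * (h * (np.log(2.0 * ell / h) + 1.0) - np.log(eta / 4.0)) / ell
            denom = 1.0 - c * np.sqrt(E)
            if denom <= 0:                                 # bound is vacuous
                continue
            R = r_emp / denom
            if R < best_R:
                best_m, best_R = m, R
        return best_m

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, 2000)
    y = np.sin(3 * x) + 0.1 * rng.normal(size=2000)
    print(srm_polynomial_degree(x, y))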

4.10.2 Choosing the Best Sparse Algebraic Polynomial


Let us now introduce another structure on the set of algebraic polynomials:
let the first element of the structure contain polynomials P1(x, a) =
a1x^d, a ∈ R¹ (of arbitrary degree d), with one nonzero term; let the second
element contain polynomials P2(x, a) = a1x^{d1} + a2x^{d2}, a ∈ R², with
two nonzero terms; and so on. The problem is to choose the best sparse
polynomial Pm(x) to approximate a smooth regression function.
To do this, one has to estimate the VC dimension of the set of loss
functions

Q(z, a) = (y − Pm(x, a))²,

¹¹We used this functional (with constant c = 1 and ℰ = [m(ln(ℓ/m) + 1) + 3]/ℓ)
in several benchmark studies for choosing the degree of the best approximating
polynomial. For small sample sizes the results obtained were often better than
ones based on the classical suggestions.

where Pm(x, a), a ∈ R^m, is a set of polynomials of arbitrary degree that
contain m terms. Consider the case of one variable x.
The VC dimension h for this set of loss functions can be bounded by 2h*,
where h* is the VC dimension of the set of indicators

I(x, a, β) = θ(Pm(x, a) − β).

Karpinski and Werther showed that the VC dimension h* of this set of
indicators is bounded as follows:

3m ≤ h* ≤ 4m + 3

(Karpinski and Werther, 1989). Therefore our set of loss functions has
VC dimension that does not exceed 8m + 6. This estimate can be used for finding
the sparse algebraic polynomial that minimizes the functional (4.36).

4.10.3 Structures on the Set of Trigonometric Polynomials


Consider now structures on the set of trigonometric polynomials. First we
consider a structure that is determined by the degree of the polynomials.¹²
The VC dimension of the set of our loss functions with trigonometric polynomials
of degree m is less than h = 4m + 2. Therefore, to choose the best
trigonometric approximation one can minimize the functional (4.36). For
this structure there is no difference between algebraic and trigonometric
polynomials.
The difference appears when one constructs a structure of sparse trigono-
metric polynomials. In contrast to the sparse algebraic polynomials where
any element of the structure has finite VC dimension, the VC dimension
of any element of the structure on the sparse trigonometric polynomials is
infinite.
This follows from the fact that the VC dimension of the following set of
indicator functions:

f(x, a) = θ(sin ax),   a ∈ R¹, x ∈ (0, 1),

is infinite (see Example 2, Section 3.6).

¹²Trigonometric polynomials of degree m have the form

fm(x) = Σ_{k=1}^m (ak sin kx + bk cos kx) + a0.

4.10.4 The Problem of Feature Selection


The problem of choosing sparse polynomials plays an extremely important
role in learning theory since the generalization of this problem is a problem
of feature selection (feature construction) using empirical data.
As was demonstrated on the examples, the above problem of feature se-
lection (the terms in sparse polynomials can be considered as the features)
is quite delicate. To avoid the effect encountered for sparse trigonometric
polynomials, one needs to construct a priori a structure containing ele-
ments with bounded VC dimension and then choose decision rules from the
functions of this structure.
Constructing a structure for learning algorithms that select (construct)
features and control capacity is usually a hard combinatorial problem.
In the 1980s in applied statistics, several attempts were made to find
reliable methods of selecting nonlinear functions that control capacity. In
particular, statisticians started to study the problem of function estimation
in the following sets of functions:

y = Σ_{j=1}^m aj K(x, wj) + a0,

where K(x, w) is a symmetric function with respect to the vectors x and w,
w1, …, wm are unknown vectors, and a1, …, am are unknown scalars (Friedman
and Stuetzle, 1981), (Breiman, Friedman, Olshen, and Stone, 1984)
(in contrast to approaches developed in the 1970s for estimating linear-in-parameters
functions (Miller, 1990)). In these classes of functions, choosing
the functions K(x, wj), j = 1, …, m, can be interpreted as feature selection.
As we will see in the next chapter, for sets of functions of this type it
is possible to effectively control both factors responsible for generalization
ability - the value of the empirical risk and the VC dimension.

4.11 THE PROBLEM OF CAPACITY CONTROL AND


BAYESIAN INFERENCE

4.11.1 The Bayesian Approach in Learning Theory


In the classical paradigm of function estimation, an important place belongs
to the Bayesian approach (Berger, 1985).
According to Bayes's formula two events A and B are connected by the
equality

P(A|B) = P(B|A) P(A) / P(B).
One uses this formula to modify the ML models of function estimation
discussed in the comments on Chapter 1.

Consider, for simplicity, the problem of regression estimation from measurements
corrupted by additive noise:

yi = f(xi, α0) + ξi.

In order to estimate the regression by the ML method, one has to know a
parametric set of functions f(x, α), α ∈ Λ ⊂ Rⁿ, that contains the regression
f(x, α0), and one has to know a model of the noise, P(ξ).
In the Bayesian approach, one has to possess additional information:
one has to know the a priori density function P(α) that, for any function
from the parametric set of functions f(x, α), α ∈ Λ, defines the probability
for it to be the regression. If f(x, α0) is the regression function, then the
probability of the training data

[Y, X] = (y1, x1), …, (yℓ, xℓ)

equals

P([Y, X]|α0) = Π_{i=1}^ℓ P(yi − f(xi, α0)).
Having seen the data, one can a posteriori estimate the probability that
parameter α defines the regression:

P(α|[Y, X]) = P([Y, X]|α) P(α) / P([Y, X]). (4.37)

One can use this expression to choose an approximation to the regression
function.
Let us consider the simplest way: we choose the approximation f(x, α*)
such that it yields the maximum conditional probability.¹³ Finding α* that
maximizes this probability is equivalent to maximizing the following functional:

Φ(α) = Σ_{i=1}^ℓ ln P(yi − f(xi, α)) + ln P(α). (4.38)

¹³Another estimator constructed on the basis of the a posteriori probability,

φ(x|[Y, X]) = ∫ f(x, α) P(α|[Y, X]) dα,

possesses the following remarkable property: it minimizes the average quadratic
deviation from the admissible regression functions,

R(φ) = ∫ (f(x, α) − φ(x|[Y, X]))² P([Y, X]|α) P(α) dx d([Y, X]) dα.

To find this estimator in explicit form one has to conduct the integration analytically
(numerical integration is impossible due to the high dimensionality of α). Unfortunately,
analytic integration of this expression is mostly an unsolvable problem.

Let us for simplicity consider the case where the noise is distributed according
to the normal law

P(ξ) = (1/(√(2π) σ)) exp{−ξ²/(2σ²)}.

Then from Eq. (4.37) one obtains the following functional:

Φ(α) = (1/ℓ) Σ_{i=1}^ℓ (yi − f(xi, α))² − (2σ²/ℓ) ln P(α), (4.39)

which has to be minimized with respect to α in order to find the approximation
function. The first term of this functional is the value of the empirical
risk, and the second term can be interpreted as a regularization term with the
explicit form of the regularization parameter.
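A sketch of this observation: with an assumed Gaussian prior P(α) ∝ exp{−‖α‖²/(2τ²)} and the linear model f(x, α) = (α · x), minimizing a functional of the form (4.39) reduces to ridge regression with regularization parameter σ²/τ².

    import numpy as np

    def map_regression(X, y, sigma2, tau2):
        """Minimize a functional of the form (4.39) for the linear model
        f(x, alpha) = (alpha . x) with the assumed Gaussian prior
        P(alpha) ~ exp(-||alpha||^2 / (2 tau^2)).  Then -ln P(alpha) is
        ||alpha||^2/(2 tau^2) up to a constant, and the minimizer is the
        ridge-regression solution with parameter gamma = sigma^2/tau^2."""
        n = X.shape[1]
        gamma = sigma2 / tau2
        return np.linalg.solve(X.T @ X + gamma * np.eye(n), X.T @ y)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))
    alpha_true = np.array([1.0, -0.5, 2.0])
    y = X @ alpha_true + 0.3 * rng.normal(size=50)
    print(np.round(map_regression(X, y, sigma2=0.09, tau2=1.0), 2))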
Therefore the Bayesian approach brings us to the same scheme that is
used in SRM or MDL inference.
The goal of these comments is, however, to describe a difference between
the Bayesian approach and SRM or MDL.

4.11.2 Discussion of the Bayesian Approach and Capacity


Control Methods
The only (but significant) shortcoming of the Bayesian approach is that it
is restricted to the case where the set of functions of the learning machine
coincides with the set of problems that the machine has to solve. Strictly
speaking, it cannot be applied in a situation where the set of admissible
problems differs from the set of admissible functions of the learning machine.
For example, it cannot be applied to the problem of approximation
of the regression function by polynomials if the regression function is not
polynomial, since the a priori probability P(α) for any function from the
admissible set of polynomials to be the regression is equal to zero. Therefore,
the a posteriori probability (4.37) for any admissible function of the
learning machine is zero. To use the Bayesian approach one must possess
strong a priori information:

(i) The given set of functions of the learning machine coincides with the
set of problems to be solved.

(ii) The a priori distribution on the set of problems is described by the
given expression P(α).¹⁴

¹⁴This part of the a priori information is not as important as the first one. One can
prove that with increasing numbers of observations the influence of an inaccurate
description of P(α) decreases.

In contrast to the Bayesian method, the capacity (complexity) control


methods SRM or MDL use weak (qualitative) a priori information about
reality: they use a structure on the admissible set of functions (the set of
functions is ordered according to an idea of usefulness of the functions);
this a priori information does not include any quantitative description of
reality. Therefore, using these approaches, one can approximate a set of
functions that is different from the admissible set of functions of the learn-
ing machine.
Thus, inductive inference in the Bayesian approach is based (along with
training data) on given strong (quantitative) a priori information about
reality, while inductive inference in the SRM or MDL approaches is based
(along with training data) on weak (qualitative) a priori information about
reality, but uses capacity (complexity) control.
In discussions with advocates of the Bayesian formalism, who use this
formalism in the case where the set of problems to be solved and the set of
admissible functions of the machine do not coincide, one hears the following
claim:
The Bayesian approach also works in general situations.
The fact that the Bayesian formalism sometimes works in general situations
(where the functions implemented by the machine do not necessarily
coincide with those being approximated) has the following explanation.
Bayesian inference has an outward form of capacity control. It has two
stages: an informal stage, where one chooses a function describing (quantitative)
a priori information P(α) for the problem at hand, and a formal
stage, where one finds the solution by minimizing the functional (4.38). By
choosing the distribution P(α) one controls capacity.
Therefore, in the general situation the Bayesian formalism realizes a
human-machine procedure for solving the problem at hand, where capacity
control is implemented by a human choice of the regularizer ln P(α).
In contrast to Bayesian inference, SRM or MDL inference are pure machine
ways for solving problems. For any ℓ they use the same structure
on the set of admissible functions and the same formal mechanisms for
capacity control.
Chapter 5
Constructing Learning Algorithms

To implement the SRM inductive principle in learning algorithms one has


to minimize the risk in a given set of functions by controlling two factors:
the value of the empirical risk and the value of the confidence interval.
Developing such methods is the goal of the theory for constructing learn-
ing algorithms.
In this chapter we describe learning algorithms for pattern recognition
and consider their generalizations for the regression estimation problem.

5.1 WHY CAN LEARNING MACHINES GENERALIZE?

The generalization ability of learning machines is based on the factors de-


scribed in the theory for controlling the generalization ability of learning
processes. According to this theory, to guarantee a high level of generaliza-
tion ability of the learning process one has to construct a structure

S1 ⊂ S2 ⊂ ⋯ ⊂ S

on the set of loss functions S = {Q(z, α), α ∈ Λ} and then choose both an
appropriate element Sk of the structure and a function Q(z, α_ℓ^k) ∈ Sk in
this element that minimizes the corresponding bounds, for example, bound
(4.1). The bound (4.1) can be rewritten in the simple form

R(α_ℓ^k) ≤ Remp(α_ℓ^k) + Φ(ℓ/hk), (5.1)

where the first term is the empirical risk and the second term is the confi-
dence interval.
There are two constructive approaches to minimizing the right-hand side
of inequality (5.1).
In the first approach, during the design of the learning machine one
determines a set of admissible functions with some VC dimension h*. For
a given amount ℓ of training data, the value h* determines the confidence
interval Φ(ℓ/h*) for the machine. Choosing an appropriate element of the
structure is therefore a problem of designing the machine for a specific
amount of data.
During the learning process this machine minimizes the first term of the
bound (5.1) (the number of errors on the training set).
If for a given amount of training data one designs too complex a machine,
the confidence interval Φ(ℓ/h*) will be large. In this case even if one could
minimize the empirical risk down to zero, the number of errors on the test
set could still be large. This phenomenon is called overfitting.
To avoid overfitting (to get a small confidence interval) one has to construct
machines with small VC dimension. On the other hand, if the set of
functions has a small VC dimension, then it is difficult to approximate the
training data (to get a small value for the first term in inequality (5.1)).
To obtain a small approximation error and simultaneously keep a small
confidence interval one has to choose the architecture of the machine to
reflect a priori knowledge about the problem at hand.
Thus, to solve the problem at hand by these types of machines, one first
has to find the appropriate architecture of the learning machine (which is
a result of the trade-off between overfitting and poor approximation) and
second, find in this machine the function that minimizes the number of
errors on the training data. This approach to minimizing the right-hand
side of inequality (5.1) can be described as follows:
Keep the confidence interval fixed (by choosing an appropriate construction
of the machine) and minimize the empirical risk.
The second approach to the problem of minimizing the right-hand side
of inequality (5.1) can be described as follows:
Keep the value of the empirical risk fixed (say equal to zero) and minimize
the confidence interval.
Below we consider two different types of learning machines that imple-
ment these two approaches:
(i) Neural Networks (which implement the first approach), and
(ii) Support Vector machines (which implement the second approach).
Both types of learning machines are generalizations of the learning ma-
chines with a set of linear indicator functions constructed in the 1960s.

5.2 SIGMOID APPROXIMATION OF INDICATOR


FUNCTIONS

Consider the problem of minimizing the empirical risk on the set of linear
indicator functions

f(x, w) = sign{(w · x)},   w ∈ Rⁿ, (5.2)

where (w · x) denotes the inner product between the vectors w and x. Let

(x1, y1), …, (xℓ, yℓ)

be a training set, where xj is a vector and yj ∈ {1, −1}, j = 1, …, ℓ.
The goal is to find the vector of parameters w0 (weights) that minimizes
the empirical risk functional

Remp(w) = (1/ℓ) Σ_{j=1}^ℓ (yj − f(xj, w))². (5.3)

If the training set is separable without error (i.e., the empirical risk can
become zero), then there exists a finite-step procedure that allows us to find
such a vector w0, for example the procedure that Rosenblatt proposed for
the perceptron (see the Introduction).
The problem arises when the training set cannot be separated without
errors. In this case the problem of separating the training data with the
smallest number of errors is NP-complete. Moreover, one cannot apply regular
gradient-based procedures to find a local minimum of functional (5.3),
since for this functional the gradient is either equal to zero or undefined.
Therefore, the idea was proposed to approximate the indicator functions
(5.2) by so-called sigmoid functions (see Fig. 0.3)

f(x, w) = S{(w · x)}, (5.4)

where S(u) is a smooth monotonic function such that

S(−∞) = −1,   S(+∞) = 1,

for example,

S(u) = tanh u = (exp(u) − exp(−u)) / (exp(u) + exp(−u)).

For the set of sigmoid functions, the empirical risk functional

Remp(w) = (1/ℓ) Σ_{j=1}^ℓ (yj − S{(w · xj)})²

is smooth in w. It has the gradient

grad_w Remp(w) = −(2/ℓ) Σ_{j=1}^ℓ [yj − S((w · xj))] S′{(w · xj)} xj,

and therefore it can be minimized using standard gradient-based methods,
for example, the gradient descent method:

w_new = w_old − γ(·) grad_w Remp(w_old),

where γ(·) = γ(n) ≥ 0 is a value that depends on the iteration number
n. For convergence of the gradient descent method to a local minimum it is
sufficient that the values of the gradient be bounded and that the coefficients
γ(n) satisfy the following conditions:

Σ_{n=1}^∞ γ(n) = ∞,   Σ_{n=1}^∞ γ²(n) < ∞.

Thus, the idea is to use the sigmoid approximation at the stage of esti-
mating the coefficients, and use the threshold functions (with the obtained
coefficients) for the last neuron at the stage of recognition.
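A minimal sketch of this scheme for a single sigmoid neuron, using a constant learning rate instead of a schedule γ(n); this simplification, and the names used, are illustrative assumptions.

    import numpy as np

    def train_sigmoid_neuron(X, y, lr=0.5, n_iter=1000):
        """Gradient descent on the sigmoid approximation of the empirical
        risk, R_emp(w) = (1/l) sum (y_j - tanh((w . x_j)))^2."""
        w = np.zeros(X.shape[1])
        ell = len(y)
        for _ in range(n_iter):
            s = np.tanh(X @ w)
            # tanh'(u) = 1 - tanh(u)^2
            grad = -(2.0 / ell) * X.T @ ((y - s) * (1.0 - s ** 2))
            w -= lr * grad
        return w

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = np.sign(X @ np.array([2.0, -1.0]))       # a separable training set
    w = train_sigmoid_neuron(X, y)
    print(np.mean(np.sign(X @ w) != y))          # training error of sign{(w . x)}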

5.3 NEURAL NETWORKS


In this section we consider classical neural networks, which implement the
first strategy: keep the confidence interval fixed and minimize the empirical
risk.
This idea is used to estimate the weights of all neurons of a multi-layer
perceptron (Neural Network). Instead of linear indicator functions (single
neurons) in the networks one considers a set of sigmoid functions.
The method for calculating the gradient of the empirical risk for the sigmoid
approximation of neural networks, called the back-propagation method,
was proposed¹ in 1986 (Rumelhart, Hinton, and Williams, 1986), (LeCun,
1986).
Using this gradient, one can iteratively modify the coefficients (weights)
of a neural net on the basis of standard gradient-based procedures.

5.3.1 The Back-Propagation Method


To describe the back-propagation method we use the following notations
(Fig. 5.1):

¹See footnote 5 on page 12.



FIGURE 5.1. A Neural Network is a combination of several levels of sigmoid
elements. The outputs of one layer form the input for the next layer.

(i) The neural net contains m + 1 layers: the first layer x(0) describes
the input vector x = (x^1, ..., x^n). We denote the input vectors by

$$x_i = (x_i^1(0), \dots, x_i^n(0)), \qquad i = 1, \dots, \ell,$$

and the image of the input vector x_i(0) on the kth layer by

$$x_i(k) = (x_i^1(k), \dots, x_i^{n_k}(k)), \qquad i = 1, \dots, \ell,$$

where we denote by n_k the dimensionality of the vectors x_i(k) (n_k, for
k = 1, ..., m − 1, can be any number, but n_m = 1).

(ii) The layer k − 1 is connected with the layer k through the (n_k × n_{k−1})
matrix w(k):

$$x_i(k) = S\{w(k)\, x_i(k-1)\}, \qquad k = 1, 2, \dots, m, \quad i = 1, \dots, \ell, \tag{5.5}$$

where S{w(k) x_i(k − 1)} defines the sigmoid function of the vector

$$u_i(k) = w(k)\, x_i(k-1) = (u_i^1(k), \dots, u_i^{n_k}(k))$$

as the vector whose coordinates are transformed by the sigmoid:

$$S(u_i(k)) = \bigl(S(u_i^1(k)), \dots, S(u_i^{n_k}(k))\bigr).$$

The goal is to minimize the functional

$$I(w(1), \dots, w(m)) = \sum_{i=1}^{\ell}\bigl(y_i - x_i(m)\bigr)^2 \tag{5.6}$$

under conditions (5.5).

This optimization problem is solved by using the standard technique of
Lagrange multipliers for equality type constraints. We will minimize the
Lagrange function

$$L(W, X, B) = \frac{1}{2}\sum_{i=1}^{\ell}\bigl(y_i - x_i(m)\bigr)^2 - \sum_{i=1}^{\ell}\sum_{k=1}^{m}\Bigl(b_i(k) \cdot \bigl[x_i(k) - S\{w(k)\, x_i(k-1)\}\bigr]\Bigr),$$

where the b_i(k) are Lagrange multipliers corresponding to the constraints
(5.5) that describe the connections between the vectors x_i(k − 1) and the
vectors x_i(k).

It is known that

$$\nabla L(W, X, B) = 0$$

is a necessary condition for a local minimum of the performance function
(5.6) under the constraints (5.5) (the gradient with respect to all parameters
b_i(k), x_i(k), w(k), i = 1, ..., ℓ, k = 1, ..., m, is equal to zero).
This condition can be split into three subconditions:

$$\text{(i)}\quad \frac{\partial L(W, X, B)}{\partial b_i(k)} = 0, \qquad \forall i, k,$$

$$\text{(ii)}\quad \frac{\partial L(W, X, B)}{\partial x_i(k)} = 0, \qquad \forall i, k,$$

$$\text{(iii)}\quad \frac{\partial L(W, X, B)}{\partial w(k)} = 0, \qquad \forall k.$$

The solution of these equations determines a stationary point (W_0, X_0, B_0)
that includes the desired matrices of weights W_0 = (w^0(1), ..., w^0(m)).
Let us rewrite these three subconditions in explicit form:

(i) The first subcondition gives a set of equations,

$$x_i(k) = S\{w(k)\, x_i(k-1)\}, \qquad i = 1, \dots, \ell, \quad k = 1, \dots, m,$$

with initial conditions

$$x_i(0) = x_i,$$

the equations of the so-called forward dynamics.

(ii) The second subcondition

We consider the second subcondition for two cases: the case k = m
(the last layer) and the case k ≠ m (the hidden layers).
For the last layer we obtain

$$b_i(m) = y_i - x_i(m), \qquad i = 1, \dots, \ell.$$

For the general case (hidden layers) we obtain

$$b_i(k) = w^{T}(k+1)\,\nabla S\{w(k+1)\, x_i(k)\}\, b_i(k+1), \qquad i = 1, \dots, \ell, \quad k = 1, \dots, m-1,$$

where ∇S{w(k + 1) x_i(k)} is a diagonal n_{k+1} × n_{k+1} matrix with
diagonal elements S'(u^r), where u^r is the rth coordinate of the (n_{k+1}-
dimensional) vector w(k + 1) x_i(k). This equation describes the backward
dynamics.

(iii) The third subcondition

Unfortunately the third subcondition does not give a direct method
for computing the matrices of weights w(k), k = 1, ..., m. Therefore,
to estimate the weights, one uses steepest gradient descent:

$$w(k) \longleftarrow w(k) - \gamma(\cdot)\,\frac{\partial L(W, X, B)}{\partial w(k)}, \qquad k = 1, \dots, m.$$

In explicit form this equation is

$$w(k) \longleftarrow w(k) - \gamma(\cdot)\sum_{i=1}^{\ell} b_i(k)\,\nabla S\{w(k)\, x_i(k-1)\}\, x_i^{T}(k-1), \qquad k = 1, 2, \dots, m.$$

This equation describes the rule for the weight update.

5.3.2 The Back-Propagation Algorithm


Therefore the back-propagation algorithm contains three elements:

(i) Forward pass:

$$x_i(k) = S\{w(k)\, x_i(k-1)\}, \qquad i = 1, \dots, \ell, \quad k = 1, \dots, m,$$

with the boundary conditions

$$x_i(0) = x_i, \qquad i = 1, \dots, \ell.$$

(ii) Backward pass:

$$b_i(k) = w^{T}(k+1)\,\nabla S\{w(k+1)\, x_i(k)\}\, b_i(k+1), \qquad i = 1, \dots, \ell, \quad k = 1, \dots, m-1,$$

with the boundary conditions

$$b_i(m) = y_i - x_i(m), \qquad i = 1, \dots, \ell.$$

(iii) Weight update for the weight matrices w(k), k = 1, 2, ..., m:

$$w(k) \longleftarrow w(k) - \gamma(\cdot)\sum_{i=1}^{\ell} b_i(k)\,\nabla S\{w(k)\, x_i(k-1)\}\, x_i^{T}(k-1).$$

Using the back-propagation technique one can achieve a local minimum of
the empirical risk functional.
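A minimal sketch of these three elements for a small network follows; it is an illustration only. The toy XOR data, the layer sizes, the fixed step γ, and the sign convention of the backward variables (which absorbs ∇S into the boundary condition) are all assumptions of the example:

```python
import numpy as np

def S(u):                                      # sigmoid used in every layer
    return np.tanh(u)

def S_prime(u):
    return 1.0 - np.tanh(u) ** 2

def backprop_step(weights, x0, y, gamma=0.1):
    """One forward pass, backward pass, and weight update for one example.

    weights[k-1] plays the role of the matrix w(k) above; the backward
    variables d correspond to the multipliers b_i(k) up to the sign
    convention of the Lagrangian.
    """
    m = len(weights)
    xs, us = [x0], []
    for w in weights:                          # forward dynamics
        us.append(w @ xs[-1])
        xs.append(S(us[-1]))
    d = [None] * m
    d[m - 1] = (xs[-1] - y) * S_prime(us[-1])  # boundary condition, last layer
    for k in range(m - 2, -1, -1):             # backward dynamics
        d[k] = (weights[k + 1].T @ d[k + 1]) * S_prime(us[k])
    for k in range(m):                         # weight update
        weights[k] -= gamma * np.outer(d[k], xs[k])
    return float(0.5 * (xs[-1][0] - y[0]) ** 2)

# Assumed toy problem: the XOR function with a 2-3-1 network.
rng = np.random.default_rng(1)
weights = [rng.normal(scale=0.5, size=(3, 2)), rng.normal(scale=0.5, size=(1, 3))]
data = [(np.array(a, float), np.array([b], float))
        for a, b in [((-1, -1), -1), ((-1, 1), 1), ((1, -1), 1), ((1, 1), -1)]]
for epoch in range(3000):
    for x0, y in data:
        backprop_step(weights, x0, y)
print([float(np.sign(S(weights[1] @ S(weights[0] @ x0)))[0]) for x0, _ in data])
```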

5.3.3 Neural Networks for the Regression Estimation Problem

To adapt neural networks to solving the regression estimation problem, it is
sufficient to use in the last layer a linear function instead of a sigmoid one.
This implies only the following changes in the equations described above:

$$x_i(m) = w(m)\, x_i(m-1), \qquad i = 1, \dots, \ell,$$

$$\nabla S\{w(m)\, x_i(m-1)\} = 1.$$

5.3.4 Remarks on the Back-Propagation Method


The main problems with the neural net approach are:
(i) The empirical risk functional has many local minima. Standard opti-
mization procedures guarantee convergence to one of them. The qual-
ity of the obtained solution depends on many factors, in particular
on the initialization of weight matrices w( k), k = 1, ... , m.
The choice of initialization parameters to achieve a "small" local min-
imum is based on heuristics.
(ii) The convergence of the gradient-based method is rather slow. There
are several heuristics to speed up the rate of convergence.
(iii) The sigmoid function has a scaling factor which affects the quality
of the approximation. The choice of the scaling factor is a trade-off
between the quality of approximation and the rate of convergence.
There are empirical recommendations for choosing the scaling factor.
Therefore neural networks are not well controlled learning machines. Never-
theless, in many practical applications, Neural Networks demonstrate good
results.

5.4 THE OPTIMAL SEPARATING HYPERPLANE

Below we consider a new type of universal learning machine that imple-


ments the second strategy: keep the value of the empirical risk fixed and
minimize the confidence interval.
As in the case of neural networks, we start by considering linear deci-
sion rules (the separating hyperplanes). However, in contrast to previous
considerations, we use a special type of hyperplane, the so-called Optimal
separating hyperplanes (Vapnik and Chervonenkis, 1974), (Vapnik, 1979).
First we consider the Optimal separating hyperplane for the case where the
training data are linearly separable. Then, in Section 5.5.1 we generalize the
idea of Optimal separating hyperplanes to the case of nonseparable data.
Using a technique for constructing Optimal hyperplanes, we describe a new
type of universal learning machine, the Support Vector machine. Finally
we construct the Support Vector machine for solving regression estimation
problems.

5.4.1 The Optimal Hyperplane


Suppose the training data

$$(y_1, x_1), \dots, (y_\ell, x_\ell), \qquad y \in \{-1, +1\},$$

can be separated by a hyperplane:

$$(w \cdot x) + b = 0. \tag{5.7}$$

We say that this set of vectors is separated by the Optimal hyperplane if it
is separated without error and the distance from the closest vector to
the hyperplane is maximal (Fig. 5.2).

To describe the separating hyperplane let us use the following canonical
form:

$$(w \cdot x_i) + b \ge 1 \quad \text{if } y_i = 1,$$

$$(w \cdot x_i) + b \le -1 \quad \text{if } y_i = -1.$$

In the following we use a compact notation for these inequalities:

$$y_i[(w \cdot x_i) + b] \ge 1, \qquad i = 1, \dots, \ell. \tag{5.8}$$

It is easy to check that the Optimal hyperplane is the one that satisfies the
conditions (5.8) and minimizes

$$\|w\|^2 = (w \cdot w). \tag{5.9}$$

(The minimization is taken with respect to both the vector w and the scalar b.)

FIGURE 5.2. The Optimal separating hyperplane is the one that separates the
data with maximal margin.

5.4.2 The Structure of Canonical Hyperplanes

Now let the separating hyperplanes be defined on the set X* of vectors
bounded by a sphere of radius R:

$$|x_i - a| \le R, \qquad x_i \in X^*$$

(a is the center of the sphere). Consider a set of hyperplanes in canonical
form (with respect to these vectors) defined by the pairs (w, b) satisfying
the condition

$$\min_{x_i \in X^*} |(w \cdot x_i) + b| = 1.$$

Note that the set of canonical separating hyperplanes coincides with the
set of all separating hyperplanes. The canonical form only specifies the
normalization of the parameters of the hyperplanes.

The idea of constructing a machine that fixes the empirical risk and
minimizes the confidence interval is based on the existence of the following
bound on the VC dimension of canonical hyperplanes.

Theorem 5.1. A subset of canonical hyperplanes

$$f(x, w, b) = \operatorname{sign}\{(w \cdot x) + b\},$$

defined on X* and satisfying the constraint

$$\|w\| \le A,$$

has the VC dimension h bounded by the inequality

$$h \le \min\bigl([R^2 A^2],\, n\bigr) + 1.$$

In Section 3.5 we stated that the VC dimension of the set of hyperplanes
is equal to n + 1, where n is the dimensionality of the space. However, the
VC dimension of the subset of hyperplanes in canonical form satisfying
‖w‖² ≤ A² can be less.²

Below we consider hyperplanes only in canonical form, constructed on
the basis of the training vectors X* = {x_1, ..., x_ℓ}.³ For simplicity we call
them hyperplanes.

Let us construct a structure on the set of hyperplanes by increasing the
norm of the weights w. Then, in order to obtain the smallest probability
of error on the test set, we choose the hyperplane that separates the
training data and belongs to the element of the structure with the smallest
bound on the VC dimension, that is, with the smallest norm of weights.

5.5 CONSTRUCTING THE OPTIMAL HYPERPLANE

To construct the Optimal hyperplane one has to separate the vectors x_i of
the training set

$$(y_1, x_1), \dots, (y_\ell, x_\ell),$$

belonging to two different classes y ∈ {−1, 1}, using the hyperplane with
the smallest norm of coefficients.

To find this hyperplane one has to solve the following quadratic
programming problem: minimize the functional

$$\Phi(w) = \frac{1}{2}(w \cdot w) \tag{5.10}$$

under the constraints of inequality type

$$y_i[(x_i \cdot w) + b] \ge 1, \qquad i = 1, 2, \dots, \ell. \tag{5.11}$$

The solution to this optimization problem is given by the saddle point of
the Lagrange functional (Lagrangian):

$$L(w, b, \alpha) = \frac{1}{2}(w \cdot w) - \sum_{i=1}^{\ell}\alpha_i\bigl\{[(x_i \cdot w) + b]\,y_i - 1\bigr\}, \tag{5.12}$$

where the α_i are Lagrange multipliers. The Lagrangian has to be minimized
with respect to w, b and maximized with respect to α_i ≥ 0.

²In Section 5.7 we describe a separating hyperplane in a 10¹³-dimensional
space with a relatively small estimate of the VC dimension (≈ 10³).
³In Section 5.11 we will discuss this choice of the set X*.

At the saddle point, the solutions w_0, b_0, and α^0 should satisfy the
conditions

$$\frac{\partial L(w_0, b_0, \alpha^0)}{\partial b} = 0, \qquad \frac{\partial L(w_0, b_0, \alpha^0)}{\partial w} = 0.$$

Rewriting these equations in explicit form, one obtains the following
properties of the Optimal hyperplane:

(i) The coefficients α_i^0 for the Optimal hyperplane should satisfy the
constraints

$$\sum_{i=1}^{\ell}\alpha_i^0 y_i = 0, \qquad \alpha_i^0 \ge 0, \quad i = 1, \dots, \ell \tag{5.13}$$

(first equation).

(ii) The Optimal hyperplane (vector w_0) is a linear combination of the
vectors of the training set:

$$w_0 = \sum_{i=1}^{\ell} y_i \alpha_i^0 x_i, \qquad \alpha_i^0 \ge 0, \quad i = 1, \dots, \ell \tag{5.14}$$

(second equation).

(iii) Moreover, only the so-called support vectors can have nonzero
coefficients α_i^0 in the expansion of w_0. The support vectors are the vectors
for which, in inequality (5.11), the equality is achieved. Therefore we
obtain

$$w_0 = \sum_{\text{support vectors}} y_i \alpha_i^0 x_i, \qquad \alpha_i^0 \ge 0. \tag{5.15}$$

This fact follows from the classical Kühn-Tucker theorem, according
to which the necessary and sufficient conditions for the Optimal
hyperplane are that the separating hyperplane satisfy the conditions

$$\alpha_i^0\bigl\{[(x_i \cdot w_0) + b_0]\,y_i - 1\bigr\} = 0, \qquad i = 1, \dots, \ell. \tag{5.16}$$

Putting the expression for w_0 into the Lagrangian and taking into account
the Kühn-Tucker conditions, one obtains the functional

$$W(\alpha) = \sum_{i=1}^{\ell}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{\ell}\alpha_i\alpha_j y_i y_j (x_i \cdot x_j). \tag{5.17}$$

It remains to maximize this functional in the non-negative quadrant

$$\alpha_i \ge 0, \qquad i = 1, \dots, \ell, \tag{5.18}$$

under the constraint

$$\sum_{i=1}^{\ell}\alpha_i y_i = 0. \tag{5.19}$$

According to Eq. (5.15), the Lagrange multipliers and support vectors
determine the Optimal hyperplane. Thus, to construct the Optimal hyperplane
one has to solve a simple quadratic programming problem: maximize the
quadratic form (5.17) under constraints⁴ (5.18) and (5.19).
Let α_0 = (α_1^0, ..., α_ℓ^0) be a solution to this quadratic optimization
problem. Then the norm of the vector w_0 corresponding to the Optimal
hyperplane equals

$$\|w_0\|^2 = 2W(\alpha_0) = \sum_{\text{support vectors}}\alpha_i^0\alpha_j^0 (x_i \cdot x_j)\, y_i y_j.$$

The separating rule, based on the Optimal hyperplane, is the following
indicator function:

$$f(x) = \operatorname{sign}\Bigl(\sum_{\text{support vectors}} y_i\alpha_i^0 (x_i \cdot x) - b_0\Bigr), \tag{5.20}$$

where the x_i are the support vectors, the α_i^0 are the corresponding Lagrange
coefficients, and b_0 is the constant (threshold)

$$b_0 = \frac{1}{2}\bigl[(w_0 \cdot x^*(1)) + (w_0 \cdot x^*(-1))\bigr],$$

where we denote by x*(1) some (any) support vector belonging to the first
class and by x*(−1) a support vector belonging to the second
class (Vapnik and Chervonenkis, 1974), (Vapnik, 1979).
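As an illustration of this quadratic programming problem (not part of the original exposition), the sketch below solves the dual numerically for a small separable data set; the toy data and the use of SciPy's SLSQP solver are assumptions of the example:

```python
import numpy as np
from scipy.optimize import minimize

# Assumed toy data: two linearly separable classes in the plane.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, (20, 2)), rng.normal(2.0, 0.5, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])
ell = len(y)

H = (y[:, None] * y[None, :]) * (X @ X.T)           # H_ij = y_i y_j (x_i . x_j)

# Maximize W(alpha) of (5.17), i.e., minimize -W(alpha), under (5.18), (5.19).
res = minimize(lambda a: -a.sum() + 0.5 * a @ H @ a,
               x0=np.zeros(ell),
               jac=lambda a: H @ a - np.ones(ell),
               bounds=[(0.0, None)] * ell,                            # (5.18)
               constraints=[{"type": "eq", "fun": lambda a: a @ y}],  # (5.19)
               method="SLSQP")
alpha = res.x
sv = alpha > 1e-6                                   # the support vectors
w0 = (alpha[sv] * y[sv]) @ X[sv]                    # expansion (5.15)
i_pos = np.where(sv & (y == 1))[0][0]               # any support vector, class +1
i_neg = np.where(sv & (y == -1))[0][0]              # any support vector, class -1
b0 = 0.5 * (w0 @ X[i_pos] + w0 @ X[i_neg])
f = np.sign(X @ w0 - b0)                            # decision rule (5.20)
print("support vectors:", int(sv.sum()), " training errors:", int(np.sum(f != y)))
```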

5.5.1 Generalization for the Nonseparable Case


To construct the Optimal-type hyperplane in the case when the data are
linearly nonseparable, we introduce non-negative variables ξ_i ≥ 0 and the
function

$$F_\sigma(\xi) = \sum_{i=1}^{\ell}\xi_i^\sigma$$

with parameter σ > 0.

⁴This quadratic programming problem is simple because it has simple
constraints. For the solution of this problem, one can use special methods that are
fast and applicable for the case with a large number of support vectors (≈ 10⁴
support vectors) (Moré and Toraldo, 1991). Note that in the training data the
support vectors constitute only a small part of the training vectors (in our
experiments, 3% to 5%).


Let us minimize the functional F_σ(ξ) subject to the constraints

$$y_i[(w \cdot x_i) + b] \ge 1 - \xi_i, \qquad \xi_i \ge 0, \quad i = 1, 2, \dots, \ell, \tag{5.21}$$

and one more constraint

$$(w \cdot w) \le c_n. \tag{5.22}$$

For sufficiently small σ > 0 the solution to this optimization problem
defines a hyperplane that minimizes the number of training errors under the
condition that the parameters of this hyperplane belong to the subset (5.22)
(to the element of the structure

$$S_n = \{(w \cdot x) + b:\ (w \cdot w) \le c_n\}$$

determined by the constant c_n).

For computational reasons, however, we consider the case σ = 1. This
case corresponds to the smallest σ > 0 that is still computationally simple.
We call this hyperplane the Generalized Optimal hyperplane.

1. One can show (using the technique described above) that the
Generalized Optimal hyperplane is determined by the vector

$$w = \frac{1}{C^*}\sum_{i=1}^{\ell}\alpha_i y_i x_i,$$

where the parameters α_i, i = 1, ..., ℓ, and C* are the solutions to the
following convex optimization problem:

Maximize the functional

$$W(\alpha, C^*) = \sum_{i=1}^{\ell}\alpha_i - \frac{1}{2C^*}\sum_{i,j=1}^{\ell}\alpha_i\alpha_j y_i y_j (x_i \cdot x_j) - \frac{c_n C^*}{2}$$

subject to the constraints

$$\sum_{i=1}^{\ell} y_i\alpha_i = 0,$$

$$0 \le \alpha_i \le 1, \qquad i = 1, \dots, \ell,$$

$$C^* \ge 0.$$

2. To simplify computations one can introduce the following (slightly
modified) concept of the Generalized Optimal hyperplane (Cortes and
Vapnik, 1995). The Generalized Optimal hyperplane is determined by the
vector w that minimizes the functional

$$\Phi(w, \xi) = \frac{1}{2}(w \cdot w) + C\sum_{i=1}^{\ell}\xi_i$$

(here C is a given value) subject to constraints (5.21).

The technique of solution of this quadratic optimization problem is
almost equivalent to the technique used in the separable case: to find the
coefficients of the Generalized Optimal hyperplane

$$w = \sum_{i=1}^{\ell}\alpha_i y_i x_i,$$

one has to find the parameters α_i, i = 1, ..., ℓ, that maximize the same
quadratic form (5.17) as in the separable case under the slightly different
constraints

$$0 \le \alpha_i \le C, \qquad i = 1, \dots, \ell,$$

$$\sum_{i=1}^{\ell}\alpha_i y_i = 0.$$

As in the separable case, only some of the coefficients α_i, i = 1, ..., ℓ, differ
from zero. They determine the support vectors.

Note that if the coefficient C in the functional Φ(w, ξ) is equal to the
optimal value of the parameter C* for the minimization of the functional F_1(ξ),

$$C = C^*,$$

then the solutions to both optimization problems (defined by the functional
F_1(ξ) and by the functional Φ(w, ξ)) coincide.
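In a numerical sketch, the only change relative to the separable case is the upper bound C on the multipliers. Continuing the hypothetical example given after Section 5.5 (it reuses the names X, y, H, and ell defined there; the value C = 10 is an arbitrary assumption):

```python
# Soft-margin (nonseparable) variant of the earlier sketch.
import numpy as np
from scipy.optimize import minimize

C = 10.0
res = minimize(lambda a: -a.sum() + 0.5 * a @ H @ a,
               x0=np.zeros(ell),
               jac=lambda a: H @ a - np.ones(ell),
               bounds=[(0.0, C)] * ell,             # 0 <= alpha_i <= C
               constraints=[{"type": "eq", "fun": lambda a: a @ y}],
               method="SLSQP")
```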

5.6 SUPPORT VECTOR (SV) MACHINES

The Support Vector (SV) machine implements the following idea: it maps
the input vectors x into a high-dimensional feature space Z through some
nonlinear mapping, chosen a priori. In this space, an Optimal separating
hyperplane is constructed (Fig. 5.3).

FIGURE 5.3. The SV machine maps the input space into a high-dimensional
feature space and then constructs an Optimal hyperplane in the feature space.

Example. To construct a decision surface corresponding to a polynomial
of degree two, one can create a feature space Z that has N = n(n+3)/2
coordinates of the form

$$z^1 = x^1,\ \dots,\ z^n = x^n \qquad (n \text{ coordinates}),$$

$$z^{n+1} = (x^1)^2,\ \dots,\ z^{2n} = (x^n)^2 \qquad (n \text{ coordinates}),$$

$$z^{2n+1} = x^1 x^2,\ \dots,\ z^{N} = x^{n-1}x^{n} \qquad \left(\frac{n(n-1)}{2} \text{ coordinates}\right),$$

where x = (x^1, ..., x^n). The separating hyperplane constructed in this
space is a second degree polynomial in the input space.
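A small sketch of this mapping follows (an illustration only; the √2 rescaling of the cross terms is an assumption added here so that the inner product in Z takes a closed form, and it does not change the set of second degree decision surfaces):

```python
import numpy as np
from itertools import combinations

def phi(x):
    """Degree-two feature map of the example: x^i, (x^i)^2, x^i x^j (i < j).

    The sqrt(2) factor on the cross terms is an added assumption that gives
    the inner product in Z a closed form.
    """
    cross = [np.sqrt(2.0) * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate([x, x ** 2, np.array(cross)])

n = 5
assert phi(np.ones(n)).size == n * (n + 3) // 2      # N = n(n+3)/2 coordinates

# The inner product in Z can be computed from the inner product in the input
# space alone, without forming the N coordinates explicitly:
rng = np.random.default_rng(0)
x, xp = rng.normal(size=n), rng.normal(size=n)
assert np.isclose(phi(x) @ phi(xp), (x @ xp) + (x @ xp) ** 2)
```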

Two problems arise in the above approach: one conceptual and one tech-
nical.
(i) How to find a separating hyperplane that will generalize well?
(The conceptual problem.)
The dimensionality of the feature space will be huge, and a hyperplane
that separates the training data will not necessarily generalize well.⁵

(ii) How to treat computationally such high-dimensional spaces?
(The technical problem.)
To construct a polynomial of degree 4 or 5 in a 200-dimensional
space it is necessary to construct hyperplanes in a billion-dimensional
feature space. How can this "curse of dimensionality" be overcome?

⁵Recall Fisher's concern about the small amount of data for constructing a
quadratic discriminant function in classical discriminant analysis (Section 1.9).

5.6.1 Generalization in High-Dimensional Space


The conceptual part of this problem can be solved by constructing the
Optimal hyperplane.
According to Theorem 5.1, if it happens that in the high-dimensional
input space one can construct a separating hyperplane with a small value
of [R2 A2], the VC dimension of the corresponding element of the structure
will be small, and therefore the generalization ability of the constructed
hyperplane will be high.
Furthermore, the following theorem holds.
Theorem 5.2. If the training vectors are separated by the Optimal
hyperplane (or the Generalized Optimal hyperplane), then the expectation of
the probability of committing an error on a test example is bounded by the
ratio of the expectation of the number of support vectors to the number of
examples in the training set:

$$E[P(\text{error})] \le \frac{E[\text{number of support vectors}]}{(\text{number of training vectors}) - 1}. \tag{5.23}$$

This bound depends neither on the dimensionality of the space, nor on
the norm of the vector of coefficients, nor on the bound on the norm of
the input vectors. Therefore, if the Optimal hyperplane can be constructed
from a small number of support vectors relative to the training set size,
the generalization ability will be high, even in an infinite-dimensional
space.⁶

5.6.2 Convolution of the Inner Product

However, even if the Optimal hyperplane generalizes well and can
theoretically be found, the technical problem of how to treat the high-dimensional
feature space remains.

In 1992 it was observed (Boser, Guyon, and Vapnik, 1992) that for
constructing the Optimal separating hyperplane in the feature space Z, one

⁶One can compare the result of this theorem to the result of analysis of the
following compression scheme. To construct the Optimal separating hyperplane
one only needs to specify, among the training data, the support vectors and their
classification. This requires ⌈log₂ m⌉ bits to specify the number m of support
vectors, ⌈log₂ C_ℓ^m⌉ bits to specify the support vectors, and ⌈log₂ C_m^{m₁}⌉ bits to
specify the representatives of the first class among the support vectors. Therefore
for m ≪ ℓ and m₁ ≈ m/2 the compression coefficient is small.
The expectation of this coefficient should be compared to the value E m/(ℓ − 1)
(the right-hand side of inequality (5.23)).

does not need to consider the feature space in explicit form. One only has
to be able to calculate the inner products between support vectors and the
vectors of the feature space (Eqs. (5.17) and (5.20)).

Consider a general expression for the inner product in Hilbert space⁷:

$$(z \cdot z_i) = K(x, x_i),$$

where z is the image in feature space of the vector x in input space.

According to Hilbert-Schmidt theory, K(x, x_i) can be any symmetric
function satisfying the following general conditions (Courant and Hilbert,
1953):

Theorem 5.3. (Mercer) To guarantee that a symmetric function K(u, v)
from L₂ has an expansion

$$K(u, v) = \sum_{k=1}^{\infty} a_k \psi_k(u)\,\psi_k(v) \tag{5.24}$$

with positive coefficients a_k > 0 (i.e., K(u, v) describes an inner product in
some feature space), it is necessary and sufficient that the condition

$$\iint K(u, v)\,g(u)\,g(v)\,du\,dv > 0$$

be valid for all g ≠ 0 for which

$$\int g^2(u)\,du < \infty.$$
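One practical consequence of the theorem, shown here as an illustrative numerical check (the kernel and the random points are assumptions of the example): every matrix of values K(x_i, x_j) computed on data points must have no negative eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))                 # assumed sample points

K = lambda u, v: (u @ v + 1.0) ** 3          # a kernel satisfying Theorem 5.3

G = np.array([[K(u, v) for v in X] for u in X])
print("smallest eigenvalue:", np.linalg.eigvalsh(G).min())  # non-negative up to round-off
```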

5.6.3 Constructing SV Machines

The convolution of the inner product allows the construction of decision
functions that are nonlinear in the input space,

$$f(x) = \operatorname{sign}\Bigl(\sum_{\text{support vectors}} y_i\alpha_i K(x_i, x) - b\Bigr), \tag{5.25}$$

and that are equivalent to linear decision functions in the high-dimensional
feature space ψ₁(x), ..., ψ_N(x) (K(x_i, x) is a convolution of the inner
product for this feature space).

⁷This idea was used in 1964 by Aizerman, Braverman, and Rozonoer in their
analysis of the convergence properties of the method of Potential functions
(Aizerman, Braverman, and Rozonoer, 1964, 1970). It happened at the same time
(1965) when the method of the Optimal hyperplane was developed (Vapnik and
Chervonenkis, 1965). However, combining these two ideas, which lead to the SV
machines, was only done in 1992.

To find the coefficients α_i in the separable case (analogously in the
nonseparable case) it is sufficient to find the maximum of the functional

$$W(\alpha) = \sum_{i=1}^{\ell}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{\ell}\alpha_i\alpha_j y_i y_j K(x_i, x_j) \tag{5.26}$$

subject to the constraints

$$\sum_{i=1}^{\ell}\alpha_i y_i = 0, \qquad \alpha_i \ge 0, \quad i = 1, 2, \dots, \ell. \tag{5.27}$$

This functional coincides with the functional for finding the Optimal
hyperplane, except for the form of the inner products: instead of the inner
products (x_i · x_j) in Eq. (5.17), we now use the convolution of the inner
products, K(x_i, x_j).

The learning machines that construct decision functions of the type
(5.25) are called Support Vector (SV) Machines. (With this name we stress
the idea of expanding the solution on support vectors. In SV machines the
complexity of the construction depends on the number of support vectors
rather than on the dimensionality of the feature space.) The scheme of SV
machines is shown in Fig. 5.4.
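The following sketch assembles these pieces for the classical XOR problem, which is not linearly separable in input space; the degree-two polynomial kernel, the toy data, and the SciPy solver are assumptions of this illustration:

```python
import numpy as np
from scipy.optimize import minimize

# Assumed toy data: the XOR pattern.
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], float)
y = np.array([-1.0, 1.0, 1.0, -1.0])
ell = len(y)

K = lambda u, v: (u @ v + 1.0) ** 2                 # polynomial kernel, d = 2
Kmat = np.array([[K(u, v) for v in X] for u in X])
H = (y[:, None] * y[None, :]) * Kmat                # the only change vs. (5.17)

res = minimize(lambda a: -a.sum() + 0.5 * a @ H @ a,
               x0=np.zeros(ell), method="SLSQP",
               bounds=[(0.0, None)] * ell,
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
a = res.x

# Threshold from one support vector of each class, as in the linear case.
g = lambda x: sum(a[i] * y[i] * K(X[i], x) for i in range(ell))
i_pos, i_neg = int(np.argmax(a * (y > 0))), int(np.argmax(a * (y < 0)))
b = 0.5 * (g(X[i_pos]) + g(X[i_neg]))
f = lambda x: np.sign(g(x) - b)                     # decision rule (5.25)
print([f(x) for x in X])    # expected to reproduce the XOR labels [-1, 1, 1, -1]
```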

5.6.4 Examples of SV Machines


Using different functions for convolution of the inner products K(x, Xi), one
can construct learning machines with different types of nonlinear decision
surfaces in input space. Below, we consider three types of learning machines:

(i) Polynomial Learning Machines,

(ii) Radial Basis Function Machines, and

(iii) Two-Layer Neural Networks.
For simplicity we consider here the regime where the training vectors are
separated without error.

Note that the support vector machines implement the SRM principle.
Indeed, let

$$\Psi(x) = (\psi_1(x), \dots, \psi_N(x))$$

be a feature space and w = (w₁, ..., w_N) be a vector of weights determining
a hyperplane in this space. Consider a structure on the set of hyperplanes
with elements S_k containing the functions satisfying the conditions

$$R^2\,\|w\|^2 \le k,$$

FIGURE 5.4. The two-layer SV machine is a compact realization of an Optimal
hyperplane in the high-dimensional feature space Z.

where R is the radius of the smallest sphere that contains the vectors Ψ(x),
and ‖w‖ is the norm of the weights (we use canonical hyperplanes in feature
space with respect to the vectors z_i = Ψ(x_i), where the x_i are the elements
of the training data).

According to Theorem 5.1 (now applied in the feature space), k gives an
estimate of the VC dimension of the set of functions S_k.

The SV machine separates the training data without error,

$$y_i[(\Psi(x_i) \cdot w) + b] \ge 1, \qquad y_i \in \{+1, -1\}, \quad i = 1, 2, \dots, \ell,$$

and has a minimal norm ‖w‖.

In other words, the SV machine separates the training data using
functions from the element S_k with the smallest estimate of the VC dimension.

Recall that in the feature space the equality

$$\|w_0\|^2 = \sum_{i,j=1}^{\ell}\alpha_i^0\alpha_j^0 K(x_i, x_j)\, y_i y_j \tag{5.28}$$

holds true. To control the generalization ability of the machine (to
minimize the probability of test errors) one has to construct the separating

hyperplane that minimizes the functional

$$R^2\,\|w_0\|^2. \tag{5.29}$$

Indeed, for separating hyperplanes the probability of test errors with
probability 1 − η is bounded by the expression

$$\mathcal{E} = 4\,\frac{h\bigl(\ln\frac{2\ell}{h} + 1\bigr) - \ln(\eta/4)}{\ell}.$$

The right-hand side attains its minimum when h/ℓ is minimal. We estimate
the minimum of h/ℓ by estimating h by h_est = R²‖w₀‖². To estimate
this functional it is sufficient to estimate ‖w₀‖² (say by expression (5.28))
and to estimate R² by finding

$$R^2 = R^2(K) = \min_a \max_{x_i}\bigl[K(x_i, x_i) + K(a, a) - 2K(x_i, a)\bigr]. \tag{5.30}$$
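A sketch of this estimate (an illustration only; the kernel, the data, and the derivative-free search used for the center a are all assumptions of the example):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(25, 3))                 # assumed training points
K = lambda u, v: (u @ v + 1.0) ** 2          # assumed kernel

def radius2(a):                              # the inner max of (5.30)
    return max(K(x, x) + K(a, a) - 2.0 * K(x, a) for x in X)

res = minimize(radius2, x0=X.mean(axis=0), method="Nelder-Mead")
print("estimate of R^2(K):", res.fun)
# Together with |w_0|^2 from (5.28) this gives h_est = R^2 |w_0|^2.
```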

Polynomial Learning Machine

To construct polynomial decision rules of degree d, one can use the
following function for convolution of the inner product:

$$K(x, x_i) = \bigl[(x \cdot x_i) + 1\bigr]^d. \tag{5.31}$$

This symmetric function satisfies the conditions of Theorem 5.3; therefore
it describes a convolution of the inner product in a feature space that
contains all products of up to d coordinates of x. Using the described technique,
one constructs a decision function of the form

$$f(x, \alpha) = \operatorname{sign}\Bigl(\sum_{\text{support vectors}} y_i\alpha_i\bigl[(x_i \cdot x) + 1\bigr]^d - b\Bigr),$$

which is a polynomial of degree d in the n-dimensional input space.

In spite of the very high dimensionality of the feature space (polynomials
of degree d in n-dimensional input space have O(n^d) free parameters), the
estimate of the VC dimension of the subset of polynomials that solve real-
life problems can be low.

As described above, to estimate the VC dimension of the element of the
structure from which the decision function is chosen, one has only to
estimate the radius R of the smallest sphere that contains the training data,
and the norm of the weights in feature space (Theorem 5.1).

Note that both the radius R = R(d) and the norm of the weights in the
feature space depend on the degree of the polynomial.

This gives the opportunity to choose the best degree of polynomial for
the given data.

To make a local polynomial approximation in the neighborhood of a point
of interest x₀, let us consider the hard-threshold neighborhood function
(4.16). According to the theory of local algorithms, one chooses a ball with
radius R_β around the point x₀ in which ℓ_β elements of the training set fall,
and then, using only these training data, constructs the decision function
that minimizes the probability of errors in the chosen neighborhood. The
solution to this problem is a radius R_β that minimizes the functional

$$\frac{R_\beta^2\,\|w_0\|^2}{\ell_\beta} \tag{5.32}$$

(the parameter ‖w₀‖ depends on the chosen radius as well). This functional
describes a trade-off between the chosen radius R_β, the value of the minimum
of the norm ‖w₀‖, and the number ℓ_β of training vectors that fall into
the ball of radius R_β.
Radial Basis Function Machines

Classical Radial Basis Function (RBF) Machines use the following set of
decision rules:

$$f(x) = \operatorname{sign}\Bigl(\sum_{i=1}^{N} a_i K_\gamma(|x - x_i|) - b\Bigr), \tag{5.33}$$

where K_γ(|x − x_i|) depends on the distance |x − x_i| between two vectors.
For the theory of RBF machines see (Micchelli, 1986), (Powell, 1992).

The function K_γ(|x − x_i|) is, for any fixed γ, a non-negative monotonic
function of the distance z = |x − x_i|; it tends to zero as z goes to infinity.
The most popular function of this type is

$$K_\gamma(|x - x_i|) = \exp\{-\gamma\,|x - x_i|^2\}. \tag{5.34}$$

To construct the decision rule (5.33) one has to estimate

(i) the value of the parameter γ,

(ii) the number N of the centers x_i,

(iii) the vectors x_i describing the centers, and

(iv) the values of the parameters a_i.

In the classical RBF method the first three steps (determining the
parameters γ, N, and the vectors (centers) x_i, i = 1, ..., N) are based on
heuristics, and only the fourth step (after finding these parameters) is
determined by minimizing the empirical risk functional.

The radial function can be chosen as a function for the convolution of the
inner product for an SV machine. In this case, the SV machine will construct
a function from the set (5.33). One can show (Aizerman, Braverman, and

Rozonoer, 1964, 1970) that radial functions (5.34) satisfy the condition of
Theorem 5.3.
In contrast to classical RBF methods, in the SV technique all four types
of parameters are chosen to minimize the bound on the probability of test
error by controlling the parameters R, w₀ in the functional (5.29). By
minimizing the functional (5.29) one determines

(i) N, the number of support vectors,

(ii) x_i, (the pre-images of) the support vectors,

(iii) a_i = α_i y_i, the coefficients of the expansion, and

(iv) γ, the width parameter of the kernel function.

Two-Layer Neural Networks

Finally, one can define two-layer neural networks by choosing kernels

$$K(x, x_i) = S[v(x \cdot x_i) + c],$$

where S(u) is a sigmoid function. In contrast to kernels for polynomial
machines or for radial basis function machines, which always satisfy Mercer's
condition, the sigmoid kernel tanh(vu + c), |u| ≤ 1, satisfies Mercer's
condition only for some values of the parameters v, c. For these values of the
parameters one can construct SV machines implementing the rules

$$f(x, \alpha) = \operatorname{sign}\Bigl(\sum_{i=1}^{N}\alpha_i S\bigl(v(x \cdot x_i) + c\bigr) + b\Bigr).$$

Using the technique described above, the following are found automatically:

(i) the architecture of the two-layer machine, determining the number
N of hidden units (the number of support vectors),

(ii) the vectors of the weights w_i = x_i in the neurons of the first (hidden)
layer (the support vectors), and

(iii) the vector of weights for the second layer (the values of α).

5.7 EXPERIMENTS WITH SV MACHINES

In the following we present two types of experiments on constructing
decision rules in pattern recognition problems⁸:

⁸The experiments were conducted in the Adaptive System Research
Department, AT&T Bell Laboratories.

FIGURE 5.5. Two classes of vectors are represented in the picture by black and
white balls. The decision boundaries were constructed using an inner product of
polynomial type with d = 2. In the pictures the examples cannot be separated
without errors; the errors are indicated by crosses and the support vectors by
double circles.

(i) Experiments in the plane with artificial data that can be visualized,
and

(ii) experiments with real-life data.

5.7.1 Example in the Plane


To demonstrate the SV technique we first give an artificial example (Fig.
5.5).
The two classes of vectors are represented in the picture by black and
white balls. The decision boundaries were constructed using an inner
product of polynomial type with d = 2. In the pictures the examples cannot
be separated without errors; the errors are indicated by crosses and the
support vectors by double circles.
Notice that in both examples the number of support vectors is small
relative to the number of training data and that the number of training
errors is minimal for polynomials of degree two.

5.7.2 Handwritten Digit Recognition


Since the first experiments of Rosenblatt, the interest in the problem of
learning to recognize handwritten digits has remained strong. In the fol-
lowing we describe results of experiments on learning the recognition of
handwritten digits using different SV machines. We also compare these re-
sults to results obtained by other classifiers. In these experiments, the U.S.
Postal Service database (LeCun et al., 1990) was used. It contains 7,300
training patterns and 2,000 test patterns collected from real-life zip codes.
The resolution of the database is 16 × 16 pixels; therefore, the dimensionality

of the input space is 256. Figure 5.6 gives examples from this database.
Table 5.1 describes the performance of various classifiers solving this
problem.⁹

Classifier                        Raw error, %
Human performance                      2.5
Decision tree, C4.5                   16.2
Best two-layer neural network          5.9
Five-layer network (LeNet 1)           5.1

TABLE 5.1. Human performance and performance of the various learning
machines solving the problem of digit recognition on U.S. Postal Service data.

For constructing the decision rules three types of SV machines were
used¹⁰:

(i) A polynomial machine with convolution function

$$K(x, x_i) = \left(\frac{(x \cdot x_i)}{256}\right)^{d}, \qquad d = 1, \dots, 7.$$

(ii) A radial basis function machine with convolution function

$$K(x, x_i) = \exp\left\{-\frac{|x - x_i|^2}{256\,\sigma^2}\right\}.$$

(iii) A two-layer neural network machine with convolution function

$$K(x, x_i) = \tanh\left(\frac{b\,(x \cdot x_i)}{256} - c\right).$$

All machines constructed ten classifiers, each one separating one class from
the rest. The ten-class classification was done by choosing the class with
the largest classifier output value.

The results of these experiments are given in Table 5.2. For the different
types of SV machines, Table 5.2 shows the best parameters for the machines
(column 2), the average (over one classifier) number of support vectors, and
the performance of the machine.

⁹The result of human performance was reported by J. Bromley and E.
Säckinger; the result of C4.5 was obtained by C. Cortes; the result for the two-
layer neural net was obtained by B. Schölkopf; the results for the special purpose
neural network architecture with five layers (LeNet 1) were obtained by Y. LeCun
et al.

¹⁰The results were obtained by C. Burges, C. Cortes, and B. Schölkopf.

FIGURE 5.6. Examples of patterns (with labels) from the U.S. Postal Service
database.

Type of            Parameters       Number of          Raw
SV classifier      of classifier    support vectors    error
Polynomials        d = 3                274            4.0
RBF classifiers    σ² = 0.3             291            4.1
Neural network     b = 2, c = 1         254            4.2

TABLE 5.2. Results of digit recognition experiments with various SV machines
using the U.S. Postal Service database. The number of support vectors means
the average per classifier.

                           Poly    RBF     NN    Common
Total # of sup. vect.      1677    1727   1611    1377
% of common sup. vect.       82      80     85     100

TABLE 5.3. Total number (in ten classifiers) of support vectors for the various
SV machines and percentage of common support vectors.

Note that for this problem all types of SV machines demonstrate
approximately the same performance. This performance is better than the
performance of any other type of learning machine solving the digit
recognition problem by constructing the entire decision rule on the basis of the
U.S. Postal Service database.¹¹

In these experiments one important singularity was observed: different


types of SV machines use approximately the same set of support vectors.
The percentage of common support vectors for three different classifiers
exceeded 80%.
Table 5.3 describes the total number of different support vectors for ten
classifiers of different machines: Polynomial machine (Poly), Radial Basis
Function machine (RBF), and Neural Network machine (NN). It shows also
the number of common support vectors for all machines.

¹¹Note that using the local approximation approach described earlier (which
does not construct the entire decision rule but approximates the decision rule at
any point of interest) one can obtain a better result: 3.3% error rate (L. Bottou
and V. Vapnik, 1992).
The best result for this database, 2.7%, was obtained by P. Simard, Y. LeCun,
and J. Denker without using any learning methods. They suggested a special
method of elastic matching with 7200 templates using a smart concept of distance
(the so-called tangent distance) that takes into account invariance with respect to
small translations, rotations, distortions, and so on (P. Simard, Y. LeCun, and
J. Denker, 1993).

        Poly    RBF    NN
Poly     100     84    94
RBF       87    100    88
NN        91     82   100

TABLE 5.4. Percentage of common (total) support vectors for two SV machines.

Table 5.4 describes the percentage of support vectors of the classifier


given in the columns contained in the support vectors of the classifier given
in the rows.
This fact, if it holds true for a wide class of real-life problems, is very
important.

5.7.3 Some Important Details


In this subsection we give some important details on solving the digit
recognition problem using a polynomial SV machine.

The training data are not linearly separable. The total number of
misclassifications on the training set for linear rules is equal to 340 (≈ 5%
errors). For second degree polynomial classifiers the total number of
misclassifications on the training set is down to four. These four misclassified
examples (with desired labels) are shown in Fig. 5.7. Starting with
polynomials of degree three, the training data are separable.

Table 5.5 describes the results of experiments using decision polynomials
(ten polynomials, one per classifier in one experiment) of various degrees.
The number of support vectors shown in the table is a mean value per
classifier.
FIGURE 5.7. Labeled examples of training errors for the second degree
polynomials (desired labels: 4, 4, 8, 5).

degree of      dimensionality of     support    raw
polynomial     feature space         vectors    error
    1          256                     282      8.9
    2          ≈ 33000                 227      4.7
    3          ≈ 1 × 10⁶               274      4.0
    4          ≈ 1 × 10⁹               321      4.2
    5          ≈ 1 × 10¹²              374      4.3
    6          ≈ 1 × 10¹⁴              377      4.5
    7          ≈ 1 × 10¹⁶              422      4.5

TABLE 5.5. Results of experiments with polynomials of different degrees.

Note that the number of support vectors increases slowly with the degree
of the polynomial. The seventh degree polynomial has only 50% more
support vectors than the third degree polynomial.¹²

The dimensionality of the feature space for a seventh degree polynomial
is, however, 10¹⁰ times larger than the dimensionality of the feature space
for a third degree polynomial classifier. Note that the performance does
not change significantly with increasing dimensionality of the space,
indicating no overfitting problems.

To choose the degree of the best polynomial for one specific classifier we
estimate the VC dimension (using the estimate [R²A²]) for all constructed
polynomials (from degree two up to degree seven) and choose the one with
the smallest estimate of the VC dimension. In this way we found the ten
best classifiers (with different degrees of polynomials) for the ten two-class
problems. These estimates are shown in Fig. 5.8, where for all ten two-class
decision rules the estimated VC dimension is plotted versus the degree of
the polynomial. The question is:

Do the polynomials with the smallest estimate of the VC dimension
provide the best classifier?

To answer this question we constructed Table 5.6, which describes the
performance of the classifiers for each degree of polynomial.

Each row describes one two-class classifier separating one digit (stated
in the first column) from all the other digits.

The remaining columns contain:

deg.: the degree of the polynomial as chosen (from two up to seven)
by the described procedure,

¹²The relatively high number of support vectors for the linear separator is
due to nonseparability: the number 282 includes both support vectors and
misclassified data.
148 5. Constructing Learning Algorithms

FIGURE 5.8. The estimate of the VC dimension of the best element of the
structure (defined on the set of canonical hyperplanes in the corresponding feature
space) versus the degree of polynomial for the various two-class digit recognition
problems (denoted digit versus the rest); upper panel: digits 0-4, lower panel:
digits 5-9.

          Chosen classifier            Number of test errors (by degree)
Digit   deg.    dim.     h_est      1     2     3     4     5     6     7
  0      3     ~10⁶       530      36    14   [11]   11    11    12    17
  1      7     ~10¹⁶      101      17    15    14    11    10    10   [10]
  2      3     ~10⁶       842      53    32   [28]   26    28    27    32
  3      3     ~10⁶      1157      57    25   [22]   22    22    22    23
  4      4     ~10⁹       962      50    32    32   [30]   30    29    33
  5      3     ~10⁶      1090      37    20   [22]   24    24    26    28
  6      4     ~10⁹       626      23    12    12   [15]   17    17    19
  7      5     ~10¹²      530      25    15    12    10   [11]   13    14
  8      4     ~10⁹      1445      71    33    28   [24]   28    32    34
  9      5     ~10¹²     1226      51    18    15    11   [11]   12    15

TABLE 5.6. Experiments on choosing the best degree of polynomial.

dim.: the dimensionality of the corresponding feature space, which is
also the maximum possible VC dimension for linear classifiers in that
space,

h_est: the VC dimension estimate for the chosen polynomial (which
is much smaller than the number of free parameters),

Number of test errors: the number of test errors, using the constructed
polynomial of the corresponding degree; the bracketed entries show the
number of errors for the chosen polynomial.

Thus, Table 5.5 shows that for the SV polynomial machine there are no
overfitting problems with increasing degree of polynomials, while Table 5.6
shows that even in situations where the difference between the best and
the worst solutions is small (for polynomials starting from degree two up
to degree seven), the theory gives a method for approximating the best
solutions (finding the best degree of the polynomial).
Note also that Table 5.6 demonstrates that the problem is essentially
nonlinear. The difference in the number of errors between the best polyno-
mial classifier and the linear classifier can be as much as a factor of four
(for digit 9).

5.8 REMARKS ON SV MACHINES

The quality of any learning machine is characterized by three main com-


ponents:

(i) How universal is the learning machine?


How rich is the set of functions that it can approximate?
(ii) How well can the machine generalize?
How close is the upper bound on the error rate that this machine
achieves (implementing a given set of functions and a given structure
on this set of functions) to the smallest possible?
(iii) How fast does the learning process for this machine converge?
How many operations does it take to find the decision rule, using a
given number of observations?
We address these in turn below.
(i) SV machines implement the sets of functions

$$f(x, \alpha, w) = \sum_{i=1}^{N}\alpha_i K(x, w_i) + b, \tag{5.35}$$

where N is any integer (N < ℓ), the α_i, i = 1, ..., N, are any scalars, and
the w_i, i = 1, ..., N, are any vectors. The kernel K(x, w) can be any symmetric
function satisfying the conditions of Theorem 5.3.

As was demonstrated, the best guaranteed risk for these sets of functions
is achieved when the vectors of weights w₁, ..., w_N are equal to some of the
vectors x from the training data (the support vectors).

Using the set of functions

$$f(x, \alpha, w) = \sum_{\text{support vectors}}\alpha_i K(x, x_i) + b$$

with convolutions of polynomial, radial basis function, or neural network
type, one can approximate a continuous function to any degree of accuracy.
Note that for the SV machine one does not need to construct the archi-
tecture of the machine by choosing a priori the number N (as is necessary
in classical neural networks or in classical radial basis function machines).
Furthermore, by changing only the function K(x, w) in the SV machine
one can change the type of learning machine (the type of approximating
functions) .
(ii) SV machines minimize the upper bound on the error rate for the
structure given on a set of functions in a feature space. For the best solution
it is necessary that the vectors w_i in Eq. (5.35) coincide with some vectors
of the training data (the support vectors¹³). SV machines find the functions
¹³This assertion is a direct corollary of the necessity of the Kühn-Tucker
conditions for solving the quadratic optimization problem described in Section 5.4.
The Kühn-Tucker conditions are necessary and sufficient for the solution of this
problem.

from the set (5.35) that separate the training data and belong to the subset
with the smallest bound of the VC dimension. (In the more general case
they minimize the bound of the risk (5.1).)

(iii) Finally, to find the desired function, the SV machine has to
maximize a non-positive quadratic form in the non-negative quadrant. This
problem is a particular case of a special quadratic programming problem:
maximize a non-positive quadratic form Q(x) with the bounded constraints

$$a_i \le x^i \le b_i, \qquad i = 1, \dots, n,$$

where the x^i, i = 1, ..., n, are the coordinates of the vector x, and a_i, b_i
are given constants. For this specific quadratic programming problem fast
algorithms exist.
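One of the simplest such methods is projected gradient ascent, sketched below under assumed data (an illustration only; practical solvers for SV machines are considerably more refined):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(8, 8))
A = -(M @ M.T)                       # makes Q(x) = 0.5 x'Ax + c'x concave
c = rng.normal(size=8)
lo, hi = np.zeros(8), np.ones(8)     # the box a_i <= x^i <= b_i

x = np.zeros(8)
step = 1.0 / np.linalg.norm(A, 2)    # a safe step size for this quadratic
for _ in range(500):
    grad = A @ x + c                 # gradient of Q at x
    x = np.clip(x + step * grad, lo, hi)   # ascent step, then box projection
print("Q(x) =", 0.5 * x @ A @ x + c @ x)
```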

5.9 SV MACHINES FOR THE REGRESSION ESTIMATION PROBLEM

5.9.1 ε-Insensitive Loss Function

In Section 1.3.2, to describe the problem of approximation of the
supervisor's rule F(y|x) for the case where y is real valued, we considered the
quadratic loss function

$$L(y, f(x, \alpha)) = (y - f(x, \alpha))^2. \tag{5.36}$$

Using the ERM inductive principle and this loss function one obtains a
function that gives the best least squares approximation to the data. Under
conditions where y is the result of measuring a function with normal
additive noise (see Section 1.7.3) (and for the ERM principle) this loss
function provides also the best approximation to the regression.

It is known, however, that if the additive noise is generated by another
law, the optimal approximation to the regression (for the ERM principle) is
given by another loss function (associated with this law).

In 1964 Huber developed a theory that allows us to define the best loss
function for the problem of regression estimation on the basis of the ERM
principle if one has only general information about the model of the noise.
In particular, he showed that if one knows only that the density p(x)
describing the noise is a symmetric convex function possessing second
derivatives, then the best minimax approximation to the regression (the best
approximation for the worst possible p(x)) is provided by the loss function¹⁴

$$L(y, f(x, \alpha)) = |y - f(x, \alpha)|. \tag{5.37}$$

Minimizing the empirical risk with respect to this loss function is called the
least modulus method. It defines the so-called robust regression function.

We consider a slightly more general type of loss function than (5.37), the
so-called linear loss function with insensitive zone:

$$|y - f(x, \alpha)|_\varepsilon = \begin{cases}\varepsilon, & \text{if } |y - f(x, \alpha)| \le \varepsilon,\\ |y - f(x, \alpha)|, & \text{otherwise.}\end{cases} \tag{5.38}$$

This loss function describes the ε-insensitive model: the loss is equal to ε if
the discrepancy between the predicted and the actual value is less than ε,
and is equal to the discrepancy otherwise. The loss function (5.37) is a
particular case of this loss function for ε = 0.
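A direct transcription of the loss function (5.38); the sample residuals used below are arbitrary:

```python
import numpy as np

def eps_insensitive(y, f, eps):
    """Loss (5.38): equal to eps inside the insensitive zone, |y - f| outside."""
    r = np.abs(y - f)
    return np.where(r <= eps, eps, r)

# eps = 0 recovers the robust (least-modulus) loss (5.37):
r = np.array([-0.3, 0.0, 0.8, 2.5])
print(eps_insensitive(r, 0.0, eps=0.5))   # [0.5 0.5 0.8 2.5]
print(eps_insensitive(r, 0.0, eps=0.0))   # |r|
```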

5.9.2 Minimizing the Risk Using a Convex Optimization Procedure

The support vector type approximation to the regression takes place if:

(i) One estimates the regression in the set of linear functions

$$f(x, \alpha) = (w \cdot x) + b.$$

(ii) One defines the problem of regression estimation as the problem
of risk minimization with respect to the ε-insensitive (ε ≥ 0) loss
function (5.38).

(iii) One minimizes the risk using the SRM principle, where the elements of
the structure S_n are defined by the inequality

$$(w \cdot w) \le c_n. \tag{5.39}$$

1. Indeed, suppose we are given the training data

$$(x_1, y_1), \dots, (x_\ell, y_\ell).$$

Then the problem of finding the w and b that minimize the empirical risk

$$R_{\mathrm{emp}}(w, b) = \frac{1}{\ell}\sum_{i=1}^{\ell}\bigl|y_i - (w \cdot x_i) - b\bigr|_\varepsilon$$

¹⁴This is an extreme case where one has minimal information about an
unknown density. Huber also described the intermediate cases where the unknown
density is a mixture of some given density and any density from a described set
of densities, taken in proportion ε and 1 − ε (Huber, 1964).

under constraint (5.39) is equivalent to the problem of finding the pair w, b
that minimizes the quantity defined by the slack variables ξ_i, ξ_i*, i = 1, ..., ℓ,

$$F(\xi, \xi^*) = \sum_{i=1}^{\ell}(\xi_i + \xi_i^*) \tag{5.40}$$

under the constraints

$$y_i - (w \cdot x_i) - b \le \varepsilon + \xi_i, \qquad i = 1, \dots, \ell,$$

$$(w \cdot x_i) + b - y_i \le \varepsilon + \xi_i^*, \qquad i = 1, \dots, \ell,$$

$$\xi_i \ge 0, \qquad \xi_i^* \ge 0, \qquad i = 1, \dots, \ell, \tag{5.41}$$

and constraint (5.39).

As before, to solve the optimization problem with constraints of inequality
type one has to find a saddle point of the Lagrange functional

$$L(w, \xi, \xi^*;\ \alpha^*, \alpha, C^*, \gamma, \gamma^*) = \sum_{i=1}^{\ell}(\xi_i + \xi_i^*) - \sum_{i=1}^{\ell}\alpha_i\bigl[y_i - (w \cdot x_i) - b + \varepsilon + \xi_i\bigr]$$
$$- \sum_{i=1}^{\ell}\alpha_i^*\bigl[(w \cdot x_i) + b - y_i + \varepsilon + \xi_i^*\bigr] - \frac{C^*}{2}\bigl(c_n - (w \cdot w)\bigr) - \sum_{i=1}^{\ell}(\gamma_i\xi_i + \gamma_i^*\xi_i^*). \tag{5.42}$$

(Minimum with respect to the elements w, b, ξ_i, and ξ_i*, and maximum with
respect to the Lagrange multipliers C* ≥ 0, α_i ≥ 0, α_i* ≥ 0, γ_i ≥ 0, and
γ_i* ≥ 0, i = 1, ..., ℓ.)
Minimization with respect to w, b, and ξ_i, ξ_i* implies the following three
conditions:

$$w = \sum_{i=1}^{\ell}\frac{\alpha_i^* - \alpha_i}{C^*}\,x_i, \tag{5.43}$$

$$\sum_{i=1}^{\ell}\alpha_i^* = \sum_{i=1}^{\ell}\alpha_i, \tag{5.44}$$

$$0 \le \alpha_i^* \le 1, \qquad 0 \le \alpha_i \le 1, \qquad i = 1, \dots, \ell. \tag{5.45}$$

Putting (5.43) and (5.44) into (5.42), one obtains that, for the solution of
this optimization problem, one has to find the maximum of the convex
functional

$$W(\alpha, \alpha^*, C^*) = -\varepsilon\sum_{i=1}^{\ell}(\alpha_i^* + \alpha_i) + \sum_{i=1}^{\ell}y_i(\alpha_i^* - \alpha_i) - \frac{1}{2C^*}\sum_{i,j=1}^{\ell}(\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j)(x_i \cdot x_j) - \frac{c_n C^*}{2} \tag{5.46}$$

subject to constraints (5.44), (5.45) and the constraint

$$C^* \ge 0.$$

As in pattern recognition, here only some of the parameters in the
expansion (5.43),

$$\beta_i = \frac{\alpha_i^* - \alpha_i}{C^*}, \qquad i = 1, \dots, \ell,$$

differ from zero. They define the support vectors of the problem.
2. One can reduce the convex optimization problem of finding the vector
w to a quadratic optimization problem if, instead of minimizing the
functional (5.40) subject to constraints (5.41) and (5.39), one minimizes

$$\Phi(w, \xi, \xi^*) = \frac{1}{2}(w \cdot w) + C\sum_{i=1}^{\ell}(\xi_i + \xi_i^*)$$

(with a given value C) subject to constraints (5.41). In this case, to find the
desired vector

$$w = \sum_{i=1}^{\ell}(\alpha_i^* - \alpha_i)\,x_i,$$

one has to find the coefficients α_i*, α_i, i = 1, ..., ℓ, that maximize the
quadratic form

$$W(\alpha, \alpha^*) = -\varepsilon\sum_{i=1}^{\ell}(\alpha_i^* + \alpha_i) + \sum_{i=1}^{\ell}y_i(\alpha_i^* - \alpha_i) - \frac{1}{2}\sum_{i,j=1}^{\ell}(\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j)(x_i \cdot x_j) \tag{5.47}$$

subject to the constraints

$$\sum_{i=1}^{\ell}\alpha_i^* = \sum_{i=1}^{\ell}\alpha_i,$$

$$0 \le \alpha_i^* \le C, \qquad i = 1, \dots, \ell,$$

$$0 \le \alpha_i \le C, \qquad i = 1, \dots, \ell.$$

As in the pattern recognition case, the solutions to these two optimization
problems coincide if C = C*.

One can show that for any i = 1, ..., ℓ the equality

$$\alpha_i^*\,\alpha_i = 0$$

holds true. Therefore, for the particular case where ε = 0 and y_i ∈ {−1, 1},
the optimization problems considered coincide with those described for
pattern recognition in Section 5.5.1.
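The following sketch solves this quadratic optimization problem for an assumed toy regression problem (one-dimensional inputs without a threshold b, an assumed ε and C, and SciPy's SLSQP solver); it is an illustration, not part of the original exposition:

```python
import numpy as np
from scipy.optimize import minimize

# Assumed toy data: a noisy linear function of one variable.
rng = np.random.default_rng(0)
X = np.linspace(-1.0, 1.0, 25).reshape(-1, 1)
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=25)
ell, eps, C = len(y), 0.1, 10.0

G = X @ X.T                                   # the matrix of (x_i . x_j)

def negW(t):                                  # t stacks (alpha*, alpha); eq. (5.47)
    a_star, a = t[:ell], t[ell:]
    d = a_star - a
    return eps * t.sum() - y @ d + 0.5 * d @ G @ d

res = minimize(negW, x0=np.zeros(2 * ell), method="SLSQP",
               bounds=[(0.0, C)] * (2 * ell),
               constraints=[{"type": "eq",
                             "fun": lambda t: t[:ell].sum() - t[ell:].sum()}])
beta = res.x[:ell] - res.x[ell:]              # beta_i = alpha_i* - alpha_i
w = beta @ X                                  # w = sum_i beta_i x_i
print("estimated slope:", w[0])               # should lie near the true slope 2
```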
To derive a bound on the generalization of the SV machine, suppose
that the distribution F(x, y) = F(y|x)F(x) is such that for any fixed w, b
the corresponding distribution of the random variable |y − (w·x) − b|_ε has
a "light tail" (see Section 3.4):

$$\sup_{w, b}\frac{\sqrt[p]{E\,|y - (w \cdot x) - b|_\varepsilon^{\,p}}}{E\,|y - (w \cdot x) - b|_\varepsilon} \le \tau, \qquad p > 2.$$

Then, according to equation (3.30), one can assert that the solution w_ℓ, b_ℓ
of the optimization problem provides a risk (with respect to the loss function
(5.38)) such that with probability at least 1 − η the bound

$$R(w_\ell, b_\ell) \le \varepsilon + \frac{R_{\mathrm{emp}}(w_\ell, b_\ell) - \varepsilon}{\bigl(1 - a(p)\,\tau\sqrt{\mathcal{E}}\bigr)_+}$$

holds true, where

$$a(p) = \sqrt[p]{\frac{1}{2}\left(\frac{p-1}{p-2}\right)^{p-1}}$$

and

$$\mathcal{E} = 4\,\frac{h_n\bigl(\ln\frac{2\ell}{h_n} + 1\bigr) - \ln(\eta/4)}{\ell}.$$

Here h_n is the VC dimension of the set of functions

$$S_n = \bigl\{|y - (w \cdot x) - b|_\varepsilon:\ (w \cdot w) \le c_n\bigr\}.$$

5.9.3 SV Machine with Convolved Inner Product

Constructing the best approximation of the form

$$f(x; v, \beta) = \sum_{i=1}^{N}\beta_i K(x, v_i) + b, \tag{5.48}$$

where the β_i, i = 1, ..., N, are scalars, the v_i, i = 1, ..., N, are vectors, and
K(·,·) is a given function satisfying Mercer's conditions, is analogous to
constructing a linear approximation. It can be conducted both by solving a
convex optimization problem and by solving a quadratic optimization
problem.

1. Using the convex optimization approach one evaluates the coefficients
β_i, i = 1, ..., ℓ, in (5.48) as

$$\beta_i = \frac{\alpha_i^* - \alpha_i}{C^*}, \qquad i = 1, \dots, \ell,$$

where α_i*, α_i, C* are the parameters that maximize the function

$$W = -\varepsilon\sum_{i=1}^{\ell}(\alpha_i^* + \alpha_i) + \sum_{i=1}^{\ell}y_i(\alpha_i^* - \alpha_i) - \frac{1}{2C^*}\sum_{i,j=1}^{\ell}(\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j)K(x_i, x_j) - \frac{c_n C^*}{2}$$

subject to the constraint

$$\sum_{i=1}^{\ell}\alpha_i^* = \sum_{i=1}^{\ell}\alpha_i$$

and to the constraints

$$0 \le \alpha_i^* \le 1, \qquad 0 \le \alpha_i \le 1, \qquad i = 1, \dots, \ell,$$

$$C^* \ge 0.$$

2. Using the quadratic optimization approach one evaluates the coefficients
β_i in (5.48) as

$$\beta_i = \alpha_i^* - \alpha_i, \qquad i = 1, \dots, \ell,$$

where α_i*, α_i are the parameters that maximize the function

$$W = -\varepsilon\sum_{i=1}^{\ell}(\alpha_i^* + \alpha_i) + \sum_{i=1}^{\ell}y_i(\alpha_i^* - \alpha_i) - \frac{1}{2}\sum_{i,j=1}^{\ell}(\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j)K(x_i, x_j)$$

subject to the constraint

$$\sum_{i=1}^{\ell}\alpha_i^* = \sum_{i=1}^{\ell}\alpha_i$$

and to the constraints

$$0 \le \alpha_i^* \le C, \qquad i = 1, \dots, \ell,$$

$$0 \le \alpha_i \le C, \qquad i = 1, \dots, \ell.$$

Choosing different kernels K(·,·) satisfying Mercer's condition, one
constructs different types of learning machines. In particular, the kernel

$$K(x, x_i) = \bigl[(x \cdot x_i) + 1\bigr]^d$$

gives a polynomial machine.

By controlling the two parameters c_n and ε (C and ε in the quadratic
optimization approach) one can control the generalization ability, even for
high degree polynomials in a high-dimensional space.
Informal Reasoning and Comments - 5

5.10 THE ART OF ENGINEERING VERSUS FORMAL INFERENCE

The existence of neural networks can be considered a challenge for theo-


reticians.
From the formal point of view one cannot guarantee that neural networks
generalize well, since according to theory, in order to control generalization
ability one should control two factors: the value of the empirical risk and the
value of the confidence interval. Neural networks, however, cannot control
either of the two.
Indeed, to minimize the empirical risk, a neural network must minimize a
functional that has many local minima. Theory offers no constructive way
to prevent ending up with unacceptable local minima. In order to control
the confidence interval one has first to construct a structure on the set of
functions that the neural network implements and then to control capacity
using this structure. There are no accurate methods to do this for neural
networks.

Therefore from the formal point of view it seems that there should be
no question as to what type of machine should be used for solving real-life
problems.

The reality, however, is not so straightforward. The designers of neural
networks compensate for the mathematical shortcomings with the high art
of engineering. Namely, they incorporate various heuristic algorithms that
make it possible to attain reasonable local minima using a reasonably small
number of calculations.
Moreover, for given problems they create special network architectures
which both have an appropriate capacity and contain "useful" functions for
solving the problem. Using these heuristics, neural networks demonstrate
surprisingly good results.
In Chapter 5, describing the best results for solving the digit recognition
problem using the U.S. Postal Service database by constructing an entire
(not local) decision rule, we gave two figures:

5.1% error rate for the neural network LeNet 1 (designed by Y. LeCun),

4.0% error rate for a polynomial SV machine.
We also mentioned the two best results:
3.3% error rate for the local learning approach, and the record

2.7% error rate for tangent distance matching to templates given by


the training set.

In 1993, responding to the community's need for benchmarking, the U.S.
National Institute of Standards and Technology (NIST) provided a database
of handwritten characters containing 60,000 training images and 10,000 test
images, where characters are described as vectors in 20 × 20 = 400 pixel
space. For this database a special neural network (LeNet 4) was designed.
The following is how the article reporting the benchmark studies (Leon
Bottou et al., 1994) describes the construction of LeNet 4:
"For quite a long time, LeNet 1 was considered the state of
the art. The local learning classifier, the SV classifier, and tan-
gent distance classifier were developed to improve upon LeNet
1 - and they succeeded in that. However, they in turn mo-
tivated a search for an improved neural network architecture.
This search was guided in part by estimates of the capacity of
various learning machines, derived from measurements of the
training and test error (on the large NIST database) as a function
of the number of training examples.¹⁵ We discovered that
more capacity was needed. Through a series of experiments in
architecture, combined with an analysis of the characteristics
of recognition errors, LeNet 4 was crafted."
In these benchmarks, two learning machines that construct entire deci-
sion rules:

¹⁵V. Vapnik, E. Levin, and Y. LeCun (1994), "Measuring the VC dimension of
a learning machine," Neural Computation, 6(5), pp. 851-876.

(i) LeNet 4,
(ii) Polynomial SV machine (polynomial of degree four)
provided the same performance: 1.1% test error 16 .
The local learning approach and tangent distance matching to 60,000
templates also gave the same performance: 1.1 % test error.
Recall that for the small (U.S. Postal Service) database the best result (by
far) was obtained by the tangent distance matching method, which uses a
priori information about the problem (incorporated in the concept of tangent
distance). As the number of examples increased to 60,000, the advantage
of a priori knowledge decreased. The advantage of the local learning
approach also decreased with the increasing number of observations.
LeNet 4, crafted for the NIST database, demonstrated a remarkable
improvement in performance compared to LeNet 1 (which has a 1.7% test
error rate on the NIST database17).
The standard polynomial SV machine also did a good job. We continue
the quotation (Leon Bottou et al., 1994):
"The SV machine has excellent accuracy, which is most remark-
able, because unlike the other high performance classifiers it
does not include knowledge about the geometry of the problem.
In fact this classifier would do just as well if the image pixels
were encrypted, e.g., by a fixed random permutation."
However, the performance achieved by these learning machines is not
the record for the NIST database. Using models of characters (the same
as those used for constructing the tangent distance) and the 60,000 training
examples, H. Drucker, R. Schapire, and P. Simard generated more
than 1,000,000 examples, which they used to train three LeNet 4 neural
networks combined in a special "boosting scheme" (Drucker, Schapire,
and Simard, 1993), which achieved a 0.7% error rate.
Now the SV machines face a challenge: to close this gap (between
1.1% and 0.7%). Probably the use of brute-force SV machines and
60,000 training examples alone will not be sufficient to close the gap. Probably
one has to incorporate some a priori information about the problem at
hand.
There are several ways to do this. The simplest one is to use the same
1,000,000 examples (constructed from the 60,000 NIST prototypes). However,
it is more interesting to find a way of directly incorporating the

16 Unfortunately one cannot compare these results to the results described in
Chapter 5. The digits from the NIST database are "easier" for recognition than
the ones from the U.S. Postal Service database.
17 Note that LeNet 4 has an advantage for the large (NIST) database containing
60,000 training examples. For the small (U.S. Postal Service) database containing
7,000 training examples, the network with smaller capacity, LeNet 1, is better.

invariants that were used for generating the new examples. For example,
for polynomial machines one can incorporate a priori information about
invariance by using the convolution of the inner product in the form
(x^T A x*)^d, where x and x* are input vectors and A is a symmetric
positive definite matrix reflecting the invariants of the models.18
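As an illustration only (not taken from the text), a sketch of such a convolution might look as follows; the matrix A here is a mere placeholder, since constructing a matrix that actually encodes the invariants is the substantive part:

    import numpy as np

    def invariant_poly_kernel(x, x_star, A, d=4):
        # Convolution of the inner product (x^T A x*)^d, where A is assumed
        # to be a symmetric positive definite matrix reflecting the invariants.
        return float(x @ A @ x_star) ** d

    n = 400                 # 20 x 20 pixel space, as in the NIST setup
    A = np.eye(n)           # placeholder: the identity matrix recovers the
                            # ordinary polynomial convolution (x . x*)^d
    x = np.random.rand(n)
    x_star = np.random.rand(n)
    value = invariant_poly_kernel(x, x_star, A)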
One can also incorporate another (geometrical) type of a priori
information by using only the features (monomials) x_i x_j x_k formed by pixels
that are close to each other (this reflects our understanding of the geometry
of the problem: important features are formed by pixels that are connected
to each other, rather than by pixels far from each other). This essentially
reduces (by a factor of millions) the dimensionality of the feature space.
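A hedged sketch of this reduction (the window size "radius" is a hypothetical choice, not taken from the text): enumerate only the monomial index triples whose pixels lie within a small window of each other. The enumeration below is naive and slow, but it shows the idea.

    from itertools import combinations

    def local_monomial_indices(width=20, height=20, radius=2):
        # Keep a triple (i, j, k) only if the three pixels lie within a
        # (2*radius+1)-sized window of each other; only the monomials
        # x_i * x_j * x_k over these triples are used as features.
        coords = [(r, c) for r in range(height) for c in range(width)]
        def close(a, b):
            return (abs(coords[a][0] - coords[b][0]) <= radius and
                    abs(coords[a][1] - coords[b][1]) <= radius)
        return [(i, j, k)
                for i, j, k in combinations(range(width * height), 3)
                if close(i, j) and close(j, k) and close(i, k)]

Of the roughly 10^7 pixel triples in the full 400-pixel space, only a small fraction survive this locality restriction.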
Thus, although the theoretical foundations of Support Vector machines
look more solid than those of Neural Networks, the practical advantages of
the new type of learning machine still need to be proved.18a

5.11 WISDOM OF STATISTICAL MODELS


In this chapter we introduced the Support Vector machines, which realize
the Structural Risk Minimization inductive principle by
(i) Mapping the input vector into a high-dimensional feature space using
a nonlinear transformation.
(ii) Constructing in this space a structure on the set of linear decision
rules according to the increasing norm of weights of canonical hyper-
planes.
(iii) Choosing the best element of the structure and the best function
within this element in order to minimize the bound on error proba-
bility.
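A minimal sketch of this three-step scheme, using the scikit-learn library as a stand-in (the library and the particular parameter grid are assumptions of this illustration, not part of the text):

    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV

    # (i) the polynomial kernel maps inputs into a high-dimensional
    # feature space implicitly; (ii)-(iii) searching over the degree and
    # over C moves along the structure and selects the element, and the
    # function within it, with the best estimated risk.
    search = GridSearchCV(
        SVC(kernel="poly"),
        param_grid={"degree": [2, 3, 4], "C": [0.1, 1.0, 10.0]},
        cv=5,
    )
    # search.fit(X, y) would then produce the chosen decision rule.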

The implementation of this scheme in the algorithms described in this
chapter, however, contained one violation of the SRM principle. To define
the structure on the set of linear functions we used the set of canonical
hyperplanes constructed with respect to the vectors x from the training data.

18 B. Schölkopf considered an intermediate way: he constructed an SV machine,
generated new examples by transforming the SV images (translating them in the
four principal directions), and retrained on the support vectors and the new
examples. For the U.S. Postal Service database, this improves the performance
from 4.0% to 3.2%.
18a In connection with the heuristics incorporated in Neural Networks, let me recall
the following remark by R. Feynman: "We must make it clear from the beginning
that if a thing is not a science, it is not necessarily bad. For example, love is not
a science. So, if something is said not to be a science, it does not mean that there
is something wrong with it; it just means that it is not a science." The Feynman
Lectures on Physics, Addison-Wesley, 3-1, 1975.

According to the SRM principle, the structure has to be defined a priori,
before the training data appear.
The attempt to implement the SRM principle in toto brings us to a new
statement of the learning problem which forms a new type of inference. For
simplicity we consider this model for the pattern recognition problem.
Let the learning machine that implements a set of functions linear in
feature space be given ℓ + k vectors

$$x_1, \ldots, x_{\ell+k} \qquad (5.49)$$

drawn randomly and independently according to some distribution function.
Suppose now that these ℓ + k vectors are randomly divided into two
subsets: the subset

$$x_1, \ldots, x_\ell,$$

for which the string

$$y_1, \ldots, y_\ell, \qquad y \in \{-1, +1\},$$

describing the classification of these vectors is given (the training set), and
the subset

$$x_{\ell+1}, \ldots, x_{\ell+k},$$

for which the classification string should be found by the machine (the test
set). The goal of the machine is to find the rule that gives the string with
the minimal number of errors on the given test set.
In contrast to the model of function estimation considered in this book,
this model looks for the rule that minimizes the number of errors on the
given test set rather than for the rule minimizing the probability of error
on the admissible test set. We call this problem the estimation of the values
of a function at given points. For this problem the SV machines will realize
the SRM principle in toto if one defines the canonical hyperplanes with
respect to all ℓ + k vectors (5.49). (One can consider the data (5.49) as a
priori information; a posteriori information is any information about
separating this set into two subsets.)
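A toy stand-in for this setting (illustration only; a nearest-neighbor labeling replaces the actual SV construction): all ℓ + k vectors are given in advance, and the machine outputs a label string for the k test vectors directly.

    import numpy as np

    def transductive_labels(X_all, y_train, ell):
        # X_all holds all ell + k vectors (5.49); the labels are known
        # only for the first ell of them.  Here each test vector simply
        # receives the label of its nearest training vector.
        X_tr, X_te = X_all[:ell], X_all[ell:]
        d = ((X_te[:, None, :] - X_tr[None, :, :]) ** 2).sum(axis=-1)
        return y_train[d.argmin(axis=1)]    # label string for the test set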
Estimating the values of a function at given points has both a solution
and a method of solution that differ from those based on estimating an
unknown function.
Consider, for example, the five-digit zip-code recognition problem.19 The
existing technology based on estimating functions suggests recognizing the
five digits x_1, ..., x_5 of the zip-code independently: first one uses the rules
constructed during the learning procedure to recognize digit x_1, then one
uses the same rules to recognize digit x_2, and so on.

19 For simplicity we do not consider the segmentation problem. We suppose
that all five digits of a zip-code are segmented.

The technology of estimating the values of a function suggests recognizing
all five digits jointly: the recognition of one digit, say x_1, depends
not only on the training data and the vector x_1, but also on the vectors
x_2, ..., x_5. In this technology one uses rules that are specially adapted to
solving the given specific task. One can prove that this technology gives more
accurate solutions.20
It should be noted that this new view of the learning problem was first
arrived at through attempts to justify the structure defined on the set of
canonical hyperplanes for the SRM principle.

5.12 WHAT CAN ONE LEARN FROM DIGIT RECOGNITION EXPERIMENTS?

Three observations should be discussed in connection with the experiments
described in this chapter:
(i) The structure constructed in the feature space reflects real-life prob-
lems well.
(ii) The quality of decision rules obtained does not strongly depend on
the type of SV machine (polynomial machine, RBF machine, two-
layer NN). It does, however, strongly depend on the accuracy of the
VC dimension (capacity) control.
(iii) Different types of machines use the same elements of training data as
support vectors.

5.12.1 Influence of the Type of Structures and Accuracy of Capacity Control
The classical approach to estimating multidimensional functional depen-
dencies is based on the following belief:
Real-life problems are such that there exists a small number of "strong
features," simple functions of which (say linear combinations) approximate
well the unknown function. Therefore, it is necessary to carefully choose a
low-dimensional feature space and then to use regular statistical techniques
to construct an approximation.

20 Note that the local learning approach described in Section 4.5 can be considered
as an intermediate model between function estimation and the estimation of the
values of a function at points of interest. Recall that for the small (U.S. Postal
Service) database the local learning approach gave significantly better results
(3.3% error rate) than the best result based on the entire function estimation
approach (5.1% obtained by LeNet 1, and 4.0% obtained by the polynomial SV machine).

This approach stresses: be careful at the stage of feature selection (this
is an informal operation) and then use routine statistical techniques.
The new technique is based on a different belief:
Real-life problems are such that there exists a large number of "weak
features" whose "smart" linear combination approximates the unknown
dependency well. Therefore, it is not very important what kind of "weak feature"
one uses; it is more important to form "smart" linear combinations.
This approach stresses: choose any reasonable "weak feature space" (this
is an informal operation), but be careful at the point of making "smart"
linear combinations. From the perspective of SV machines, "smart" linear
combinations correspond to the capacity control method.
This belief in the structure of real-life problems has been expressed many
times both by theoreticians and by experimenters.
In 1940, Church made a claim that is known as the Turing-Church
thesis21:
All (sufficiently complex) computers compute the same family of functions.

21 Note that the thesis does not reflect some proved fact. It reflects the belief
in the existence of some law that is hard to prove (or formulate in exact terms).
In our specific case we discuss an even stronger belief: that linear functions
in various feature spaces associated with different convolutions of the
inner product approximate the same set of functions if they possess the
same capacity.
Church made his claim on the basis of purely theoretical analysis. However,
as soon as computer experiments became widespread, researchers were
unexpectedly faced with a situation that could be described in the spirit of
Church's claim.
In the 1970s and 1980s a considerable amount of experimental
research was conducted on solving various operator equations that form
ill-posed problems, in particular on density estimation. A common
observation was that the choice of the type of regularizer Ω(f) in (4.32)
(determining the type of structure) is not as important as choosing the correct
regularization constant γ(δ) (determining capacity control).
In particular, in density estimation using the Parzen window

$$p(x) = \frac{1}{\ell} \sum_{i=1}^{\ell} \frac{1}{\gamma^n} K\!\left(\frac{x - x_i}{\gamma}\right),$$

a common observation was: if the number of observations is not "very
small," the type of kernel function K(u) in the estimator is not as important
as the value of the constant γ. (Recall that the kernel K(u) in Parzen's
estimator is determined by the functional Ω(f), and γ is determined by the
regularization constant.)
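A small numerical illustration of this observation (the kernels and the bandwidth below are arbitrary choices, not taken from the text):

    import numpy as np

    def parzen(grid, sample, gamma, K):
        # One-dimensional Parzen estimate (1/l) sum_i K((x - x_i)/gamma)/gamma.
        u = (grid[:, None] - sample[None, :]) / gamma
        return K(u).mean(axis=1) / gamma

    gaussian = lambda u: np.exp(-u ** 2 / 2) / np.sqrt(2 * np.pi)
    epanechnikov = lambda u: 0.75 * np.clip(1 - u ** 2, 0, None)

    sample = np.random.normal(size=500)
    grid = np.linspace(-4, 4, 200)
    p_g = parzen(grid, sample, gamma=0.4, K=gaussian)
    p_e = parzen(grid, sample, gamma=0.4, K=epanechnikov)
    # For a common, reasonable gamma the two estimates nearly coincide,
    # while changing gamma (to, say, 0.05 or 2.0) distorts both of them
    # far more than swapping the kernel does.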
The same was observed in the regression estimation problem, where one
tries to use expansions in different series to estimate the regression function:
if the number of observations is not "very small," the type of series used is
not as important as the number of terms in the approximation. All these
observations were made while solving low-dimensional (mostly one-dimensional)
problems.
In the experiments described here we observed the same phenomenon in a
very high-dimensional space.

5.12.2 SRM Principle and the Problem of Feature Construction
The "smart" linear combination of the large number of features used in
the SV machine has an important common structure: the set of support
vectors. We can describe this structure as follows: along with the set of
weak features (weak feature space) there exists a set of complex features
associated with support vectors. Let us denote this space

where

are the support vectors. In the space of complex features U, we constructed


a linear decision rule. Note that in the bound obtained in Theorem 5.2
the expectation of the number of complex features plays the role of the
dimensionality of the problem. Therefore one can describe the difference
between the support vector approach and the classical approach in the
following way:
To perform well, the classical approach requires the human selection
(construction) of a relatively small number of "smart features," while the support
vector approach selects (constructs) a small number of "smart features"
automatically.
Note that the SV machine constructs the optimal hyperplane in the
space Z (the space of weak features), not in the space of complex features.
It is easy, however, to find the coefficients that provide optimality for the
hyperplane in the space U (after the complex features are chosen). Moreover,
one can construct in the space U a new SV machine (using the same training
data). Therefore one can construct a two-layer (or several-layer) SV machine.
In other words, one can perform a multi-stage selection of "smart features."
As we remarked in Section 4.10, however, the problem of feature selection is
quite delicate (recall the difference between constructing sparse algebraic
polynomials and sparse trigonometric polynomials).

5.12.3 Is the Set of Support Vectors a Robust Characteristic of the Data?
In our experiments we observed an important phenomenon: different types
of SV machines optimal in their parameters use almost the same support vectors:
there exists a small subset of the training data (in our experiments, less than
3% to 5% of the data) that for the problem of constructing the best decision rule
is equivalent to the complete set of training data, and this subset of the
training data is almost the same for different types of optimal SV machines
(the polynomial machine with the best degree of polynomials, the RBF machine
with the best parameter γ, and the NN machine with the best parameter b).
The important question is whether this is true for a wide range of real-life
problems. There exists indirect theoretical evidence that this is quite
possible. One can show that if a majority vote scheme based on various
support vector machines does not improve performance, then the percentage
of common support vectors of these machines must be high.
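One can probe this directly; the following hedged sketch (scikit-learn assumed, with arbitrary kernel parameters) measures the overlap of the support vector sets of two different machines trained on the same data:

    from sklearn.svm import SVC

    def support_vector_overlap(X, y):
        poly = SVC(kernel="poly", degree=3, C=10.0).fit(X, y)
        rbf = SVC(kernel="rbf", gamma=0.05, C=10.0).fit(X, y)
        s1, s2 = set(poly.support_), set(rbf.support_)   # indices into X
        return len(s1 & s2) / len(s1 | s2)               # Jaccard overlap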

It is too early to discuss the properties of SV machines: the analysis of
these properties has only just started.22 Therefore I would like to finish these
comments with the following remark.

22 After this book had been completed, C. Burges demonstrated that one can
approximate the obtained decision rule

$$f(x) = \mathrm{sign}\left\{\sum_{i=1}^{N} a_i K(x, x_i) + a_0\right\}$$

by the much simpler decision rule

$$f^*(x) = \mathrm{sign}\left\{\sum_{i=1}^{M} \beta_i K(x, \tau_i) + b_0\right\}, \qquad M \ll N,$$

using the so-called generalized support vectors τ_1, ..., τ_M (a specially
constructed set of vectors).
To obtain approximately the same performance for the digit recognition
problem described in Section 5.7, it was sufficient to use an approximation based on
M = 11 generalized support vectors per classifier instead of the N = 270 (initially
obtained) support vectors per classifier.
This means that for Support Vector machines there exists a regular way to
synthesize decision rules possessing the optimal complexity.

The SV machine is a very suitable object for theoretical analysis. It
unifies various conceptual models:
(i) The SRM model. (That is how the SV machine was initially obtained;
Theorem 5.1.)
(ii) The Data Compression model. (The bound in Theorem 5.2 can be
described in terms of the compression coefficient.)
(iii) A universal model for constructing complex features. (The convolution
of the inner product in Hilbert space can be considered as a
standard way of constructing features.)
(iv) A model of real-life data. (A small set of support vectors might be
sufficient to characterize the whole training set for different machines.)
In a few years it will be clear whether such a unification of models reflects
some intrinsic properties of learning mechanisms, or whether it is the next
cul-de-sac.
Conclusion: What is Important in Learning Theory?

6.1 WHAT IS IMPORTANT IN THE SETTING OF THE PROBLEM?

At the beginning of this book we postulated (without any discussion) that
learning is a problem of function estimation on the basis of empirical data.
To solve this problem we used a classical inductive principle - the ERM
principle. Later, however, we introduced a new principle - the SRM principle.
Nevertheless, the general understanding of the problem remains based
on the statistics of large samples: the goal is to derive the rule that possesses
the lowest risk. The goal of obtaining the "lowest risk" reflects the
philosophy of large sample size statistics: the rule with low risk is good
because if we use this rule for a large test set then, with high probability,
the mean of the losses will be small.
Mostly, however, we face another situation. We are simultaneously given
training data (pairs (x_i, y_i)) and test data (vectors x_j*), and the goal is to
use the learning machine with a set of functions f(x, α), α ∈ Λ, to find the
y_j* for the given test data. In other words, we face the problem of estimating
the values of the unknown function at given points.
Why should the problem of estimating the values of an unknown function
at given points of interest be solved in two stages: first estimating the
function, and second estimating the values of the function using the estimated
function? In this two-stage scheme one actually tries to solve a relatively
simple problem (estimating the values of a function at given points of
interest) by first solving (as an intermediate problem) a much more difficult
one (estimating the function). Recall that estimating a function requires
estimating the values of the function at all (infinitely many) points of the domain
where the function is defined, including the points of interest. Why should
one first estimate the values of the function at all points of the domain in
order to estimate its values at the points of interest?
It can happen that one does not have enough information (training data)
to estimate the function well, but one does have enough data to estimate
the values of the function at a given finite number of points of interest.
Moreover, in human life decision-making problems play an important
role. For learning machines these can be formulated as follows: given the
training data

$$(x_1, y_1), \ldots, (x_\ell, y_\ell),$$

the machine with functions f(x, α), α ∈ Λ, has to find among the test data

$$x_1^*, \ldots, x_k^*$$

the one x* that belongs to the first class with the highest probability (the
decision-making problem in pattern recognition form1). To solve this problem
one does not even need to estimate the values of the function at all the given
points; therefore it can be solved in situations where one does not have
enough information (not enough training data) to estimate the value of a
function at given points.
The key to the solution of these problems is the following observation,
which for simplicity we will describe for the pattern recognition problem.
The learning machine (with a set of indicator functions Q(z, α), α ∈ Λ)
is simultaneously given two strings: the string of ℓ + k vectors x from the
training and test sets, and the string of ℓ values y from the training
set. In pattern classification the goal of the machine is to define the string
containing the k values y for the test data.
For the problem of estimating the values of a function at given points,
the set of functions implemented by the learning machine can be factorized
into a finite set of equivalence classes. (Two indicator functions fall into the
same equivalence class if they coincide on the string x_1, ..., x_{ℓ+k}.) These
equivalence classes can be characterized by their cardinality (how many
functions they contain).
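A toy sketch of this factorization (an illustration, not the construction used in the theory): group the functions by the value string they produce on the combined sample and count each group.

    from collections import Counter

    def equivalence_classes(functions, xs):
        # Two indicator functions are equivalent if they coincide on
        # x_1, ..., x_{l+k}; the count per string is the class cardinality.
        classes = Counter()
        for f in functions:
            classes[tuple(f(x) for x in xs)] += 1
        return classes

    # 100 threshold functions on [0, 1), evaluated on five points, fall
    # into just six equivalence classes of differing cardinalities.
    xs = [0.1, 0.3, 0.5, 0.7, 0.9]
    functions = [lambda x, t=i / 100: int(x > t) for i in range(100)]
    print(equivalence_classes(functions, xs))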
The cardinality of equivalence classes is a concept that makes the theory
of estimating the function at the given points differ from the theory of
estimating the function. This concept (as well as the theory of estimating
the function at given points) was considered in the 1970s (Vapnik, 1979).
For the set of linear functions it was found that the bound on generalization
ability, in the sense of minimizing the number of errors only on the given

1 Or to find the one that with the highest probability possesses the largest value
of y (the decision-making problem in regression form).

[Figure 6.1 appears here: a diagram connecting Examples, Approximating function, and Values of the function at points of interest.]

FIGURE 6.1. Different types of inference. Induction: deriving the function from
the given data. Deduction: deriving the values of the given function for points of
interest. Transduction: deriving the values of the unknown function for points of
interest from the given data. The classical scheme suggests deriving the values of
the unknown function for points of interest in two steps: first using the inductive
step and then using the deductive step, rather than obtaining the direct solution
in one step.

test data (along with the factors considered in this book), depends also
on a new factor: the cardinality of equivalence classes. Therefore, since to
minimize the risk one can minimize the obtained bound over a larger number
of factors, one can find a lower minimum. Now the problem is to construct
a general theory for estimating a function at given points. This brings
us to a new concept of learning.
Classical philosophy usually considers two types of inference: deduction,
describing the movement from the general to the particular, and induction,
describing the movement from the particular to the general.
The model of estimating the values of a function at given points of
interest describes a new concept of inference: moving from particular to
particular. We call this type of inference transductive inference (Fig. 6.1).
Note that this concept of inference appears when one would like to get
the best result from a restricted amount of information. The main idea in
this case was described in Section 1.9 as follows:

If you are limited to a restricted amount of information, do not solve the
particular problem you need by solving a more general problem.

We used this idea for constructing a direct method of estimating
functions. Now we would like to continue developing it: do not solve
the problem of estimating the values of a function at given points by
estimating the entire function, and do not solve a decision-making problem by
estimating the values of a function at given points, etc.
The problem of estimating the values of a function at given points
addresses a question that has been discussed in philosophy for more than
2000 years:
What is the basis of human intelligence: knowledge of laws (rules), or the
culture of direct access to the truth (intuition, ad hoc inference)?
There are several different models embracing statements of the learning
problem, but from the conceptual point of view none can compare to
the problem of estimating the values of a function at given points. This
model can provide the strongest contribution to the 2000-year discussion
about the essence of human reason.

6.2 WHAT IS IMPORTANT IN THE THEORY OF CONSISTENCY OF LEARNING PROCESSES?

This part of the theory is well developed. It answers almost all questions
toward understanding the conceptual model of learning processes realizing
the ERM principle. The only remaining open question concerns the necessary
and sufficient conditions for a fast rate of convergence. In Chapter 2 we
considered the sufficient condition described using the Annealed entropy,

$$\lim_{\ell\to\infty} \frac{H_{\mathrm{ann}}^{\Lambda}(\ell)}{\ell} = 0,$$

for the pattern recognition case. It also can be shown that the conditions

$$\lim_{\ell\to\infty} \frac{H_{\mathrm{ann}}^{\Lambda}(\varepsilon;\ell)}{\ell} = 0, \qquad \forall \varepsilon > 0,$$

in terms of the Annealed entropy $H_{\mathrm{ann}}^{\Lambda}(\varepsilon;\ell) = \ln E\, N^{\Lambda}(\varepsilon; z_1, \ldots, z_\ell)$,
define the sufficient conditions for fast convergence in the case of regression
estimation.
The question remains:
Do these equalities form the necessary conditions as well? If not, what
are the necessary and sufficient conditions?
Why is it important to find the concept that describes the necessary
and sufficient conditions for a fast rate of convergence?
As was demonstrated, this concept plays a key role in the theory of
bounds. In our constructions we used the Annealed entropy for finding both

(nonconstructive) distribution-independent bounds and (nonconstructive)
distribution-dependent bounds. On the basis of the Annealed entropy, we
constructed both the Growth function and the Generalized Growth function.
Proving the necessity of the Annealed entropy for a fast rate of convergence
would amount to showing that this is the best possible construction
for deriving bounds on the generalization ability of learning machines. If
the necessary and sufficient conditions are described by another function,
these constructions would have to be reconsidered.

6.3 WHAT IS IMPORTANT IN THE THEORY OF BOUNDS?

The theory of bounds contains two parts: the theory of nonconstructive
bounds, which are obtained on the basis of the concepts of the Growth
function and the Generalized Growth function, and the theory of constructive
bounds, where the main problem is estimating these functions using
some constructive concept.
The main problem in the theory of bounds is in the second part. One
has to introduce some constructive concept by means of which one can
estimate the Growth function or the Generalized Growth function. In 1968
we introduced the concept of the VC dimension and found the bound for
the Growth function (Vapnik and Chervonenkis, 1968, 1971). We proved
that the value $N^{\Lambda}(z_1, \ldots, z_\ell)$ is either $2^{\ell}$ or polynomially bounded,2

$$N^{\Lambda}(z_1, \ldots, z_\ell) \le \left(\frac{e\ell}{h}\right)^{h}.$$

Note that the polynomial on the right-hand side depends on one free
parameter h. This bound (which depends on one capacity parameter) cannot
be improved (there exist examples where equality is achieved).
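A quick numerical sense of how strong this polynomial bound is (the values of h and ℓ below are arbitrary illustrations):

    import math

    def growth_bound(ell, h):
        # Right-hand side (e * ell / h)^h of the bound, for ell > h.
        return (math.e * ell / h) ** h

    # With h = 10 and ell = 1000 the bound is about 2.2 * 10^24,
    # astronomically smaller than the 2^1000 (about 10^301) that a set
    # without finite VC dimension could realize.
    print(growth_bound(1000, 10))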
The challenge is to find refined concepts, containing more than one
parameter (say, two parameters), that describe some properties of capacity
(and of the set of distribution functions F(z) ∈ 𝒫), using which one can
obtain better bounds.3
This is a very important question, and the answer would have an immediate
impact on the bounds on the generalization ability of learning machines.

2 In 1972 this bound was also published by Sauer (Sauer, 1972).
3 Recall the MDL bound: even such a refined concept as the coefficient of
compression provides a worse bound than one based on three (actually rough)
concepts: the value of the empirical risk, the number of observations, and
the number of functions in the set.

6.4 WHAT IS IMPORTANT IN THE THEORY FOR CONTROLLING THE GENERALIZATION ABILITY OF LEARNING MACHINES?

The most important problem in the theory for controlling the generalization
ability of learning machines is finding a new inductive principle for
small sample sizes. In the mid-1970s, several techniques were suggested to
improve the classical methods of function estimation. Among these are the
various rules for choosing the degree of the polynomial in the polynomial
regression problem, various regularization techniques for multidimensional
regression estimation, the regularization method for solving ill-posed
problems, etc. All these techniques are based on the same idea: to provide the
set of functions with a structure and then minimize the risk over the
elements of the structure. In the 1970s the crucial role of capacity control
was discovered. We call this general idea SRM to stress the importance of
minimizing the risk within an element of the structure.
In SRM, one tries to control simultaneously two parameters: the value
of the empirical risk and the capacity of the element of the structure.
In the 1970s the MDL principle was proposed. Using this principle, one
can control the coefficient of compression.
The most important question is:
Does there exist a new inductive principle for estimating dependencies from
small sample sizes?
In studies of inductive principles it is crucial to find new concepts which
affect the bounds of the risk, and which therefore can be used in mini-
mizing these bounds. To use an additional concept, we introduced a new
statement of the learning problem: the local risk minimization problem.
In this statement, in the framework of the SRM principle, one can control
three parameters: empirical risk, capacity, and locality.
In the problem of estimating the values of a function at given points
one can use an additional concept: the cardinality of equivalence classes. This
aids in controlling the generalization ability: by minimizing the bound over
four parameters, one can get smaller minima than by minimizing the bound
over fewer parameters. The problem is to find a new concept which can
affect the upper bound of the risk. This will immediately lead to a new
learning procedure, and even to a new type of reasoning (as in the case of
transductive inference).
Finally, it is important to find new structures on the set of functions. It
is interesting to find structures with elements containing functions which
are described by large numbers of parameters, but nevertheless have low
VC dimension. We have found only one such structure and this brought us
to SV machines. New structures of this kind will probably result in new
types of learning machines.

6.5 WHAT IS IMPORTANT IN THE THEORY FOR CONSTRUCTING LEARNING ALGORITHMS?

The algorithms for learning should be well controlled. This means that one
has to control the two main parameters responsible for generalization ability:
the value of the empirical risk and the VC dimension of the smallest element
of the structure that contains the chosen function.
The SV technique can be considered an effective tool for controlling
these two parameters if the structures are defined on sets of linear
functions in some high-dimensional feature space. This technique is not
restricted to sets of indicator functions (for solving pattern recognition
problems). At the end of Chapter 5 we described the generalization
of the SV method to solving regression problems. In the framework of
this generalization, using a special convolution function, one can construct
high-dimensional spline functions belonging to the subset of splines with
a chosen VC dimension. Using different convolution functions for the inner
product one can also construct different types of functions nonlinear in the
input space.4
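A hedged sketch of this regression generalization (scikit-learn assumed; the kernel and the values of C and epsilon are arbitrary illustrations):

    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X).ravel() + 0.1 * rng.normal(size=200)

    # epsilon sets the insensitivity zone and C controls the capacity;
    # the fitted function is a kernel expansion over the support vectors,
    # so its complexity is governed by their number rather than by the
    # dimensionality of the input space.
    model = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)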
Moreover, the SV technique goes beyond the framework of learning the-
ory. It admits a general point of view as a new type of parametrization of
sets of functions.
The point is that in solving function estimation problems both in
computational statistics (say, pattern recognition, regression, density
estimation) and in computational mathematics (say, obtaining approximations
to the solutions of multidimensional (operator) equations of different types),
the first step is describing (parametrizing) a set of functions in which one
looks for a solution.
In the first half of this century the main idea of parametrization (after
the Weierstrass theorem) was series expansion. However, even in the
one-dimensional case one sometimes needs a few dozen terms for an accurate
function approximation, and for many problems the accuracy of existing
computers can be insufficient to treat such a series.
Therefore, in the middle of the 1950s a new type of function parametrization
was suggested: the so-called spline functions (piecewise polynomial
functions). This type of parametrization allowed us to get accurate solutions

4 Note once more that advanced estimation techniques developed in statistics
in the 1980s, such as Projection Pursuit Regression, MARS, Hinging Hyperplanes,
etc., in fact consider special approximations in sets of functions

$$y = \sum_{j=1}^{N} \alpha_j K\bigl((x \cdot w_j)\bigr) + b,$$

where $\alpha_1, \ldots, \alpha_N$ are scalars and $w_1, \ldots, w_N$ are vectors.

for most one-dimensional (and sometimes two-dimensional) problems. However,
it often fails in, say, the four-dimensional case.
The SV parametrization of functions can be used in high-dimensional
spaces (recall that for this parametrization the complexity of the approximation
depends on the number of support vectors rather than on the dimensionality
of the space). By controlling the "capacity" of the set of functions one
can control the "smoothness" properties of the approximation.
This type of parametrization should be taken into account whenever
one considers multidimensional problems of function estimation (function
approximation).
Currently we only have experience in using the SV technique for solving
pattern recognition problems.
However, theoretically there is no obstacle to obtaining, using this technique,
the same high level of accuracy in solving dependency estimation problems
that arise in different areas of statistics (such as regression estimation,
density estimation, conditional density estimation) and computational
mathematics (such as solving some multidimensional linear operator equations).
One can consider the SV technique as a new type of parametrization of
multidimensional functions that in many cases allows us to overcome the
"curse of dimensionality."

6.6 WHAT IS THE MOST IMPORTANT?

The learning problem belongs to the problems of natural science: there
exists a phenomenon for which one has to construct a model. In the attempts
to construct this model, theoreticians can choose one of two different
positions, depending on which part of Hegel's formula (describing the general
philosophy of nature) they prefer:

Whatever is real, is rational and whatever is rational, is real.5

The interpretation of the first part of this formula can be as follows.
Somebody (say, an experimenter) knows a model that describes reality,
and the problem of the theoretician is to prove that this model is rational
(he should define, as well, what it means to be rational). For example, if
somebody believes, and can convince the theoretician, that neural networks
5 In Hegel's original assertion, the meaning of the words "real" and "rational"
does not coincide with the common meaning of these words. Nevertheless,
according to a remark of B. Russell, the identification of the real and the rational
in the common sense leads to the belief that "whatever is, is right." Russell did not
accept this idea (see B. Russell, A History of Western Philosophy). However, we
do interpret Hegel's formula as: "whatever exists, is right and whatever is right,
exists."

are good models of real brains, then the goal of the theoretician is to prove
that this model is rational.
Suppose that the theoretician considers the model to be "rational" if it
possesses some remarkable asymptotic properties. In this case, the theo-
retician succeeds if he or she proves (as has been done) that the learning
process in neural networks asymptotically converges to local extrema and
that a sufficiently large neural network can approximate well any smooth
function. The conceptual part of such a theory will be complete if one can
prove that the achieved local extremum is close to the global one.
The second position is a heavier burden for the theoretician: the theo-
retician has to define what a rational model is, then has to find this model,
and finally must convince the experimenters to prove that this model is
real (describes reality).
Probably, a rational model is one that not only has remarkable asymp-
totic properties but also possesses some remarkable properties in dealing
with a given finite number of observations. 6 In this case, the small sample
size philosophy is a useful tool for constructing rational models.
The rational models can be so unusual that one needs to overcome prej-
udices of common sense in order to find them. For example, we saw that
the generalization ability of learning machines depends on the VC dimen-
sion of the set of functions, rather than on the number of parameters that
define the functions within a given set. Therefore, one can construct high-
degree polynomials in high-dimensional input space with good generaliza-
tion ability. Without the theory for controlling the generalization ability
this opportunity would not be clear. Now the experimenters have to an-
swer the question: Does generalization, as performed by real brains, include
mechanisms similar to the technology of support vectors?7
That is why the role of theory in studies of learning processes can be
more constructive than in many other branches of natural science.
This, however, depends on the choice of the general position in studies
of the learning phenomenon. The choice of position reflects one's belief
about which, in this specific area of natural science, is the main discoverer of
truth: experiment or theory.

6 Maybe it has to possess additional properties. Which?
7 The idea that generalization, the determination of the importance of the
observed facts, and the storage of the important facts are different aspects of the
same brain mechanism is very attractive.
References

Remarks on References
One of the greatest mathematicians of the century, A. N. Kolmogorov, once
noted that an important difference between mathematical sciences and
historical sciences is that facts once found in mathematics hold forever, while
the facts found in history are reconsidered by every generation of historians.
In statistical learning theory, as in mathematics, the importance of the results
obtained depends on new facts about the learning phenomenon, whatever they
reveal, rather than on a new description of already known facts. Therefore, I
tried to refer to the works that reflect the following sequence of the main
events in developing the statistical learning theory described in this book:
1958-1962. Constructing the Perceptron.
1962-1964. Proving the first theorem on learning processes.
1958-1963. Discovery of nonparametric statistics.
1962-1963. Discovery of the methods for solving ill-posed prob-
lems.
1960-1965. Discovery of the Algorithmic Complexity concept and
its relation to inductive inference.
1968-1971. Discovery of the Law of Large Numbers for the space
of indicator functions and its relation to the pattern
recognition problem.

1965-1973. Creation of a general asymptotic learning theory for the
Stochastic Approximation inductive inference.

1965-1972. Creation of a general nonasymptotic theory of pattern
recognition for the ERM principle.

1974. Formulation of the SRM principle.

1978. Formulation of the MDL principle.

1974-1979. Creation of the general nonasymptotic learning theory
based on both the ERM and the SRM principles.

1981. Generalization of the Law of Large Numbers for the
space of real functions.

1986. Construction of NN based on the back-propagation
method.

1989. Discovery of the necessary and sufficient conditions for
consistency of the ERM principle and the ML method.

1989-1993. Discovery of the universality of function approximation
by a sequence of superpositions of sigmoid functions.

1992-1995. Constructing the SV machines.

REFERENCES

M. A. Aizerman, E. M. Braverman, and L. I. Rozonoer (1964), "Theoretical foundations of the potential function method in pattern recognition learning," Automation and Remote Control, 25, pp. 821-837.

M. A. Aizerman, E. M. Braverman, and L. I. Rozonoer (1965), "The Robbins-Monroe process and the method of potential functions," Automation and Remote Control, 28, pp. 1882-1885.

M. A. Aizerman, E. M. Braverman, and L. I. Rozonoer (1970), Method of Potential Functions in the Theory of Pattern Recognition (in Russian), Nauka, Moscow, p. 384.

H. Akaike (1970), "Statistical predictor identification," Annals of the Institute of Statistical Mathematics, pp. 202-217.

S. Amari (1967), "A theory of adaptive pattern classifiers," IEEE Trans. Elect. Comp., EC-16, pp. 299-307.

T. W. Anderson and R. R. Bahadur (1966), "Classification into two multivariate normal distributions with different covariance matrices," The Annals of Mathematical Statistics, 133, (2).

A. R. Barron (1993), "Universal approximation bounds for superpositions of a sigmoid function," IEEE Transactions on Information Theory, 39, (3), pp. 930-945.

J. Berger (1985), Statistical Decision Theory and Bayesian Analysis, Springer.

B. Boser, I. Guyon, and V. N. Vapnik (1992), "A training algorithm for optimal margin classifiers," Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, ACM, pp. 144-152.

L. Bottou, C. Cortes, J. Denker, H. Drucker, I. Guyon, L. Jackel, Y. LeCun, U. Müller, E. Säckinger, P. Simard, and V. Vapnik (1994), "Comparison of classifier methods: A case study in handwritten digit recognition," Proceedings 12th IAPR International Conference on Pattern Recognition, 2, IEEE Computer Society Press, Los Alamitos, California, pp. 77-83.

L. Bottou and V. Vapnik (1992), "Local learning algorithms," Neural Computation, 4, (6), pp. 888-901.

L. Breiman (1993), "Hinging hyperplanes for regression, classification and function approximation," IEEE Transactions on Information Theory, 39, (3), pp. 999-1013.

L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone (1984), Classification and Regression Trees, Wadsworth, Belmont, CA.

A. Bryson, W. Denham, and S. Dreyfuss (1963), "Optimal programming problems with inequality constraints. I: Necessary conditions for extremal solutions," AIAA Journal, 1, pp. 25-44.

F. P. Cantelli (1933), "Sulla determinazione empirica della leggi di probabilità," Giornale dell'Istituto Italiano degli Attuari, (4).

G. J. Chaitin (1966), "On the length of programs for computing finite binary sequences," J. Assoc. Comput. Mach., 13, pp. 547-569.

N. N. Chentsov (1963), "Evaluation of an unknown distribution density from observations," Soviet Math., 4, pp. 1559-1562.

C. Cortes and V. Vapnik (1995), "Support vector networks," Machine Learning, 20, pp. 1-25.

R. Courant and D. Hilbert (1953), Methods of Mathematical Physics, J. Wiley, New York.

G. Cybenko (1989), "Approximation by superpositions of sigmoidal function," Mathematics of Control, Signals, and Systems, 2, pp. 303-314.

L. Devroye (1988), "Automatic pattern recognition: A study of the probability of error," IEEE Transactions on Pattern Analysis and Machine Intelligence, 10, (4), pp. 530-543.

L. Devroye and L. Györfi (1985), Nonparametric Density Estimation: The L1 View, J. Wiley, New York.

H. Drucker, R. Schapire, and P. Simard (1993), "Boosting performance in neural networks," International Journal of Pattern Recognition and Artificial Intelligence, 7, (4), pp. 705-719.

R. M. Dudley (1978), "Central limit theorems for empirical measures," Ann. Prob., 6, (6), pp. 899-929.

R. M. Dudley (1984), Course on Empirical Processes, Lecture Notes in Mathematics, Vol. 1097, pp. 2-142, Springer, New York.

R. M. Dudley (1987), "Universal Donsker classes and metric entropy," Ann. Prob., 15, (4), pp. 1306-1326.

R. A. Fisher (1952), Contributions to Mathematical Statistics, J. Wiley, New York.

J. H. Friedman and W. Stuetzle (1981), "Projection pursuit regression," JASA, 76, pp. 817-823.

F. Girosi and G. Anzellotti (1993), "Rates of convergence for radial basis functions and neural networks," Artificial Neural Networks for Speech and Vision, Chapman & Hall, pp. 97-113.

V. I. Glivenko (1933), "Sulla determinazione empirica di probabilità," Giornale dell'Istituto Italiano degli Attuari, (4).

U. Grenander (1981), Abstract Inference, J. Wiley, New York.

A. E. Hoerl and R. W. Kennard (1970), "Ridge regression: Biased estimation for non-orthogonal problems," Technometrics, 12, pp. 55-67.

P. Huber (1964), "Robust estimation of location parameter," Annals of Mathematical Statistics, 35, (1).

V. V. Ivanov (1962), "On linear problems which are not well-posed," Soviet Math. Dokl., 3, (4), pp. 981-983.

V. V. Ivanov (1976), The Theory of Approximate Methods and Their Application to the Numerical Solution of Singular Integral Equations, Nordhoff International, Leyden.

L. K. Jones (1992), "A simple lemma on greedy approximation in Hilbert space and convergence rates for Projection Pursuit Regression," The Annals of Statistics, 20, (1), pp. 608-613.

M. Karpinski and T. Werther (1989), "VC dimension and uniform learnability of sparse polynomials and rational functions," SIAM J. Computing, to appear. (Preprint 8537-CS, Bonn University, 1989.)

A. N. Kolmogorov (1933), "Sulla determinazione empirica di una legge di distribuzione," Giornale dell'Istituto Italiano degli Attuari, (4).

A. N. Kolmogorov (1933), Grundbegriffe der Wahrscheinlichkeitsrechnung, Springer. (English translation: A. N. Kolmogorov (1956), Foundations of the Theory of Probability, Chelsea.)

A. N. Kolmogorov (1965), "Three approaches to the quantitative definition of information," Problems of Inform. Transmission, 1, (1), pp. 1-7.

Y. LeCun (1986), "Learning processes in an asymmetric threshold network," Disordered Systems and Biological Organizations, Les Houches, France, Springer, pp. 233-240.

Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. J. Jackel (1990), "Handwritten digit recognition with a back-propagation network," Advances in Neural Information Processing Systems, 2, Morgan Kaufmann, pp. 396-404.

G. G. Lorentz (1966), Approximation of Functions, Holt-Rinehart-Winston, New York.

H. N. Mhaskar (1993), "Approximation properties of a multi-layer feedforward artificial neural network," Advances in Computational Mathematics, 1, pp. 61-80.

C. A. Micchelli (1986), "Interpolation of scattered data: distance matrices and conditionally positive definite functions," Constructive Approximation, 2, pp. 11-22.

M. L. Miller (1990), Subset Selection in Regression, Chapman and Hall, London.

J. J. Moré and G. Toraldo (1991), "On the solution of large quadratic programming problems with bound constraints," SIAM Optimization, 1, (1), pp. 93-113.

A. B. J. Novikoff (1962), "On convergence proofs on perceptrons," Proceedings of the Symposium on the Mathematical Theory of Automata, Polytechnic Institute of Brooklyn, Vol. XII, pp. 615-622.

J. M. Parrondo and C. Van den Broeck (1993), "Vapnik-Chervonenkis bounds for generalization," J. Phys. A, 26, pp. 2211-2223.

E. Parzen (1962), "On estimation of probability function and mode," Annals of Mathematical Statistics, 33, (3).

D. Z. Phillips (1962), "A technique for numerical solution of certain integral equations of the first kind," J. Assoc. Comput. Mach., 9, pp. 84-96.

T. Poggio and F. Girosi (1990), "Networks for approximation and learning," Proceedings of the IEEE, Vol. 78, (9).

D. Pollard (1984), Convergence of Stochastic Processes, Springer, New York.

K. Popper (1968), The Logic of Scientific Discovery, 2nd ed., Harper Torch Books, New York.

M. J. D. Powell (1992), "The theory of radial basis function approximation in 1990," in W. A. Light, ed., Advances in Numerical Analysis Volume II: Wavelets, Subdivision Algorithms and Radial Basis Functions, Oxford University Press, pp. 105-210.

J. Rissanen (1978), "Modeling by shortest data description," Automatica, 14, pp. 465-471.

J. Rissanen (1989), Stochastic Complexity and Statistical Inquiry, World Scientific.

H. Robbins and H. Monroe (1951), "A stochastic approximation method," Annals of Mathematical Statistics, 22, pp. 400-407.

F. Rosenblatt (1962), Principles of Neurodynamics: Perceptron and Theory of Brain Mechanisms, Spartan Books, Washington, D.C.

M. Rosenblatt (1956), "Remarks on some nonparametric estimates of a density function," Annals of Mathematical Statistics, 27, pp. 642-669.

D. E. Rumelhart, G. E. Hinton, and R. J. Williams (1986), "Learning internal representations by error propagation," Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. I, Bradford Books, Cambridge, MA, pp. 318-362.

B. Russell (1989), A History of Western Philosophy, Unwin, London.

N. Sauer (1972), "On the density of families of sets," J. Combinatorial Theory (A), 13, pp. 145-147.

G. Schwartz (1978), "Estimating the dimension of a model," Annals of Statistics, 6, pp. 461-464.

P. Y. Simard, Y. LeCun, and J. Denker (1993), "Efficient pattern recognition using a new transformation distance," Neural Information Processing Systems, 5, pp. 50-58.

N. V. Smirnov (1970), Theory of Probability and Mathematical Statistics (Selected Works), Nauka, Moscow.

R. J. Solomonoff (1960), "A preliminary report on general theory of inductive inference," Technical Report ZTB-138, Zator Company, Cambridge, MA.

R. J. Solomonoff (1964), "A formal theory of inductive inference," Parts 1 and 2, Inform. Contr., 7, pp. 1-22, pp. 224-254.

R. A. Tapia and J. R. Thompson (1978), Nonparametric Probability Density Estimation, The Johns Hopkins University Press, Baltimore.

A. N. Tikhonov (1963), "On solving ill-posed problem and method of regularization," Doklady Akademii Nauk USSR, 153, pp. 501-504.

A. N. Tikhonov and V. Y. Arsenin (1977), Solution of Ill-Posed Problems, W. H. Winston, Washington, DC.

Ya. Z. Tsypkin (1971), Adaptation and Learning in Automatic Systems, Academic, New York.

Ya. Z. Tsypkin (1973), Foundation of the Theory of Learning Systems, Academic, New York.

V. N. Vapnik (1979), Estimation of Dependencies Based on Empirical Data (in Russian), Nauka, Moscow. (English translation: Vladimir Vapnik (1982), Estimation of Dependences Based on Empirical Data, Springer, New York.)

V. N. Vapnik (1988), "Inductive principles of statistics and learning theory," Yearbook of the Academy of Sciences of the USSR on Recognition, Classification, and Forecasting, 1, Nauka, Moscow. (English translation: (1995), "Inductive principles of statistics and learning theory," in Smolensky, Moser, and Rumelhart, eds., Mathematical Perspectives on Neural Networks, Lawrence Erlbaum Associates, Inc.)

V. N. Vapnik (1993), "Three fundamental concepts of the capacity of learning machines," Physica A, 200, pp. 538-544.

V. N. Vapnik (1996), Statistical Learning Theory, J. Wiley, New York.

V. N. Vapnik and L. Bottou (1993), "Local algorithms for pattern recognition and dependencies estimation," Neural Computation, 5, (6), pp. 893-908.

V. N. Vapnik and A. Ja. Chervonenkis (1968), "On the uniform convergence of relative frequencies of events to their probabilities," Doklady Akademii Nauk USSR, 181, (4). (English translation: Sov. Math. Dokl.)

V. N. Vapnik and A. Ja. Chervonenkis (1971), "On the uniform convergence of relative frequencies of events to their probabilities," Theory Probab. Appl., 16, pp. 264-280.

V. N. Vapnik and A. Ja. Chervonenkis (1974), Theory of Pattern Recognition (in Russian), Nauka, Moscow. (German translation: W. N. Wapnik and A. Ja. Tschervonenkis (1979), Theorie der Zeichenerkennung, Akademie, Berlin.)

V. N. Vapnik and A. Ja. Chervonenkis (1981), "Necessary and sufficient conditions for the uniform convergence of the means to their expectations," Theory Probab. Appl., 26, pp. 532-553.

V. N. Vapnik and A. Ja. Chervonenkis (1989), "The necessary and sufficient conditions for consistency of the method of empirical risk minimization" (in Russian), Yearbook of the Academy of Sciences of the USSR on Recognition, Classification, and Forecasting, 2, Nauka, Moscow, pp. 217-249. (English translation: (1991), "The necessary and sufficient conditions for consistency of the method of empirical risk minimization," Pattern Recogn. and Image Analysis, 1, (3), pp. 284-305.)

V. N. Vapnik and A. R. Stefanyuk (1978), "Nonparametric methods for estimating probability densities," Autom. and Remote Contr., (8).

R. S. Wenocur and R. M. Dudley (1981), "Some special Vapnik-Chervonenkis classes," Discrete Math., 33, pp. 313-318.

Index
Index

Entries are here indicated for the first occurrence of the given term.

a posteriori information 161
admissible structure 91
algorithmic complexity 10
annealed entropy 53
approximation rate 94
artificial intelligence 13
axioms of probability theory 56

back propagation method 122
basic problem of probability theory 58
basic problem of statistics 59
Bayesian approach 115
Bayesian inference 32
boosting recognition scheme 159
bound on the distance to the smallest risk 73
bound on the value of achieved risk 73
bounds on generalization ability of a learning machine 72

canonical separating hyperplanes 128
capacity control problem 112
cause-effect relation 9
choosing the best sparse algebraic polynomial 113
choosing the degree of polynomial 112
classification error 17
code-book 102
complete (Popper's) nonfalsifiability 48
compression coefficient 103
consistency of inference 34
constructive distribution-independent bound on the rate of convergence 65
convolution of inner product 136
criterion of nonfalsifiability 45

decision making problem 168
decision trees 7
deductive inference 46
density estimation problem (Fisher-Wald setting) 18
discrepancy 16
discriminant analysis 22
discriminant function 23
distribution-dependent bound on the rate of convergence 65
distribution-independent bound on the rate of convergence 65

empirical distribution function 26
empirical processes 38
empirical risk functional 18
empirical risk minimization inductive principle 18
entropy of the set of functions 40
entropy on the set of indicator functions 40
ε-insensitivity 152
equivalence classes 168
estimation of the values of a function at the given points 161
expert systems 7

feature selection problem 114
function approximation 94
function estimation model 15

generalized Glivenko-Cantelli problem 62
generalized growth function 81
generator of random vectors 15
Glivenko-Cantelli problem 62
growth function 53

Hamming distance 102
handwritten digit recognition 142
hard threshold vicinity function 99
hidden Markov models 7
hidden units 97

ill-posed problems 9
independent trials 58
inductive inference 46
inner product in Hilbert space 136

kernel function 25
Kolmogorov-Smirnov distribution 83
Kullback-Leibler distance 30
Kuhn-Tucker conditions 130

Lagrangian 130
law of large numbers 39
law of large numbers in the functional space 39
law of large numbers in vector space 39
learning machine 15
learning matrices 7
least-squares method 19
linear discriminant function 29
linearly nonseparable case 131
local approximation 100
local risk minimization 99
locality parameter 99
loss function 16
low bounds on the rate of uniform convergence 81

madaline 7
main principle for small sample sizes problems 28
maximum likelihood method 22
McCulloch-Pitts neuron model 2
measurements with the additive noise 23
metric ε-entropy 42
minimum description length principle 100
mixture of normal densities 24

National Institute of Standards and Technology (NIST) digit database 158
neural networks 122
non-trivially consistent inference 36
nonparametric density estimation 25
normal discriminant function 29

one-sided empirical process 38
optimal separating hyperplane 127
overfitting phenomenon 14

parametric methods of density estimation 22
partial nonfalsifiability 49
Parzen's windows 25
pattern recognition problem 17
perceptron 1
perceptron's stopping rule 6
polynomial approximation of regression 111
polynomial machine 139
potential functions 136
potential nonfalsifiability 51
probability measure 57
probably approximately correct (PAC) model 13
problem of demarcation 45
pseudo-dimension 86

quadratic programming problem 129
quantization of parameters 106
quasi-solution 108

radial basis function machine 140
random entropy 40
random string 10
randomness concept 10
regression estimation problem 17
regression function 17
regularization theory 9
regularized functional 9
rigorous (distribution-dependent) bounds 81
risk functional 16
risk minimization from empirical data problem 18
robust regression 24
Rosenblatt's algorithm 5

set of indicators 69
set of unbounded functions 73
σ-algebra 56
sigmoid function 121
small sample size 89
smoothing kernel 98
smoothness of functions 96
soft threshold vicinity function 99
stochastic approximation stopping rule 32
stochastic ill-posed problems 109
strong mode estimating a probability measure 59
structural risk minimization principle 90
structure 90
structure of growth function 75
supervisor 15
support vector machines 133
support vectors 130

tails of distribution 74
tangent distance 145
training set 16
transductive inference 169
Turing-Church thesis 163
two-layer neural networks machine 141
two-sided empirical process 38

U.S. Postal Service digit database 142
uniform one-sided convergence 37
uniform two-sided convergence 37

VC dimension of a set of indicator functions 75
VC dimension of a set of real functions 77
VC entropy 42
VC subgraph 86

weak mode estimating a probability measure 59
weight decay procedure 98