INTRODUCTION TO MACHINE LEARNING
AN EARLY DRAFT OF A PROPOSED
TEXTBOOK
Nils J. Nilsson
Robotics Laboratory
Department of Computer Science
Stanford University
Stanford, CA 94305
e-mail: nilsson@cs.stanford.edu
September 26, 1996
Preface
These notes are in the process of becoming a textbook. The process is quite unfinished, and the author solicits corrections, criticisms, and suggestions from students and other readers. Although I have tried to eliminate errors, some undoubtedly remain; caveat lector. Many typographical infelicities will no doubt persist until the final version. More material has yet to be added. Please let me have your suggestions about topics that are too important to be left out. I hope that future versions will cover Hopfield nets, Elman nets and other recurrent nets, radial basis functions, grammar and automata learning, genetic algorithms, and Bayes networks. I am also collecting exercises and project suggestions which will appear in future versions. Yes, the final version will have a good index. [Marginal note: Some of my plans for additions and other reminders are mentioned in marginal notes.]
My intention is to pursue a middle ground between a theoretical textbook and one that focusses on applications. The book concentrates on the important ideas in machine learning. I do not give proofs of many of the theorems that I state, but I do give plausibility arguments and citations to formal proofs. And, I do not treat many matters that would be of practical importance in applications; the book is not a handbook of machine learning practice. Instead, my goal is to give the reader sufficient preparation to make the extensive literature on machine learning accessible.
Students in my Stanford courses on machine learning have already made several useful suggestions, as have my colleague, Pat Langley, and my teaching assistants, Ron Kohavi, Karl Pfleger, Robert Allen, and Lise Getoor.
Chapter 1
Preliminaries
1.1 Introduction
1.1.1 What is Machine Learning?
Learning, like intelligence, covers such a broad range of processes that it is difficult to define precisely. A dictionary definition includes phrases such as "to gain knowledge, or understanding of, or skill in, by study, instruction, or experience," and "modification of a behavioral tendency by experience."
Zoologists and psychologists study learning in animals and humans. In
this book we focus on learning in machines. There are several parallels
between animal and machine learning. Certainly, many techniques in ma-
chine learning derive from the efforts of psychologists to make more precise
their theories of animal and human learning through computational mod-
els. It seems likely also that the concepts and techniques being explored by
researchers in machine learning may illuminate certain aspects of biological
learning.
As regards machines, we might say, very broadly, that a machine learns
whenever it changes its structure, program, or data (based on its inputs
or in response to external information) in such a manner that its expected
future performance improves. Some of these changes, such as the addition
of a record to a data base, fall comfortably within the province of other dis-
ciplines and are not necessarily better understood for being called learning.
But, for example, when the performance of a speech-recognition machine
improves after hearing several samples of a person's speech, we feel quite
justified in that case to say that the machine has learned.
Machine learning usually refers to the changes in systems that perform tasks associated with artificial intelligence (AI). Such tasks involve recognition, diagnosis, planning, robot control, prediction, etc. The "changes" might be either enhancements to already performing systems or ab initio synthesis of new systems. To be slightly more specific, we show the architecture of a typical AI "agent" in Fig. 1.1. This agent perceives and models its environment and computes appropriate actions, perhaps by anticipating their effects. Changes made to any of the components shown in the figure might count as learning. Different learning mechanisms might be employed depending on which subsystem is being changed. We will study several different learning methods in this book.
[Figure 1.1: An AI System, with components for Perception, a Model of the environment, Planning and Reasoning, and Action Computation producing Actions.]
[Figure: function learning from a training set. A training set Ξ = {X1, X2, ..., Xi, ..., Xm} of n-component input vectors X = (x1, ..., xi, ..., xn) is used to select a hypothesis h ∈ H, whose output on an input X is h(X).]
learning a function; the value of the function is the name of the subset to which an input vector belongs.) Unsupervised learning methods have application in taxonomic problems in which it is desired to invent ways to classify data into meaningful categories.
We shall also describe methods that are intermediate between super-
vised and unsupervised learning.
We might either be trying to find a new function, h, or to modify an existing one. An interesting special case is that of changing an existing function into an equivalent one that is computationally more efficient. This type of learning is sometimes called speed-up learning. A very simple example of speed-up learning involves deduction processes. From the formulas A ⊃ B and B ⊃ C, we can deduce C if we are given A. From this deductive process, we can create the formula A ⊃ C, a new formula but one that does not sanction any more conclusions than those that could be derived from the formulas that we previously had. But with this new formula we can derive C more quickly, given A, than we could have done before. We can contrast speed-up learning with methods that create genuinely new functions, ones that might give different results after learning than they did before. We say that the latter methods involve inductive learning. As opposed to deduction, there are no correct inductions, only useful ones.
[Figure: a hypothesized function, h, fitted to sample f-values plotted over the inputs x1 and x2.]
1.2.3 Outputs
The output may be a real number, in which case the process embodying
the function, h, is called a function estimator, and the output is called an
output value or estimate.
Alternatively, the output may be a categorical value, in which case the process embodying h is variously called a classifier, a recognizer, or a categorizer, and the output itself is called a label, a class, a category, or a decision. Classifiers have application in a number of recognition problems, for example in the recognition of hand-printed characters. The input in that case is some suitable representation of the printed character, and the classifier maps this input into one of, say, 64 categories.
Vector-valued outputs are also possible with components being real
numbers or categorical values.
An important special case is that of Boolean output values. In that
case, a training pattern having value 1 is called a positive instance, and a
training sample having value 0 is called a negative instance. When the input is also Boolean, the classifier implements a Boolean function. We study the Boolean case in some detail because it allows us to make important general points in a simplified setting. Learning a Boolean function is sometimes called concept learning, and the function is called a concept.
1.2.5 Noise
Sometimes the vectors in the training set are corrupted by noise. There are
two kinds of noise. Class noise randomly alters the value of the function; attribute noise randomly alters the values of the components of the input vector. In either case, it would be inappropriate to insist that the hypothesized function agree precisely with the values of the samples in the training set.
[Figures: log2|Hv| plotted against j, the number of labeled patterns already seen. With no restriction on the hypothesis set, log2|Hv| = 2^n − j, and generalization is not possible; with a restricted subset Hc, log2|Hc| can reach zero well before j = 2^n, at a rate that depends on the order of presentation.]
some ordering scheme over all hypotheses. For example, if we had some way
of measuring the complexity of a hypothesis, we might select the one that
was simplest among those that performed satisfactorily on the training set.
The principle of Occam's razor, used in science to prefer simple explanations to more complex ones, is a type of preference bias. (William of Occam, 1285-?1349, was an English philosopher who said: "non sunt multiplicanda entia praeter necessitatem," which means "entities should not be multiplied unnecessarily.")
1.5 Sources
Besides the rich literature in machine learning (a small part of which is referenced in the Bibliography), there are several textbooks that are worth mentioning [Hertz, Krogh, & Palmer, 1991, Weiss & Kulikowski, 1991, Natarajan, 1991, Fu, 1994, Langley, 1996]. [Shavlik & Dietterich, 1990, Buchanan & Wilkins, 1993] are edited volumes containing some of the most important papers. A survey paper by [Dietterich, 1990] gives a good overview of many important topics. There are also well-established conferences and publications where papers are given and appear, including:
The Annual Conferences on Advances in Neural Information Processing Systems
The Annual Workshops on Computational Learning Theory
The Annual International Workshops on Machine Learning
The Annual International Conferences on Genetic Algorithms
(The Proceedings of the above-listed four conferences are published by Morgan Kaufmann.)
The journal Machine Learning (published by Kluwer Academic Publishers).
There is also much information, as well as programs and datasets, available
over the Internet through the World Wide Web.
Chapter 2
Boolean Functions
2.1 Representation
2.1.1 Boolean Algebra
Many important ideas about learning of functions are most easily presented using the special case of Boolean functions. There are several important subclasses of Boolean functions that are used as hypothesis classes for function learning. Therefore, we digress in this chapter to present a review of Boolean functions and their properties. (For a more thorough treatment see, for example, [Unger, 1989].)
A Boolean function, f(x1, x2, ..., xn), maps an n-tuple of (0,1) values to {0, 1}. Boolean algebra is a convenient notation for representing Boolean functions.
[Figure 2.1: Maps of the Boolean functions and (x1x2), or (x1 + x2), and xor (exclusive or), together with their algebraic expressions.]
a Boolean function that has value 1 if there are an even number of its arguments that have value 1; otherwise it has value 0.) Note that all adjacent cells in the table correspond to inputs differing in only one component. [Marginal note: Also describe general logic diagrams, [Wnek, et al., 1990].]
2.2 Classes of Boolean Functions
2.2.1 Terms and Clauses
To use absolute bias in machine learning, we limit the class of hypotheses.
In learning Boolean functions, we frequently use some of the common sub-
classes of those functions. Therefore, it will be important to know about
these subclasses.
One basic subclass is called terms. A term is any function written in the form l1 · l2 · · · lk, where the li are literals. Such a form is called a conjunction of literals. Some example terms are x1x7 and x1x2x4. The size of a term is the number of literals it contains. The examples are of sizes 2 and 3, respectively. (Strictly speaking, the class of conjunctions of literals
[Figure: a map of a Boolean function of x1, x2, x3, x4. Columns are labeled by x3,x4 = 00, 01, 11, 10 and rows by x1,x2 = 00, 01, 11, 10; the cell values are, by row: 1 0 1 0 / 0 1 0 1 / 1 0 1 0 / 0 1 0 1.]
[Figure: the function f = x2x3 + x1x3 + x2x1x3 = x2x3 + x1x3 shown on the three-dimensional cube, with the labeled vertices (0,0,1), (1,0,1), (1,1,1), and (1,0,0).]
xi f1 + x̄i f2 = xi f1 + x̄i f2 + f1 f2
where f1 and f2 are terms such that no literal appearing in f1 appears complemented in f2. f1 f2 is called the consensus of xi f1 and x̄i f2.
Readers familiar with the resolution rule of inference will note that consensus is the dual of resolution.
Examples: x1 is the consensus of x1x2 and x1x̄2. The terms x̄1x2 and x1x̄2 have no consensus since each term has more than one literal appearing complemented in the other.
Subsumption:
xi f1 + f1 = f1
[Figure: the function f = x2x3 + x1x3 + x1x2 = x1x2 + x1x3 shown on the three-dimensional cube.]
of f,
2. compute the consensus of a pair of terms in T and add the result to T,
3. eliminate any terms in T that are subsumed by other terms in T.
The terms remaining in T are then all of the prime implicants of f.
Example: Let f = x1x2 + x1x2x3 + x1x2x3x4x5. We show a derivation of a set of prime implicants in the consensus tree of Fig. 2.5. The circled numbers adjoining the terms indicate the order in which the consensus and subsumption operations were performed. Shaded boxes surrounding a term indicate that it was subsumed. The final form of the function, in which all terms are prime implicants, is: f = x1x2 + x1x3 + x1x4x5. Its terms are all of the non-subsumed terms in the consensus tree.
[Figure 2.5: A Consensus Tree. The starting terms x1x2, x1x2x3, and x1x2x3x4x5 yield, through numbered consensus and subsumption steps (including the intermediate terms x1x3 and x1x2x4x5), the prime implicants of f = x1x2 + x1x3 + x1x4x5.]
(ti, vi)
...
(t2, v2)
(T, v1)
where the vi are either 0 or 1, the ti are terms in (x1, ..., xn), and T is a term whose value is 1 (regardless of the values of the xi). The value of a decision list is the value of vi for the first ti in the list that has value 1. (At least one ti will have value 1, because the last one does; v1 can be regarded as a default value of the decision list.) The decision list is of size k if the size of the largest term in it is k. The class of decision lists of size k or less is called k-DL.
An example decision list is:
f =
(x1x2, 1)
(x1x2x3, 0)
(x2x3, 1)
(1, 0)
f has value 0 for x1 = 0, x2 = 0, and x3 = 1. It has value 1 for x1 = 1, x2 = 0, and x3 = 1. This function is in 3-DL.
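To make the evaluation rule concrete, the following Python sketch (the representation of terms as lists of signed components is my own choice, not from the text) walks down a decision list and returns the value paired with the first term whose value is 1:

def eval_term(term, x):
    # A term is a list of (index, required_value) pairs; it has value 1
    # iff every listed component of x matches its required value.
    return int(all(x[i] == v for i, v in term))

def eval_decision_list(dlist, x):
    # dlist is an ordered list of (term, value) pairs; the last term is
    # the always-true term [] and supplies the default value.
    for term, value in dlist:
        if eval_term(term, x):
            return value
    raise ValueError("decision list must end with the always-true term")

# Hypothetical example: f = (x1 AND x2 -> 1), (x3 -> 0), (default -> 1); indices 0-based.
f = [([(0, 1), (1, 1)], 1), ([(2, 1)], 0), ([], 1)]
print(eval_decision_list(f, [1, 1, 0]))  # prints 1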
It has been shown that the class k-DL is a strict superset of the union of k-DNF and k-CNF. There are 2^{O(n^k k log n)} functions in k-DL [Rivest, 1987].
Interesting generalizations of decision lists use other Boolean functions in place of the terms, ti. For example we might use linearly separable functions in place of the ti (see below and [Marchand & Golea, 1993]).
2.2.5 Symmetric and Voting Functions
A Boolean function is called symmetric if it is invariant under permutations
of the input variables. For example, any function that is dependent only on
the number of input variables whose values are 1 is a symmetric function.
The parity functions, which have value 1 or 0 depending on whether the number of input variables with value 1 is even or odd, are symmetric functions. (The exclusive or function, illustrated in Fig. 2.1, is an odd-parity function of two dimensions. The or and and functions of two dimensions are also symmetric.)
An important subclass of the symmetric functions is the class of voting functions (also called m-of-n functions). A k-voting function has value 1 if and only if k or more of its n inputs have value 1. If k = 1, a voting function is the same as an n-sized clause; if k = n, a voting function is the same as an n-sized term; if k = (n + 1)/2 for n odd or k = 1 + n/2 for n even, we have the majority function.
f = thresh(Σ_{i=1}^{n} w_i x_i, θ)
where the w_i, i = 1, ..., n, are real-valued numbers called weights, θ is a real-valued number called the threshold, and thresh(σ, θ) is 1 if σ ≥ θ and 0 otherwise.
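As a concrete illustration, here is a minimal Python sketch of such a threshold element (the function names and the example weights are mine, not from the text):

def thresh(s, theta):
    # 1 if the weighted sum s reaches the threshold theta, else 0
    return 1 if s >= theta else 0

def linearly_separable(x, w, theta):
    # f = thresh(sum_i w_i * x_i, theta) for input vector x and weights w
    s = sum(wi * xi for wi, xi in zip(w, x))
    return thresh(s, theta)

# A 2-of-3 voting (majority) function: all weights 1, threshold 2.
print(linearly_separable([1, 0, 1], w=[1, 1, 1], theta=2))  # prints 1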
2.3 Summary
The diagram in Fig. 2.6 shows some of the set inclusions of the classes of
Boolean functions that we have considered. We will be confronting these
classes again in later chapters.
The sizes of the various classes are given in the following table (adapted from [Dietterich, 1990, page 262]):

Class           Size of Class
terms           3^n
clauses         3^n
k-term DNF      2^{O(kn)}
k-clause CNF    2^{O(kn)}
k-DNF           2^{O(n^k)}
k-CNF           2^{O(n^k)}
k-DL            2^{O(n^k k log n)}
lin sep         2^{O(n^2)}
DNF (all)       2^{2^n}
[Figure 2.6: Set inclusions among classes of Boolean functions: k-size-terms, terms, k-DNF, k-DL, DNF (all functions), and the linearly separable functions.]
hypotheses in H that are not consistent with the values in the training set are ruled out by the training set.
We could imagine (conceptually only!) that we have devices for implementing every function in H. An incremental training procedure could then be defined which presented each pattern in Ξ to each of these functions and then eliminated those functions whose values for that pattern did not agree with its given value. At any stage of the process we would then have left some subset of functions that are consistent with the patterns presented so far; this subset is the version space for the patterns already presented. This idea is illustrated in Fig. 3.1.
Consider the following procedure for classifying an arbitrary input pat-
tern, X: the pattern is put in the same class (0 or 1) as are the majority of
the outputs of the functions in the version space. During the learning pro-
cedure, if this majority is not equal to the value of the pattern presented,
[Figure 3.1: Implementing the Version Space. Each of the K = |H| functions h1, h2, ..., hj, ..., hK in the subset H of all Boolean functions outputs 1 or 0 for a presented pattern.]
more than 1.585n mistakes before learning f, and otherwise we would make no more than that number of mistakes before being able to decide that f is not a term.
Even if we do not have sufficient training patterns to reduce the ver-
sion space to a single function, it may be that there are enough training
patterns to reduce the version space to a set of functions such that most
of them assign the same values to most of the patterns we will see hence-
forth. We could select one of the remaining functions at random and be
reasonably assured that it will generalize satisfactorily. We next discuss a
computationally more feasible method for representing the version space.
has value 1 for all of the arguments for which f2 has value 1, and f1 ≠ f2. For example, x3 is more general than x2x3 but is not more general than x3 + x2.
[Figure 3.2: A Version Graph for Terms over x1, x2, x3, ordered from the most general function "1" at the top through terms of sizes k = 1, 2, and 3 down to "0" at the bottom; for simplicity, only some arcs in the graph are shown.]
shown shaded in Fig. 3.3. We also show there the three-dimensional cube representation in which the vertex (1, 0, 1) has value 0.
In a version graph, there are always a set of hypotheses that are maximally general and a set of hypotheses that are maximally specific. These are called the general boundary set (gbs) and the specific boundary set (sbs), respectively. In Fig. 3.4, we have the version graph as it exists after learning that (1,0,1) has value 0 and (1, 0, 0) has value 1. The gbs and sbs are shown.
Boundary sets are important because they provide an alternative to
representing the entire version space explicitly, which would be impractical.
Given only the boundary sets, it is possible to determine whether or not
any hypothesis (in the prescribed class of Boolean functions we are using)
[Figure 3.3: The Version Graph Upon Seeing (1, 0, 1) Labeled 0. Nodes inconsistent with that pattern are ruled out; only some arcs in the graph are shown.]
[Figure 3.4: The Version Graph Upon Seeing (1, 0, 1) and (1, 0, 0). The remaining hypotheses are more specific than the general boundary set (gbs) and more general than the specific boundary set (sbs).]
(1, 0, 0) but does not contain (1, 0, 1) is the bottom face of the cube, corresponding to the function x̄3. In Figs. 3.2 through 3.4 the sbs is always singular. Version spaces for terms always have singular specific boundary sets. As seen in Fig. 3.3, however, the gbs of a version space for terms need not be singular.
than some member of the new general boundary set. It might be that h ∨ g = h. Also, least generalizations of two different functions in the
than some member of the new specific boundary set. Again, it might be that s = h, and least specializations of two different functions in
We start with general boundary set, "1", and specific boundary set, "0". After seeing the first sample, (1, 0, 1), labeled with a 0, the specific boundary set stays at "0" (it is necessary), and we change the general boundary set to {x̄1, x2, x̄3}. The functions x̄1, x2, and x̄3 are least specializations of "1" (they are necessary, "1" is not, they are more general than "0", and there are no functions that are more general than they and also necessary).
Then, after seeing (1, 0, 0), labeled with a 1, the general boundary set changes to {x̄3} (because x̄1 and x2 are not sufficient), and the specific boundary set changes to {x1x̄2x̄3}, which is the least generalization of "0" (it is sufficient, "0" is more specific than it, no function (including "0") that is more specific than it is sufficient, and it is more specific than some member of the general boundary set).
When we see (1, 1, 1), labeled with a 0, we do not change the specific boundary set because its function is still necessary. We do not change the general boundary set either because x̄3 is still necessary.
Finally, when we see (0, 0, 1), labeled with a 0, we do not change the specific boundary set because its function is still necessary. We do not change the general boundary set either because x̄3 is still necessary.
3.5 Bibliographical and Historical Remarks
[Marginal note: Maybe I'll put in an example of a version graph for non-Boolean functions.]
The concept of version spaces and their role in learning was first investigated by Tom Mitchell [Mitchell, 1982]. Although these ideas are not used in practical machine learning procedures, they do provide insight into the nature of hypothesis selection. In order to accommodate noisy data, version spaces have been generalized by [Hirsh, 1994] to allow hypotheses that are not necessarily consistent with the training set. [Marginal note: More to be added.]
Chapter 4
Neural Networks
In chapter two we defined several important subsets of Boolean functions. Suppose we decide to use one of these subsets as a hypothesis set for supervised function learning. We next have the question of how best to implement the function as a device that gives the outputs prescribed by the function for arbitrary inputs. In this chapter we describe how networks of non-linear elements can be used to implement various input-output functions and how they can be trained using supervised learning methods.
Networks of non-linear elements, interconnected through adjustable weights, play a prominent role in machine learning. They are called neural networks because the non-linear elements have as their inputs a weighted sum of the outputs of other elements, much like networks of biological
neurons do. These networks commonly use the threshold element which
we encountered in chapter two in our study of linearly separable Boolean
functions. We begin our treatment of neural nets by studying this thresh-
old element and how it can be used in the simplest of all networks, namely
ones composed of a single threshold element.
[Figure: a Threshold Logic Unit (TLU). Inputs x1, ..., xn are weighted by w1, ..., wn and summed; a threshold weight w_{n+1} on the constant input x_{n+1} = 1 allows the threshold to be taken as θ = 0, so that f = thresh(Σ_{i=1}^{n+1} w_i x_i, 0).]
[Figure: the geometry of a TLU. The equation of the separating hyperplane is X · W + w_{n+1} = 0, or equivalently X · n + w_{n+1}/|W| = 0, where n = W/|W| is the unit vector normal to the hyperplane; X · W + w_{n+1} > 0 on one side of the hyperplane and X · W + w_{n+1} < 0 on the other.]
Clauses
The negation of a clause is a term. For example, the negation of the clause f = x1 + x2 + x3 is the term f̄ = x̄1x̄2x̄3. A hyperplane can be used to implement this term. If we "invert" the hyperplane, it will implement the clause instead. Inverting a hyperplane is done by multiplying all of the TLU weights, even w_{n+1}, by −1. This process simply changes the orientation of the hyperplane, flipping it around by 180 degrees and thus changing its "positive side." Therefore, linearly separable functions are also a superset of clauses. We show an example in Fig. 4.4.
[Figures: TLU implementations of a term (separating plane x1 + x2 − 3/2 = 0) and, in Fig. 4.4, of a clause obtained by inverting the weights (separating plane x1 + x2 + x3 − 1/2 = 0).]
pattern labels and the dot product computed by a TLU. For this purpose, the pattern labels are assumed to be either +1 or −1 (instead of 1 or 0). The squared error for a pattern, Xi, with label di (for desired output) is:
ε_i = (d_i − Σ_{j=1}^{n+1} x_{ij} w_j)^2
where x_{ij} is the j-th component of Xi. The total squared error (over all patterns in a training set, Ξ, containing m patterns) is then:
ε = Σ_{i=1}^{m} (d_i − Σ_{j=1}^{n+1} x_{ij} w_j)^2
We want to choose the weights wj to minimize this squared error. One way to find such a set of weights is to start with an arbitrary weight vector and move it along the negative gradient of ε as a function of the weights. Since ε is quadratic in the wj, we know that it has a global minimum, and thus this steepest descent procedure is guaranteed to find the minimum. Each component of the gradient is the partial derivative of ε with respect to one of the weights. One problem with taking the partial derivative of ε is that ε depends on all the input vectors in Ξ. Often, it is preferable to use an incremental procedure in which we try the TLU on just one element, Xi,
[Figures: an R-category TLU. R weight vectors W1, ..., WR each form the dot product Wi · X, and an ARGMAX unit selects the largest; the weight vectors partition the input space into regions R1, ..., R5, and in region R4, for example, X · W4 ≥ X · Wi for i ≠ 4.]
and
Wv ← Wv − ci Xi
and all other weight vectors are not changed.
This correction increases the value of the u-th dot product and decreases the value of the v-th dot product. Just as in the 2-category fixed increment procedure, this procedure is guaranteed to terminate, for constant ci, if there exist weight vectors that make correct separations of the training set. Note that when R = 2, this procedure reduces to the ordinary TLU error-correction procedure. A proof that this procedure terminates is given in [Nilsson, 1990, pp. 88-90] and in [Duda & Hart, 1973, pp. 174-177].
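A small Python sketch of this R-category procedure (the names are mine, not from the text): classification is by the largest dot product, and on an error the correct category's weight vector is reinforced while the wrongly chosen one is penalized.

def classify(weight_vectors, x):
    # Return the index of the weight vector with the largest dot product with x.
    dots = [sum(wi * xi for wi, xi in zip(w, x)) for w in weight_vectors]
    return dots.index(max(dots))

def correct(weight_vectors, x, u, c=1.0):
    # u is the correct category; if the classifier picks v != u,
    # add c*x to W_u and subtract c*x from W_v (all others unchanged).
    v = classify(weight_vectors, x)
    if v != u:
        weight_vectors[u] = [wi + c * xi for wi, xi in zip(weight_vectors[u], x)]
        weight_vectors[v] = [wi - c * xi for wi, xi in zip(weight_vectors[v], x)]
    return weight_vectors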
[Figures: A Feedforward, 2-layer Network. The hidden units compute conjunctions of literals (terms), and the output unit computes a disjunction of those terms; an example network with weights ±1 and thresholds 1.5, 0.5, and −0.5 implements a Boolean function of x1 and x2.]
[Figures: a network in which R groups of threshold units each feed a MAX unit, with an ARGMAX taken over the R maxima; and a multi-layer network with layers labeled L1, L2, and L3.]
given by mj. The vector Wi^{(j)} has components w_{li}^{(j)} for l = 1, ..., m_{j−1} + 1.
[Figure: a k-layer feedforward network, showing the weight vectors Wi^{(1)}, ..., Wi^{(j)}, ..., Wi^{(k−1)}, the final-layer weights w_l^{(k)} in W^{(k)}, and the output f.]
where w_{li}^{(j)} is the l-th component of Wi^{(j)}. This vector partial derivative of ε is called the gradient of ε with respect to W and is sometimes denoted by ∇_W ε.
Since ε's dependence on Wi^{(j)} is entirely through s_i^{(j)}, we can use the chain rule to write:
∂ε/∂Wi^{(j)} = (∂ε/∂s_i^{(j)}) (∂s_i^{(j)}/∂Wi^{(j)})
Because s_i^{(j)} = X^{(j−1)} · Wi^{(j)}, we have ∂s_i^{(j)}/∂Wi^{(j)} = X^{(j−1)}. Substituting yields:
∂ε/∂Wi^{(j)} = (∂ε/∂s_i^{(j)}) X^{(j−1)}
with
δ_i^{(j)} = (d − f) ∂f/∂s_i^{(j)}
We have a problem, however, in attempting to carry out the partial derivatives of f with respect to the s's. The network output, f, is not continuously differentiable with respect to the s's because of the presence of the threshold functions. Most small changes in these sums do not change f at all, and when f does change, it changes abruptly from 1 to 0 or vice versa.
A way around this difficulty was proposed by Werbos [Werbos, 1974] and (perhaps independently) pursued by several other researchers, for example [Rumelhart, Hinton, & Williams, 1986]. The trick involves replacing all the threshold functions by differentiable functions called sigmoids.¹ The output of a sigmoid function, superimposed on that of a threshold function, is shown in Fig. 4.18. Usually, the sigmoid function used is f(s) = 1/(1 + e^{−s}), where s is the input and f is the output.
[Figure 4.18: A Sigmoid Function, f(s) = 1/(1 + e^{−s}), superimposed on a threshold function.]
We show the network containing sigmoid units in place of TLUs in Fig. 4.19. The output of the i-th sigmoid unit in the j-th layer is denoted by f_i^{(j)}. (That is, f_i^{(j)} = 1/(1 + e^{−s_i^{(j)}}).)
¹[Russell & Norvig 1995, page 595] attributes the use of this idea to [Bryson & Ho 1969].
[Figure 4.19: A network of sigmoid units. Each unit in layer j has summed input s_i^{(j)}, output f_i^{(j)}, weights Wi^{(j)}, and an associated δ_i^{(j)}; the final layer has weights W^{(k)}, summed input s^{(k)}, output f^{(k)}, and δ^{(k)}.]
δ^{(k)} = (d − f^{(k)}) ∂f^{(k)}/∂s^{(k)}
Given the sigmoid function that we are using, namely f(s) = 1/(1 + e^{−s}), we have that ∂f/∂s = f(1 − f). Substituting gives us:
δ^{(k)} = (d − f^{(k)}) f^{(k)} (1 − f^{(k)})
Rewriting our general rule for weight vector changes, the weight vector in the final layer is changed according to the rule:
W^{(k)} ← W^{(k)} + c^{(k)} δ^{(k)} X^{(k−1)}
where
δ^{(k)} = (d − f^{(k)}) f^{(k)} (1 − f^{(k)})
It is interesting to compare backpropagation to the error-correction rule and to the Widrow-Hoff rule. The backpropagation weight adjustment for the single element in the final layer can be written as:
W ← W + c(d − f)f(1 − f)X
Written in the same format, the error-correction rule is:
W ← W + c(d − f)X
and the Widrow-Hoff rule is:
W ← W + c(d − f)X
The only difference (except for the fact that f is not thresholded in Widrow-Hoff) is the f(1 − f) term due to the presence of the sigmoid function. With the sigmoid function, f(1 − f) can vary in value from 0 to 1/4. When f is 0, f(1 − f) is also 0; when f is 1, f(1 − f) is 0; f(1 − f) obtains its maximum value of 1/4 when f is 1/2 (that is, when the input to the sigmoid is 0). The sigmoid function can be thought of as implementing a "fuzzy" hyperplane. For a pattern far away from this fuzzy hyperplane, f(1 − f) has value close to 0, and the backpropagation rule makes little or no change to the weight values regardless of the desired output. (Small changes in the weights will have little effect on the output for inputs far from the hyperplane.) Weight changes are only made within the region of "fuzz" surrounding the hyperplane, and these changes are in the direction of correcting the error, just as in the error-correction and Widrow-Hoff rules.
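The parallel among the three rules can be made explicit in code; in the sketch below (my own naming, not from the text), only the computation of the output and of the multiplier on (d − f) differs among the rules.

import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def update(w, x, d, c, rule):
    s = sum(wi * xi for wi, xi in zip(w, x))
    if rule == "error-correction":         # f is the thresholded output
        f = 1 if s >= 0 else 0
        delta = d - f
    elif rule == "widrow-hoff":            # f is the raw dot product
        delta = d - s
    else:                                   # single sigmoid unit (backprop)
        f = sigmoid(s)
        delta = (d - f) * f * (1 - f)       # the extra f(1 - f) factor
    return [wi + c * delta * xi for wi, xi in zip(w, x)]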
δ_i^{(j)} = (d − f) ∂f/∂s_i^{(j)}
Again we use a chain rule. The final output, f, depends on s_i^{(j)} through each of the summed inputs to the sigmoids in the (j + 1)-th layer. So:
δ_i^{(j)} = (d − f) ∂f/∂s_i^{(j)}
= (d − f) [ (∂f/∂s_1^{(j+1)})(∂s_1^{(j+1)}/∂s_i^{(j)}) + · · · + (∂f/∂s_l^{(j+1)})(∂s_l^{(j+1)}/∂s_i^{(j)}) + · · · + (∂f/∂s_{m_{j+1}}^{(j+1)})(∂s_{m_{j+1}}^{(j+1)}/∂s_i^{(j)}) ]
= Σ_{l=1}^{m_{j+1}} (d − f) (∂f/∂s_l^{(j+1)}) (∂s_l^{(j+1)}/∂s_i^{(j)}) = Σ_{l=1}^{m_{j+1}} δ_l^{(j+1)} (∂s_l^{(j+1)}/∂s_i^{(j)})
It remains to compute the ∂s_l^{(j+1)}/∂s_i^{(j)}'s. To do that we first write:
s_l^{(j+1)} = X^{(j)} · W_l^{(j+1)} = Σ_{ν=1}^{m_j + 1} f_ν^{(j)} w_{νl}^{(j+1)}
And then, since the weights do not depend on the s's:
∂s_l^{(j+1)}/∂s_i^{(j)} = ∂[Σ_{ν=1}^{m_j + 1} f_ν^{(j)} w_{νl}^{(j+1)}]/∂s_i^{(j)} = Σ_{ν=1}^{m_j + 1} w_{νl}^{(j+1)} (∂f_ν^{(j)}/∂s_i^{(j)})
Now, we note that ∂f_ν^{(j)}/∂s_i^{(j)} = 0 unless ν = i, in which case ∂f_i^{(j)}/∂s_i^{(j)} = f_i^{(j)}(1 − f_i^{(j)}). Therefore:
∂s_l^{(j+1)}/∂s_i^{(j)} = w_{il}^{(j+1)} f_i^{(j)} (1 − f_i^{(j)})
We use this result in our expression for δ_i^{(j)} to give:
δ_i^{(j)} = f_i^{(j)} (1 − f_i^{(j)}) Σ_{l=1}^{m_{j+1}} δ_l^{(j+1)} w_{il}^{(j+1)}
The above equation is recursive in the δ's. (It is interesting to note that this expression is independent of the error function; the error function explicitly affects only the computation of δ^{(k)}.) Having computed the δ_i^{(j+1)}'s for layer j + 1, we can use this equation to compute the δ_i^{(j)}'s. The base case is δ^{(k)}, which we have already computed:
δ^{(k)} = (d − f^{(k)}) f^{(k)} (1 − f^{(k)})
We use this expression for the δ's in our generic weight changing rule, namely:
Wi^{(j)} ← Wi^{(j)} + c_i^{(j)} δ_i^{(j)} X^{(j−1)}
Although this rule appears complex, it can be simply implemented by "backpropagating" the δ's through the weights in order to adjust the amount by which X vectors are added to or subtracted from weight vectors (thus, the name backprop for this algorithm). [Marginal note: Add additional intuitive explanation.]
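As one more intuitive view of the rule, here is a compact Python sketch of the forward pass and the δ recursion for a fully connected network of sigmoid units; the data layout, names, and single-output assumption are mine, and the trailing 1 appended to each layer's output plays the role of the threshold input.

import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def forward(layers, x):
    # layers[j] is a list of weight vectors, one per unit in layer j+1;
    # each weight vector includes a final threshold weight for the appended 1.
    activations = [list(x) + [1.0]]
    for layer in layers:
        out = [sigmoid(sum(w * a for w, a in zip(unit_w, activations[-1])))
               for unit_w in layer]
        activations.append(out + [1.0])
    return activations

def backprop_step(layers, x, d, c=0.5):
    acts = forward(layers, x)
    f = acts[-1][0]                       # single output unit
    deltas = [[(d - f) * f * (1 - f)]]    # base case: final-layer delta
    for j in range(len(layers) - 1, 0, -1):
        layer_out = acts[j][:-1]          # outputs f_i^(j), excluding the appended 1
        next_layer = layers[j]            # weights into layer j+1
        deltas.insert(0, [fi * (1 - fi) *
                          sum(deltas[0][l] * next_layer[l][i]
                              for l in range(len(next_layer)))
                          for i, fi in enumerate(layer_out)])
    for j, layer in enumerate(layers):    # W <- W + c * delta * X^(j-1)
        for i, unit_w in enumerate(layer):
            for l in range(len(unit_w)):
                unit_w[l] += c * deltas[j][i] * acts[j][l]
    return layers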
4.4.5 Variations on Backprop
[To be written: problem of local minima, simulated annealing, momentum (Plaut, et al., 1986, see [Hertz, Krogh, & Palmer, 1991]), quickprop, regularization methods.]
Simulated Annealing
To apply simulated annealing, the value of the learning rate constant is gradually decreased with time. If we fall early into an error-function valley that is not very deep (a local minimum), it typically will not be very broad either, and soon a subsequent large correction will jostle us out of it. It is less likely that we will move out of deep valleys, and at the end of the process (with very small values of the learning rate constant), we descend to its deepest point. The process gets its name by analogy with annealing in metallurgy, in which a material's temperature is gradually decreased, allowing its crystalline structure to reach a minimal energy state.
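For instance, one simple decaying schedule of the kind suggested here might look as follows (the particular functional form and parameter values are my own illustration, not prescribed by the text):

def learning_rate(step, c0=1.0, tau=1000.0):
    # Gradually "cool" the learning rate as training proceeds.
    return c0 / (1.0 + step / tau)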
[Figure: the network used to steer a vehicle. A 30 x 32 retina provides 960 inputs to 5 hidden units, each connected to all 960 inputs; 30 output units, connected to all hidden units, range from "sharp left" through "straight ahead" to "sharp right," and the centroid of their outputs steers the vehicle.]
a linear order and control the van's steering angle. If a unit near the top of the array of output units has a higher output than most of the other units, the van is steered to the left; if a unit near the bottom of the array has a high output, the van is steered to the right. The "centroid" of the responses of all of the output units is computed, and the van's steering angle is set at a corresponding value between hard left and hard right.
The system is trained by a modified on-line training regime. A driver drives the van, and his actual steering angles are taken as the correct labels for the corresponding inputs. The network is trained incrementally by backprop to produce the driver-specified steering angles in response to each visual pattern as it occurs in real time while driving.
This simple procedure has been augmented to avoid two potential problems. First, since the driver is usually driving well, the network would never get any experience with far-from-center vehicle positions and/or incorrect vehicle orientations. Also, on long, straight stretches of road, the network would be trained for a long time only to produce straight-ahead steering angles; this training would swamp out earlier training to follow a curved road. We wouldn't want to try to avoid these problems by instructing the
driver to drive erratically occasionally, because the system would learn to
mimic this erratic behavior.
Instead, each original image is shifted and rotated in software to create 14 additional images in which the vehicle appears to be situated differently relative to the road. Using a model that tells the system what steering angle ought to be used for each of these shifted images, given the driver-specified steering angle for the original image, the system constructs an additional 14 labeled training patterns to add to those encountered during ordinary driver training.
Chapter 6
Decision Trees
6.1 Definitions
A decision tree (generally defined) is a tree whose internal nodes are tests (on input patterns) and whose leaf nodes are categories (of patterns). We show an example in Fig. 6.1. A decision tree assigns a class number (or output) to an input pattern by filtering the pattern down through the tests in the tree. Each test has mutually exclusive and exhaustive outcomes. For example, test T2 in the tree of Fig. 6.1 has three outcomes; the left-most one assigns the input pattern to class 3, the middle one sends the input pattern down to test T4, and the right-most one assigns the pattern to class 1. We follow the usual convention of depicting the leaf nodes by the class number.¹ Note that in discussing decision trees we are not limited to implementing Boolean functions; they are useful for general, categorically valued functions.
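To make the filtering idea concrete, here is a small Python sketch (the representation and names are mine, not from the text): internal nodes hold a test and one child per outcome, and leaves hold class numbers.

class Leaf:
    def __init__(self, label):
        self.label = label

class Node:
    def __init__(self, test, children):
        self.test = test          # function mapping a pattern to an outcome
        self.children = children  # dict: outcome -> subtree

def classify(tree, pattern):
    # Filter the pattern down through the tests until a leaf is reached.
    while isinstance(tree, Node):
        tree = tree.children[tree.test(pattern)]
    return tree.label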
There are several dimensions along which decision trees might differ:
1. The tests might be multivariate (testing on several features of the
input at once) or univariate (testing on only one of the features).
2. The tests might have two outcomes or more than two. (If all of the
tests have two outcomes, we have a binary decision tree.)
¹One of the researchers who has done a lot of work on learning decision trees is Ross Quinlan. Quinlan distinguishes between classes and categories. He calls the subsets of patterns that filter down to each tip categories and subsets of patterns having the same label classes. In Quinlan's terminology, our example tree has nine categories and three classes. We will not make this distinction, however, but will use the words "category" and "class" interchangeably to refer to what Quinlan calls "class."
[Figure 6.1: A Decision Tree. The root test T1 branches to tests T2, T3, and T4; the leaf nodes are labeled with class numbers 1, 2, and 3.]
4. We might have two classes or more than two. If we have two classes
and binary inputs, the tree implements a Boolean function, and is
called a Boolean decision tree.
[Figure: a univariate Boolean decision tree over x1, x2, x3, x4 implementing f = x3x2 + x3x4x1; each leaf is labeled with the conjunction of literals tested along the path to it.]
H(Ξ_j) = − Σ_i p(i|j) log_2 p(i|j)
and the reduction in uncertainty (beyond knowing only that the pattern was in Ξ) would be:
H(Ξ) − H(Ξ_j)
Of course we cannot say that the test T is guaranteed always to produce that amount of reduction in uncertainty because we don't know that the result of the test will be the j-th outcome. But we can estimate the average uncertainty over all the Ξ_j, by:
E[H_T(Ξ)] = Σ_j p(j) H(Ξ_j)
where by H_T(Ξ) we mean the average uncertainty after performing test T on the patterns in Ξ, p(j) is the probability that the test has outcome j, and the sum is taken from 1 to k. Again, we don't know the probabilities p(j), but we can use sample values. The estimate p̂(j) of p(j) is just the number of those patterns in Ξ that have outcome j divided by the total number of patterns in Ξ. The average reduction in uncertainty achieved by test T (applied to patterns in Ξ) is then:
R_T(Ξ) = H(Ξ) − E[H_T(Ξ)]
An important family of decision tree learning algorithms selects for the
root of the tree that test that gives maximum reduction of uncertainty, and
then applies this criterion recursively until some termination condition is
met (which we shall discuss in more detail later). The uncertainty calcu-
lations are particularly simple when the tests have binary outcomes and
when the attributes have binary values. We'll give a simple example to
illustrate how the test selection mechanism works in that case.
Suppose we want to use the uncertainty-reduction method to build a
decision tree to classify the following patterns:
pattern class
(0, 0, 0) 0
(0, 0, 1) 0
(0, 1, 0) 0
(0, 1, 1) 0
(1, 0, 0) 0
(1, 0, 1) 1
(1, 1, 0) 0
(1, 1, 1) 1
[Figure 6.5: the eight patterns placed at the vertices of a cube with axes x1, x2, x3, and the split made by the test x1.]
What single test, x1, x2, or x3, should be performed first? The illustration in Fig. 6.5 gives geometric intuition about the problem.
The initial uncertainty for the set, Ξ, containing all eight points is:
H(Ξ) = −(6/8) log_2(6/8) − (2/8) log_2(2/8) = 0.81
Next, we calculate the uncertainty reduction if we perform x1 first. The left-hand branch has only patterns belonging to class 0 (we call them the set Ξ_l), and the right-hand branch (Ξ_r) has two patterns in each class. So, the uncertainty of the left-hand branch is:
H_{x1}(Ξ_l) = −(4/4) log_2(4/4) − (0/4) log_2(0/4) = 0
And the uncertainty of the right-hand branch is:
H_{x1}(Ξ_r) = −(2/4) log_2(2/4) − (2/4) log_2(2/4) = 1
Half of the patterns "go left" and half "go right" on test x1. Thus, the average uncertainty after performing the x1 test is:
(1/2) H_{x1}(Ξ_l) + (1/2) H_{x1}(Ξ_r) = 0.5
Therefore the uncertainty reduction on Ξ achieved by x1 is 0.81 − 0.5 = 0.31.
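The same arithmetic can be reproduced mechanically; the short Python sketch below (names are mine) computes the average uncertainty reduction of each single-variable test on the eight training patterns above.

import math

patterns = [((0,0,0),0), ((0,0,1),0), ((0,1,0),0), ((0,1,1),0),
            ((1,0,0),0), ((1,0,1),1), ((1,1,0),0), ((1,1,1),1)]

def H(samples):
    # Entropy of the class labels in a list of (pattern, class) pairs.
    if not samples:
        return 0.0
    p1 = sum(c for _, c in samples) / len(samples)
    return sum(-p * math.log2(p) for p in (p1, 1 - p1) if p > 0)

def gain(samples, i):
    # Average uncertainty reduction achieved by testing component i.
    left = [(x, c) for x, c in samples if x[i] == 0]
    right = [(x, c) for x, c in samples if x[i] == 1]
    expected = (len(left) * H(left) + len(right) * H(right)) / len(samples)
    return H(samples) - expected

for i in range(3):
    print("x%d: %.2f" % (i + 1, gain(patterns, i)))  # e.g. x1: 0.31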
[Figure: the decision tree for f = x3x2 + x3x4x1 redrawn as a network: threshold units implement the terms (conjunctions) along the tree's paths, and a final unit forms their disjunction.]
[Figure: a similar construction in which the node tests are linearly separable functions L1, L2, L3, L4; conjunctions of these tests feed a final disjunction producing f.]
Leave-one-out validation is the same as cross validation for the special case in which K equals the number of patterns in Ξ, and each Ξ_i consists of a single pattern. When testing on each Ξ_i, we simply note whether or not a mistake was made. We count the total number of mistakes and divide by K to get the estimated error rate. This type of validation is, of course, more expensive computationally, but useful when a more accurate estimate of the error rate for a classifier is needed. [Marginal note: Describe "bootstrapping" also [Efron, 1982].]
6.4.3 Avoiding Overfitting in Decision Trees
Near the tips of a decision tree there may be only a few patterns per node. For these nodes, we are selecting a test based on a very small sample, and thus we are likely to be overfitting. This problem can be dealt with by terminating the test-generating procedure before all patterns are perfectly split into their separate categories. That is, a leaf node may contain patterns of more than one class, but we can decide in favor of the most numerous class. This procedure will result in a few errors, but often accepting a small number of errors on the training set results in fewer errors on a testing set. This behavior is illustrated in Fig. 6.8.
[Figure 6.8: training errors and validation errors plotted against the number of terminal nodes (0 through 9). (From Weiss, S., and Kulikowski, C., Computer Systems that Learn, Morgan Kaufmann, 1991.)]
[Figures: decision trees over x1, x2, x3, x4 illustrating the replicated subtree problem, in which the same subtree (testing x3 and x4) appears under more than one branch.]
for x1x2 and a test for x3x4, the decision tree could be much simplified, as shown in Fig. 6.11. Several researchers have proposed techniques for learning decision trees in which the tests at each node are linearly separable functions. [John, 1995] gives a nice overview (with several citations) of learning such linear discriminant trees and presents a method based on "soft entropy."
A third method for dealing with the replicated subtree problem involves extracting propositional "rules" from the decision tree. The rules will have as antecedents the conjunctions that lead down to the leaf nodes, and as consequents the name of the class at the corresponding leaf node. An example rule from the tree with the repeating subtree of our example would be: x1 ∧ ¬x2 ∧ x3 ∧ x4 ⊃ 1. Quinlan [Quinlan, 1987] discusses methods for reducing a set of rules to a simpler set by 1) eliminating from the antecedent of each rule any "unnecessary" conjuncts, and then 2) eliminating "unnecessary" rules. A conjunct or rule is determined to be unnecessary if its elimination has little effect on classification accuracy, as determined by a chi-square test, for example. After a rule set is processed, it might be the case that more than one rule is "active" for any given pattern, and care must be taken that the active rules do not conflict in their decision about the class of a pattern.
[Figure 6.11: the simplified decision tree obtained when tests for x1x2 and x3x4 are allowed at the nodes.]
Chapter 7
Inductive Logic Programming
There are many different representational forms for functions of input variables. So far, we have seen (Boolean) algebraic expressions, decision trees, and neural networks, plus other computational mechanisms such as techniques for computing nearest neighbors. Of course, the representation most important in computer science is a computer program. For example, a Lisp predicate of binary-valued inputs computes a Boolean function of those inputs. Similarly, a logic program (whose ordinary application is to compute bindings for variables) can also be used simply to decide whether or not a predicate has value True (T) or False (F). For example, the Boolean exclusive-or (odd parity) function of two variables can be computed by the following logic program:
We follow Prolog syntax (see, for example, [Mueller & Page, 1988]), except that our convention is to write variables as strings beginning with lower-case letters and predicates as strings beginning with upper-case letters. The unary function "True" returns T if and only if the value of its argument is T. (We now think of Boolean functions and arguments as having values of T and F instead of 0 and 1.) Programs will be written in "typewriter" font.
In this chapter, we consider the matter of learning logic programs given a set of variable values for which the logic program should return T (the positive instances) and a set of variable values for which it should return F (the negative instances). The subspecialty of machine learning that deals with learning logic programs is called inductive logic programming (ILP) [Lavrac & Dzeroski, 1994]. As with any learning problem, this one can be quite complex and intractably difficult unless we constrain it with biases of some sort. In ILP, there are a variety of possible biases (called language biases). One might restrict the program to Horn clauses, not allow recursion, not allow functions, and so on.
As an example of an ILP problem, suppose we are trying to induce a function Nonstop(x,y), that is to have value T for pairs of cities connected by a non-stop air flight and F for all other pairs of cities. We are given a training set consisting of positive and negative examples. As positive examples, we might have (A,B), (A, A1), and some other pairs; as negative examples, we might have (A1, A2), and some other pairs. In ILP, we usually have additional information about the examples, called "background knowledge." In our air-flight problem, the background information might be such ground facts as Hub(A), Hub(B), Satellite(A1,A), plus others. (Hub(A) is intended to mean that the city denoted by A is a hub city, and Satellite(A1,A) is intended to mean that the city denoted by A1 is a satellite of the city denoted by A.) From these training facts, we want to induce a program Nonstop(x,y), written in terms of the background relations Hub and Satellite, that has value T for all the positive instances and has value F for all the negative instances. Depending on the exact set of examples, we might induce the program:
Nonstop(x,y) :- Hub(x), Hub(y)
Nonstop(x,y) :- Satellite(x,y)
Nonstop(x,y) :- Satellite(y,x)
which would have value T if both of the two cities were hub cities or if one were a satellite of the other. As with other learning problems, we want the induced program to generalize well; that is, if presented with arguments not represented in the training set (but for which we have the needed background knowledge), we would like the function to guess well.
7.1 Notation and Definitions
In evaluating logic programs in ILP, we implicitly append the background facts to the program and adopt the usual convention that a program has value T for a set of inputs if and only if the program interpreter returns T when actually running the program (with background facts appended) on those inputs; otherwise it has value F. Using the given background facts, the program above would return T for input (A, A1), for example.
If a logic program, π, returns T for a set of arguments X, we say that the program covers the arguments and write covers(π, X). Following our terminology introduced in connection with version spaces, we will say that a program is sufficient if it covers all of the positive instances and that it is necessary if it does not cover any of the negative instances. (That is, a program implements a sufficient condition that a training instance is positive if it covers all of the positive training instances; it implements a necessary condition if it covers none of the negative instances.) In the noiseless case, we want to induce a program that is both sufficient and necessary, in which case we will call it consistent. With imperfect (noisy) training sets, we might relax this criterion and settle for a program that covers all but some fraction of the positive instances while allowing it to cover some fraction of the negative instances. We illustrate these definitions schematically in Fig. 7.1.
[Figure 7.1: Sufficient, necessary, and consistent programs. Positive (+) and negative (−) instances are shown; π1 is a necessary program (it covers no negative instances), π2 is a sufficient program (it covers all positive instances), and π3 is a consistent program; one positive instance shown is covered by π2 and π3.]
Hub(x)
Hub(y)
Hub(z)
Satellite(x,y)
(If recursive programs are allowed, we could also add the literals Nonstop(x,z) and Nonstop(z,y).) These possibilities are among those illustrated in the refinement graph shown in Fig. 7.2. Whatever restrictions on additional literals are imposed, they are all syntactic ones from which the successors in the refinement graph are easily computed. ILP programs that follow the approach we are discussing (of specializing clauses by adding a literal) thus have well-defined methods of computing the possible literals to add to a clause.
[Figure 7.2: Part of a refinement graph. The clause Nonstop(x,y) :- is specialized by adding a literal, yielding successors such as Nonstop(x,y) :- Hub(x), Nonstop(x,y) :- Satellite(x,y), and Nonstop(x,y) :- (x = y), each of which can be specialized further.]
Initialize Ξ_cur := Ξ.
Initialize π := empty set of clauses.
repeat [The outer loop works to make π sufficient.]
  Initialize c := a clause with an empty body.
  repeat [The inner loop makes c necessary.]
    Select a literal l to add to c. [This is a nondeterministic choice point.]
    Assign c := c, l.
  until c is necessary. [That is, until c covers no negative instances in Ξ_cur.]
  Assign π := π, c. [We add the clause c to the program.]
  Assign Ξ_cur := Ξ_cur − (the positive instances in Ξ_cur covered by π).
until π is sufficient.
(The termination tests for the inner and outer loops can be relaxed as appropriate for the case of noisy instances.)
7.3 An Example
We illustrate how the algorithm works by returning to our example of airline flights. Consider the portion of an airline route map shown in Fig. 7.3. Cities A, B, and C are "hub" cities, and we know that there are nonstop flights between all hub cities (even those not shown on this portion of the route map). The other cities are "satellites" of one of the hubs, and we know that there are nonstop flights between each satellite city and its hub. The learning program is given a set of positive instances, Ξ+, of pairs of cities between which there are nonstop flights and a set of negative instances, Ξ−, of pairs of cities between which there are not nonstop flights. Ξ+ contains just the pairs:
{<A,B>, <A,C>, <B,C>, <B,A>, <C,A>, <C,B>, <A,A1>, <A,A2>, <A1,A>, <A2,A>, <B,B1>, <B,B2>, <B1,B>, <B2,B>, <C,C1>, <C,C2>, <C1,C>, <C2,C>}
For our example, we will assume that Ξ− contains all those pairs of cities shown in Fig. 7.3 that are not in Ξ+ (a type of closed-world assumption). These are:
{<A,B1>, <A,B2>, <A,C1>, <A,C2>, <B,C1>, <B,C2>, <B,A1>, <B,A2>, <C,A1>, <C,A2>, <C,B1>, <C,B2>, <B1,A>, <B2,A>, <C1,A>, <C2,A>, <C1,B>, <C2,B>, <A1,B>, <A2,B>, <A1,C>, <A2,C>, <B1,C>, <B2,C>}
There may be other cities not shown on this map, so the training set does
not necessarily exhaust all the cities.
[Figure 7.3: Part of an airline route map, showing hub cities and their satellites A1, A2, B1, B2, C1, and C2.]
Hub
{A, B, C}
All other cities mentioned in the map are assumed not in the relation Hub. We will use the notation Hub(x) to express that the city named x is in the relation Hub.
Satellite
{<A1,A>, <A2,A>, <B1,B>, <B2,B>, <C1,C>, <C2,C>}
All other pairs of cities mentioned in the map are not in the relation Satellite. We will use the notation Satellite(x,y) to express that the pair <x,y> is in the relation Satellite.
Knowing that the predicate Nonstop is a two-place predicate, the inner loop of our algorithm initializes the first clause to Nonstop(x,y) :- . This clause is not necessary because it covers all the negative examples (since it covers all examples). So we must add a literal to its (empty) body. Suppose (selecting a literal from the refinement graph) the algorithm adds Hub(x). The following negative instances are covered by Nonstop(x,y) :- Hub(x):
{<A,B1>, <A,B2>, <A,C1>, <A,C2>, <C,A1>, <C,A2>, <C,B1>, <C,B2>, <B,A1>, <B,A2>, <B,C1>, <B,C2>}
Thus, the clause is not yet necessary and another literal must be added.
Suppose we next add Hub(y). The following positive instances are covered by Nonstop(x,y) :- Hub(x), Hub(y):
{<A,B>, <A,C>, <B,C>, <B,A>, <C,A>, <C,B>}
{<A,A1>, <A,A2>, <A1,A>, <A2,A>, <B,B1>, <B,B2>, <B1,B>, <B2,B>, <C,C1>, <C,C2>, <C1,C>, <C2,C>}
These instances are removed from Ξ_cur for the next pass through the inner loop. The program now contains two clauses:
Nonstop(x,y) :- Hub(x), Hub(y)
Nonstop(x,y) :- Satellite(x,y)
This program is not yet sufficient since it does not cover the following positive instances:
{<A,A1>, <A,A2>, <B,B1>, <B,B2>, <C,C1>, <C,C2>}
During the next pass through the inner loop, we add the clause Nonstop(x,y) :- Satellite(y,x). This clause is necessary, and since the program containing all three clauses is now sufficient, the procedure terminates with:
Nonstop(x,y) :- Hub(x), Hub(y)
Nonstop(x,y) :- Satellite(x,y)
Nonstop(x,y) :- Satellite(y,x)
Since each clause is necessary, and the whole program is sufficient, the program is also consistent with all instances of the training set. Note that this program can be applied (perhaps with good generalization) to other cities besides those in our partial map, so long as we can evaluate the relations Hub and Satellite for these other cities. In the next section, we show how the technique can be extended to use recursion on the relation we are inducing. With that extension, the method can be used to induce more general logic programs.
[Figure: a reduced airline route map involving the cities B, B1, B2, B3, C, C1, C2, and C3.]
{<B1,B>, <B1,B2>, <B1,C>, <B1,C1>, <B1,C2>, <B,B1>, <B2,B1>, <C,B1>, <C1,B1>, <C2,B1>, <B2,B>, <B2,C>, <B2,C1>, <B2,C2>, <B,B2>, <C,B2>, <C1,B2>, <C2,B2>, <B,C>, <B,C1>, <B,C2>, <C,B>, <C1,B>, <C2,B>, <C,C1>, <C,C2>, <C1,C>, <C2,C>, <C1,C2>, <C2,C1>}
{<B3,B2>, <B3,B>, <B3,B1>, <B3,C>, <B3,C1>, <B3,C2>, <B3,C3>, <B2,B3>, <B,B3>, <B1,B3>, <C,B3>, <C1,B3>, <C2,B3>, <C3,B3>, <C3,B2>, <C3,B>, <C3,B1>, <C3,C>, <C3,C1>, <C3,C2>, <B2,C3>, <B,C3>, <B1,C3>, <C,C3>, <C1,C3>, <C2,C3>}
We will induce Canfly(x,y) using the extensionally defined background relation Nonstop given earlier (modified as required for our reduced airline map) and Canfly itself (recursively).
As before, we start with the empty program and proceed to the inner loop to construct a clause that is necessary. Suppose that the inner loop adds the background literal Nonstop(x,y). The clause Canfly(x,y) :- Nonstop(x,y) is necessary; it covers no negative instances. But it is not sufficient because it does not cover the following positive instances:
{<B1,B2>, <B1,C>, <B1,C1>, <B1,C2>, <B2,B1>, <C,B1>, <C1,B1>, <C2,B1>, <B2,C>, <B2,C1>, <B2,C2>, <C,B2>, <C1,B2>, <C2,B2>, <B,C1>, <B,C2>, <C1,B>, <C2,B>, <C1,C2>, <C2,C1>}
Thus, we must add another clause to the program. In the inner loop, we first create the clause Canfly(x,y) :- Nonstop(x,z) which introduces the new variable z. We digress briefly to describe how a program containing a clause with unbound variables in its body is interpreted. Suppose we try to interpret it for the positive instance Canfly(B1,B2). The interpreter attempts to establish Nonstop(B1,z) for some z. Since Nonstop(B1, B), for example, is a background fact, the interpreter returns T, which means that the instance <B1,B2> is covered. Suppose now, we attempt to interpret the clause for the negative instance Canfly(B3,B). The interpreter attempts to establish Nonstop(B3,z) for some z. There are no background facts that match, so the clause does not cover <B3,B>. Using the interpreter, we see that the clause Canfly(x,y) :- Nonstop(x,z) covers all of the positive instances not already covered by the first clause, but it also covers many negative instances such as <B2,B3> and <B,B3>. So the inner loop must add another literal. This time, suppose it adds Canfly(y,z) to yield the clause Canfly(x,y) :- Nonstop(x,z), Canfly(y,z). This clause is necessary; no negative instances are covered. The program is now sufficient and consistent; it is:
Canfly(x,y) :- Nonstop(x,y)
Canfly(x,y) :- Nonstop(x,z), Canfly(y,z)
[Figure: a decision-tree-like view of a logic program. Tests R1, R2, and R3 along the T branches lead to a node where only positive instances satisfy all three tests; the F branch leads to further tests R4 and R5, splitting Ξ into subsets Ξ1, Ξ2 = Ξ − Ξ1, Ξ3, and Ξ4 = Ξ2 − Ξ3, with leaves containing only positive or only negative instances.]
{<A,B>, <A,C>, <B,C>, <B,A>, <C,A>, <C,B>, <A,A1>, <A,A2>, <A1,A>, <A2,A>, <B,B1>, <B,B2>, <B1,B>, <B2,B>, <C,C1>, <C,C2>, <C1,C>, <C2,C>}
Ξ
Node 1
(top level)
Hub(x) T
F
Hub(y)
T
F
T
{<A,B>, <A,C>,
Node 2 F <B,C>, <B,A>,
(top level) <C,A>, <C,B>}
(Only positive instances)
Satellite(x,y)
F T
T
{<A1,A>, <A2,A>, <B1,B>,
<B2,B>, <C1,C>, <C2,C>}
F
(Only positive instances)
Node 3
(top level)
Satellite(y,x)
F T
T
{<A,A1>, <A,A2>,<B,B1>,
<B,B2>, <C,C1>, <C,C2>}
F (Only positive instances)
Chapter 8
Computational Learning Theory
In chapter one we posed the problem of guessing a function given a set of sample inputs and their values. We gave some intuitive arguments to support the claim that, after seeing only a small fraction of the possible inputs (and their values), we could guess almost correctly the values of most subsequent inputs, provided we knew that the function we were trying to guess belonged to an appropriately restricted subset of functions. That is, a given training set of sample patterns might be adequate to allow us to select a function, consistent with the labeled samples, from among a restricted set of hypotheses, such that with high probability the function we select will be approximately correct (small probability of error) on subsequent samples drawn at random according to the same distribution from which the labeled samples were drawn. This insight led to the theory of probably approximately correct (PAC) learning, initially developed by Leslie Valiant [Valiant, 1984]. We present here a brief description of the theory for the case of Boolean functions. [Dietterich, 1990], [Haussler, 1988], and [Haussler, 1990] give nice surveys of the important results.
That is,

prob[there is a bad hypothesis that classifies all m patterns correctly] ≤ K(1 − ε)^m.
8.2.2 Examples

Terms

Let H be the set of terms (conjunctions of literals). Then, |H| = 3^n, and

m ≥ (1/ε)(ln(3^n) + ln(1/δ)) ≈ (1/ε)(1.1 n + ln(1/δ))

Note that the bound on m increases only polynomially with n, 1/ε, and 1/δ.
For n = 50, ε = 0.01, and δ = 0.01, m ≥ 5,961 guarantees PAC learnability.
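As a quick check of this arithmetic (not part of the text), using the same 1.1n approximation to ln 3^n:

import math
n, eps, delta = 50, 0.01, 0.01
m = (1 / eps) * (1.1 * n + math.log(1 / delta))
print(math.ceil(m))    # 5961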
In order to show that terms are properly PAC learnable, we additionally have to show that one can find, in time polynomial in m and n, a hypothesis h consistent with a set of m patterns labeled by the value of a term. The following procedure for finding such a consistent hypothesis requires O(nm) steps (adapted from [Dietterich, 1990, page 268]):
We are given a training sequence, Ξ, of m examples. Find the first pattern, say X1, in that list that is labeled with a 1. Initialize a Boolean function, h, to the conjunction of the n literals corresponding to the values
of the n components of X1. (Components with value 1 will have corresponding positive literals; components with value 0 will have corresponding negative literals.) If there are no patterns labeled by a 1, we exit with the null concept (h ≡ 0 for all patterns). Then, for each additional pattern, Xi, that is labeled with a 1, we delete from h any Boolean variables appearing in Xi with a sign different from their sign in h. After processing all the patterns labeled with a 1, we check all of the patterns labeled with a 0 to make sure that none of them is assigned value 1 by h. If, at any stage of the algorithm, any pattern labeled with a 0 is assigned a 1 by h, then there exists no term that consistently classifies the patterns in Ξ, and we exit with failure. Otherwise, we exit with h.
As an example, consider the following patterns, all labeled with a 1 (from [Dietterich, 1990]):

(0, 1, 1, 0)
(1, 1, 1, 0)
(1, 1, 0, 0)
After processing the first pattern, we have h = ¬x1 ∧ x2 ∧ x3 ∧ ¬x4; after processing the second pattern, we have h = x2 ∧ x3 ∧ ¬x4; finally, after the third pattern, we have h = x2 ∧ ¬x4.
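A sketch of this O(nm) procedure in Python follows (the representation of a term as a set of (index, sign) literals is my own choice, not the book's):

def find_consistent_term(patterns, labels):
    """Return a term consistent with the labeled patterns, represented as a
    set of literals (i, sign), where sign True stands for the i-th variable
    and sign False for its negation; return 'null' for the all-zero concept,
    or None if no consistent term exists."""
    positives = [x for x, lab in zip(patterns, labels) if lab == 1]
    if not positives:
        return "null"                        # h(x) = 0 for every pattern
    n = len(positives[0])
    h = {(i, positives[0][i] == 1) for i in range(n)}
    for x in positives[1:]:                  # drop literals contradicted by a 1-pattern
        h = {(i, sign) for (i, sign) in h if (x[i] == 1) == sign}
    satisfies = lambda x: all((x[i] == 1) == sign for (i, sign) in h)
    if any(lab == 0 and satisfies(x) for x, lab in zip(patterns, labels)):
        return None                          # some 0-pattern is assigned 1 by h
    return h

# The worked example above: three patterns, all labeled 1.
pats = [(0, 1, 1, 0), (1, 1, 1, 0), (1, 1, 0, 0)]
print(sorted(find_consistent_term(pats, [1, 1, 1])))
# [(1, True), (3, False)]  --  i.e., h = x2 AND NOT x4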
Linearly Separable Functions

Let H be the set of all linearly separable functions. Then, |H| ≤ 2^(n²), and

m ≥ (1/ε)(n² ln 2 + ln(1/δ))

Again, note that the bound on m increases only polynomially with n, 1/ε, and 1/δ.
For n = 50, ε = 0.01, and δ = 0.01, m ≥ 173,748 guarantees PAC learnability.
To show that linearly separable functions are properly PAC learnable, we would additionally have to show that one can find, in time polynomial in m and n, a hypothesis h consistent with a set of m labeled linearly separable patterns. (Linear programming, which runs in polynomial time, can be used to find such an h.)
8.2.3 Some Properly PAC-Learnable Classes

Some properly PAC-learnable classes of functions are given in the following table. (Adapted from [Dietterich, 1990, pages 262 and 268], which also gives references to proofs of some of the time complexities.)
(Figure: a set of numbered points in the plane.)
The number of linearly separable dichotomies of m points in general position in an n-dimensional space is:

L(m, n) = 2 Σ_{i=0}^{n} C(m − 1, i)   for m > n
        = 2^m                          for m ≤ n
where C(m − 1, i) is the binomial coefficient (m − 1)! / [(m − 1 − i)! i!].
The table below shows some values of L(m, n).

m (no. of patterns)      n (dimension)
                     1     2     3     4     5
1                    2     2     2     2     2
2                    4     4     4     4     4
3                    6     8     8     8     8
4                    8    14    16    16    16
5                   10    22    30    32    32
6                   12    32    52    62    64
7                   14    44    84   114   126
8                   16    58   128   198   240
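The table can be reproduced from the counting formula; a small Python sketch (mine, not the book's):

from math import comb

def L(m, n):
    """Number of linearly separable dichotomies of m points in general
    position in n dimensions."""
    if m <= n:
        return 2 ** m
    return 2 * sum(comb(m - 1, i) for i in range(n + 1))

for m in range(1, 9):
    print(m, [L(m, n) for n in range(1, 6)])
# The m = 7 row, for instance, prints [14, 44, 84, 114, 126], matching the table.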
8.3.2 Capacity

Let P_{m,n} = L(m, n)/2^m be the probability that a randomly selected dichotomy (out of the 2^m possible dichotomies of m patterns in n dimensions) will be linearly separable. In Fig. 8.2 we plot P_{λ(n+1),n} versus λ and n, where λ = m/(n + 1).
Note how rapidly, for large n (say, n > 30), P_{m,n} falls from 1 to 0 as m goes above 2(n + 1). For m < 2(n + 1), any dichotomy of the m points is almost certainly linearly separable. But for m > 2(n + 1), a randomly selected dichotomy of the m points is almost certainly not linearly separable. For this reason m = 2(n + 1) is called the capacity of a TLU [Cover, 1965]. Unless the number of training patterns exceeds the capacity, the fact that a TLU separates those training patterns according to their labels means nothing in terms of how well that TLU will generalize to new patterns. There is nothing special about a separation found for m < 2(n + 1) patterns; almost any dichotomy of those patterns would have been linearly separable. To make sure that the separation found is forced by the training set and thus generalizes well, it has to be the case that there are very few linearly separable functions that would separate the m training patterns.
(Figure 8.2: P_{λ(n+1),n} plotted against λ = m/(n + 1) for n = 10, 20, 30, 40, and 50; the curves drop from near 1 to near 0 around λ = 2, and the drop becomes sharper as n increases.)
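The behavior plotted in Fig. 8.2 is easy to reproduce numerically; the following sketch (my own, using the counting formula for L(m, n) and the illustrative value n = 40) shows the sharp drop near λ = 2:

from math import comb

def L(m, n):
    return 2 ** m if m <= n else 2 * sum(comb(m - 1, i) for i in range(n + 1))

def P(m, n):
    """Probability that a random dichotomy of m points is linearly separable."""
    return L(m, n) / 2 ** m

n = 40
for lam in (1.0, 1.5, 2.0, 2.5, 3.0):
    m = int(lam * (n + 1))
    print(lam, round(P(m, n), 3))
# Prints values close to 1 for lambda below 2, exactly 0.5 at lambda = 2,
# and values falling toward 0 above it.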
Chapter 9
Unsupervised Learning

9.1 What is Unsupervised Learning?
Consider the various sets of points in a two-dimensional space illustrated in Fig. 9.1. The first set (a) seems naturally partitionable into two classes, while the second (b) seems difficult to partition at all, and the third (c) is problematic. Unsupervised learning uses procedures that attempt to find natural partitions of patterns. There are two stages:

Form an R-way partition of a set Ξ of unlabeled training patterns (where the value of R itself may need to be induced from the patterns). The partition separates Ξ into R mutually exclusive and exhaustive subsets, Ξ1, . . ., ΞR, called clusters.

Design a classifier based on the labels assigned to the training patterns by the partition.
We will explain shortly various methods for deciding how many clusters there should be and for separating a set of patterns into that many clusters. We can base some of these methods, and their motivation, on minimum-description-length (MDL) principles. In that setting, we assume that we want to encode a description of a set of points, Ξ, into a message of minimal length. One encoding involves a description of each point separately; other, perhaps shorter, encodings might involve a description of clusters of points together with how each point in a cluster can be described given the cluster it belongs to. The specific techniques described in this chapter do not explicitly make use of MDL principles, but the MDL method has
(Figure 9.1: three sets of points in two dimensions: (a) two clusters, (b) one cluster, (c) unclear.)
(Figure: a hierarchical partition, with Ξ1 ∪ Ξ2 ∪ Ξ3 = Ξ, Ξ11 ∪ Ξ12 = Ξ1, and Ξ31 ∪ Ξ32 = Ξ3.)
set, Ξ, to the algorithm one by one. For each pattern, Xi, presented, we find the cluster seeker, Cj, that is closest to Xi and move it closer to Xi:

Cj ← (1 − αj) Cj + αj Xi

where αj is a learning rate parameter for the j-th cluster seeker; it determines how far Cj is moved toward Xi.
Refinements on this procedure make the cluster seekers move less far as training proceeds. Suppose each cluster seeker, Cj, has a mass, mj, equal to the number of times that it has moved. As a cluster seeker's mass increases, it moves less far towards a pattern. For example, we might set αj = 1/(1 + mj) and use the above rule together with mj ← mj + 1. With this adjustment rule, a cluster seeker is always at the center of gravity (sample mean) of the set of patterns toward which it has so far moved. Intuitively, if a cluster seeker ever gets within some reasonably well clustered set of patterns (and if that cluster seeker is the only one so located), it will converge to the center of gravity of that cluster.
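A minimal sketch of this procedure (my own rendering, using the mass-based learning rate αj = 1/(1 + mj)):

import numpy as np

def cluster_seek(patterns, seekers):
    """Present each pattern once, moving the nearest cluster seeker toward it."""
    C = [np.asarray(c, dtype=float) for c in seekers]
    mass = [0] * len(C)
    for x in patterns:
        x = np.asarray(x, dtype=float)
        j = min(range(len(C)), key=lambda k: np.linalg.norm(x - C[k]))
        alpha = 1.0 / (1 + mass[j])               # alpha_j = 1/(1 + m_j)
        C[j] = (1 - alpha) * C[j] + alpha * x     # C_j <- (1 - alpha) C_j + alpha X_i
        mass[j] += 1
    return C

# Two well-separated groups; each seeker ends at the sample mean of the
# group of patterns that moved it.
data = [(0, 0), (0.1, 0.2), (-0.1, 0.1), (5, 5), (5.2, 4.9), (4.8, 5.1)]
print(cluster_seek(data, seekers=[(0, 1), (4, 4)]))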
Once the cluster seekers have converged, the classifier implied by the now-labeled patterns in Ξ can be based on a Voronoi partitioning of the
space (based on distances to the various cluster seekers). This kind of classification, an example of which is shown in Fig. 9.4, can be implemented by a linear machine. (Georgy Fedoseevich Voronoi was a Russian mathematician who lived from 1868 to 1909.)
(Figure 9.4: separating boundaries forming a Voronoi partition around the cluster seekers C1, C2, and C3.)
can compute the sample statistics p(xi | Ck), which give probability values for each component given the class assigned to it by the partitioning. Suppose each component xi of X can take on the values vij, where the index j steps over the domain of that component. We use the notation pi(vij | Ck) = probability(xi = vij | Ck).
Suppose we use the following probabilistic guessing rule about the values of the components of a vector X, given only that it is in class k: guess that xi = vij with probability pi(vij | Ck). Then the probability that we guess the i-th component correctly is:

Σ_j probability(guess is vij) pi(vij | Ck) = Σ_j [pi(vij | Ck)]²

The average number of (the n) components whose values are guessed correctly by this method is then given by the sum of these probabilities over all of the components of X:

Σ_i Σ_j [pi(vij | Ck)]²
p3(v31 = 1 | C1) = 1
Summing over the values of the components (0 and 1) gives (1)² + (0)² = 1 for component 1, (1/2)² + (1/2)² = 1/2 for component 2, and (1)² + (0)² = 1 for component 3. Summing over the three components gives 2 1/2 for class 1. A similar calculation also gives 2 1/2 for class 2. Averaging over the two clusters also gives 2 1/2. Finally, dividing by the number of clusters produces the final Z value of this partition, Z(P2) = 1 1/4, not quite as high as Z(P1).
Similar calculations yield Z(P3) = 1 and Z(P4) = 3/4, so this method of evaluating partitions would favor placing all patterns in a single cluster.
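The arithmetic above can be reproduced directly from the conditional probabilities. The sketch below assumes, as the example does, that Z is the per-cluster sum Σ_i Σ_j [pi(vij | Ck)]², averaged over the clusters and then divided by the number of clusters:

def guessing_score(component_probs):
    """Sum over components of sum_j p_i(v_ij | C_k)^2 for one cluster."""
    return sum(sum(p * p for p in probs) for probs in component_probs)

# For partition P2, each of the two clusters has per-component value
# probabilities (1, 0), (1/2, 1/2), and (1, 0), as computed above.
cluster = [(1.0, 0.0), (0.5, 0.5), (1.0, 0.0)]
scores = [guessing_score(cluster), guessing_score(cluster)]   # 2.5 for each class
R = len(scores)                                               # number of clusters
Z = (sum(scores) / R) / R                                     # average, then divide by R
print(Z)                                                      # 1.25, i.e., Z(P2) = 1 1/4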
(Figure: patterns a, b, c, and d plotted in (x1, x2, x3) space.)
The successors of the root node will contain mutually exclusive and exhaustive subsets of Ξ. In general, the successors of a node, ν, are labeled by mutually exclusive and exhaustive subsets of the pattern set labelling node ν. The tips of the tree will contain singleton sets. The method uses Z values to place patterns at the various nodes; sample statistics are used to update the Z values whenever a pattern is placed at a node. The algorithm is as follows:
1. We start with a tree whose root node contains all of the patterns in Ξ and a single empty successor node. We arrange that at all times during the process every non-empty node in the tree has (besides any other successors) exactly one empty successor.

2. Select a pattern Xi in Ξ (if there are no more patterns to select, terminate).
3. Set ν to the root node.

4. For each of the successors of ν (including the empty successor!), calculate the best host for Xi. A best host is determined by tentatively placing Xi in one of the successors and calculating the resulting Z value for each one of these ways of accommodating Xi. The best host corresponds to the assignment with the highest Z value.

5. If the best host is an empty node, η, we place Xi in η, generate an empty successor node of η, generate an empty sibling node of η, and go to 2.

6. If the best host is a non-empty, singleton (tip) node, η, we place Xi in η, create one successor node of η containing the singleton pattern that was in η, create another successor node of η containing Xi, create an empty successor node of η, create empty successor nodes of the new non-empty successors of η, and go to 2.

7. If the best host is a non-empty, non-singleton node, η, we place Xi in η, set ν to η, and go to 4.
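The placement loop of steps 1 through 7 can be sketched compactly in Python. The z_value function below reproduces the partition measure computed in the previous section (per-cluster sum of squared attribute-value probabilities, averaged over the non-empty clusters and divided by their number); it is a stand-in sufficient to make the sketch runnable, and any of the partition-quality measures discussed in this chapter could be substituted.

class Node:
    def __init__(self, patterns=None):
        self.patterns = list(patterns or [])   # patterns placed at this node
        self.children = []                     # successor nodes

def z_value(children):
    clusters = [ch.patterns for ch in children if ch.patterns]
    if not clusters:
        return 0.0
    scores = []
    for pats in clusters:
        n = len(pats[0])
        s = 0.0
        for i in range(n):
            vals = [p[i] for p in pats]
            s += sum((vals.count(v) / len(vals)) ** 2 for v in set(vals))
        scores.append(s)
    return (sum(scores) / len(scores)) / len(clusters)   # average, then divide by R

def insert(root, x):
    node = root                                     # step 3: start at the root
    while True:
        node.patterns.append(x)
        if not node.children:                       # keep one empty successor (step 1)
            node.children.append(Node())
        best, best_score = None, None               # step 4: find the best host
        for ch in node.children:
            ch.patterns.append(x)                   # tentative placement
            s = z_value(node.children)
            ch.patterns.pop()
            if best_score is None or s > best_score:
                best, best_score = ch, s
        if not best.patterns:                       # step 5: best host is empty
            best.patterns.append(x)
            best.children.append(Node())
            node.children.append(Node())
            return
        if len(best.patterns) == 1:                 # step 6: best host is a singleton tip
            old = best.patterns[0]
            best.patterns.append(x)
            best.children = [Node([old]), Node([x]), Node()]
            best.children[0].children.append(Node())
            best.children[1].children.append(Node())
            return
        node = best                                 # step 7: descend and repeat

root = Node()
for p in [(0, 1, 1), (0, 1, 1), (1, 0, 0), (1, 0, 1)]:
    insert(root, p)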
This process is rather sensitive to the order in which patterns are pre-
sented. To make the final classification tree less order dependent, the COB-
WEB procedure incorporates node merging and splitting.
Node merging:
It may happen that two nodes having the same parent could be merged
with an overall increase in the quality of the resulting classification per-
formed by the successors of that parent. Rather than try all pairs to merge,
a good heuristic is to attempt to merge the two best hosts. When such a
merging improves the Z value, a new node containing the union of the pat-
terns in the merged nodes replaces the merged nodes, and the two nodes
that were merged are installed as successors of the new node.
Node splitting:
A heuristic for node splitting is to consider replacing the best host
among a group of siblings by that host's successors. This operation is
performed only if it increases the Z value of the classification performed by
a group of siblings.
(Figure: a classification tree over soybean diseases: the root, N0, has successors N1 (Diaporthe Stem Canker), N2 (Charcoal Rot), and N3, which in turn has successors N31 (Rhizoctonia Rot) and N32 (Phytophthora Rot).)
Chapter 10
Temporal-Difference Learning

10.1 Temporal Patterns and Prediction Problems
In this chapter, we consider problems in which we wish to learn to pre-
dict the future value of some quantity, say z, from an n-dimensional input
pattern, X. In many of these problems, the patterns occur in temporal
sequence, X1, X2, . . ., Xi, Xi+1, . . ., Xm, and are generated by a dynam-
ical process. The components of Xi are features whose values are available
at time, t = i. We distinguish two kinds of prediction problems. In one,
we desire to predict the value of z at time t = i + 1 based on input Xi for
every i. For example, we might wish to predict some aspects of tomorrow's
weather based on a set of measurements made today. In the other kind
of prediction problem, we desire to make a sequence of predictions about
the value of z at some fixed time, say t = m + 1, based on each of the Xi,
i = 1, . . ., m. For example, we might wish to make a series of predictions
about some aspect of the weather on next New Year's Day, based on mea-
surements taken every day before New Year's. Sutton [Sutton, 1988] has
called this latter problem, multi-step prediction, and that is the problem we
consider here. In multi-step prediction, we might expect that the prediction
accuracy should get better and better as i increases toward m.
10.2 Supervised and Temporal-Difference Methods
A training method that naturally suggests itself is to use the actual value of z at time m + 1 (once it is known) in a supervised learning procedure using a sequence of training patterns, {X1, X2, . . ., Xi, Xi+1, . . ., Xm}. That is, we seek to learn a function, f, such that f(Xi) is as close as possible to z for each i. Typically, we would need a training set, Ξ, consisting of several such sequences. We will show that a method that is better than supervised learning for some important problems is to base learning on the difference between f(Xi+1) and f(Xi) rather than on the difference between z and f(Xi). Such methods involve what is called temporal-difference (TD) learning.
We assume that our prediction, f(X), depends on a vector of modifiable weights, W. To make that dependence explicit, we write f(X, W). For supervised learning, we consider procedures of the following type: For each Xi, the prediction f(Xi, W) is computed and compared to z, and the learning rule (whatever it is) computes the change, (ΔW)i, to be made to W. Then, taking into account the weight changes for each pattern in a sequence all at once, after having made all of the predictions with the old weight vector, we change W as follows:

W ← W + Σ_{i=1}^{m} (ΔW)i
Whenever we are attempting to minimize the squared error between z and f(Xi, W) by gradient descent, the weight-changing rule for each pattern is:

(ΔW)i = c (z − fi) ∂fi/∂W

where c is a learning rate parameter, fi is our prediction of z, f(Xi, W), at time t = i, and ∂fi/∂W is, by definition, the vector of partial derivatives of fi with respect to the weights. Now note that:

(z − fi) = Σ_{k=i}^{m} (fk+1 − fk)

where we define fm+1 = z. Substituting in our formula for (ΔW)i yields:

(ΔW)i = c (∂fi/∂W) Σ_{k=i}^{m} (fk+1 − fk)
In this form, instead of using the difference between a prediction and the value of z, we use the differences between successive predictions, hence the phrase temporal-difference (TD) learning.
In the case when f(X, W) = X · W, the temporal-difference form of the Widrow-Hoff rule is:

(ΔW)i = c Xi Σ_{k=i}^{m} (fk+1 − fk)
One reason for writing (ΔW)i in temporal-difference form is to permit an interesting generalization as follows:

(ΔW)i = c (∂fi/∂W) Σ_{k=i}^{m} λ^(k−i) (fk+1 − fk)

where 0 ≤ λ ≤ 1. Here, the λ term gives exponentially decreasing weight to differences later in time than t = i. When λ = 1, we have the same rule with which we began, weighting all differences equally, but as λ → 0, we weight only the (fi+1 − fi) difference. With the λ term, the method is called TD(λ).
It is interesting to compare the two extreme cases. For TD(1), the differences telescope to (z − fi) and the rule is just the original supervised rule. For TD(0), only the most recent difference matters:

(ΔW)i = c (fi+1 − fi) ∂fi/∂W

10.3 Incremental Computation of the (ΔW)i

Over a whole sequence, the TD(λ) weight change is:
W ← W + Σ_{i=1}^{m} c (∂fi/∂W) Σ_{k=i}^{m} λ^(k−i) (fk+1 − fk)

Interchanging the order of the summations yields:

W ← W + Σ_{k=1}^{m} c Σ_{i=1}^{k} λ^(k−i) (fk+1 − fk) ∂fi/∂W
  = W + Σ_{k=1}^{m} c (fk+1 − fk) Σ_{i=1}^{k} λ^(k−i) ∂fi/∂W
Interchanging the indices k and i finally yields:

W ← W + Σ_{i=1}^{m} c (fi+1 − fi) Σ_{k=1}^{i} λ^(i−k) ∂fk/∂W

If, as earlier, we want to use an expression of the form W ← W + Σ_{i=1}^{m} (ΔW)i, we see that we can write:

(ΔW)i = c (fi+1 − fi) Σ_{k=1}^{i} λ^(i−k) ∂fk/∂W
Now, if we let ei = Σ_{k=1}^{i} λ^(i−k) ∂fk/∂W, we can develop a computationally efficient recurrence for it:

ei+1 = ∂fi+1/∂W + λ ei

Rewriting (ΔW)i in these terms, we obtain:

(ΔW)i = c (fi+1 − fi) ei

where:

e1 = ∂f1/∂W
e2 = ∂f2/∂W + λ e1
etc.
Quoting Sutton [Sutton, 1988, page 15] (about a different equation, but the quote applies equally well to this one):

". . . this equation can be computed incrementally, because each (ΔW)i depends only on a pair of successive predictions and on the [weighted] sum of all past values for ∂f/∂W. This saves . . ."
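For a linear predictor f(X, W) = X · W the gradient ∂fi/∂W is just Xi, and the incremental computation above takes only a few lines of code. The following sketch is my own (the values of c and λ are illustrative); it accumulates the weight change over one sequence and applies it at the end, as in the text:

import numpy as np

def td_lambda_update(W, sequence, z, c=0.1, lam=0.7):
    """Process the sequence X1..Xm with outcome z and return the new weights
    W + sum_i (Delta W)_i, where (Delta W)_i = c (f_{i+1} - f_i) e_i and e_i
    is the decaying sum of past gradients."""
    W = np.asarray(W, dtype=float)
    preds = [float(np.dot(x, W)) for x in sequence] + [float(z)]   # f_{m+1} = z
    e = np.zeros_like(W)                                # eligibility trace
    dW = np.zeros_like(W)
    for i, x in enumerate(sequence):
        e = lam * e + np.asarray(x, dtype=float)        # e_i = df_i/dW + lambda e_{i-1}
        dW += c * (preds[i + 1] - preds[i]) * e         # (Delta W)_i
    return W + dW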
(Figure: Sutton's bounded random-walk example. The five nonterminal states B, C, D, E, and F are represented by the unit basis vectors XB, XC, XD, XE, and XF; walks terminate on the left with z = 0 and on the right with z = 1. Typical sequences: XD XC XD XE XF with outcome 1; XD XC XB XC XD XE XD XE XF with outcome 1; XD XE XD XC XB with outcome 0.)
(Figure: error, using the best value of c, plotted against λ; the Widrow-Hoff procedure corresponds to TD(1). Adapted from [Sutton, 1988, page 20].)
W ← W + Σ_{i=1}^{m} c (fi+1 − fi) Σ_{k=1}^{i} λ^(i−k) ∂fk/∂W

where the weight update occurs after an entire sequence is observed. To make the method truly incremental (in analogy with weight updating rules for neural nets), it would be desirable to change the weight vector after every pattern presentation. The obvious extension is:

ΔWt = c (dt+1 − dt) Σ_{k=1}^{t} λ^(t−k) ∂dk/∂W

where Wt is a vector of all weights in the network at time t, and ∂dk/∂W is the gradient of the network's output, dk, with respect to those weights.
(Figure: a backgammon network: 198 inputs encoding the board (for example, the number of white pieces on each cell, the number on the bar and off the board, and who moves), up to 40 hidden units, and 4 output units giving estimated probabilities p1 = pr(white wins), p2 = pr(white gammons), p3, and p4; the estimated payoff is d = p1 + 2p2 − p3 − 2p4.)
Chapter 11
Delayed-Reinforcement Learning

11.1 The General Problem
Imagine a robot that exists in an environment in which it can sense and
act. Suppose (as an extreme case) that it has no idea about the effects
of its actions. That is, it doesn't know how acting will change its sensory
inputs. Along with its sensory inputs are "rewards," which it occasionally
receives. How should it choose its actions so as to maximize its rewards
over the long run? To maximize rewards, it will need to be able to predict
how actions change inputs, and in particular, how actions lead to rewards.
We formalize the problem in the following way: The robot exists in an
environment consisting of a set, S , of states. We assume that the robot's
sensory apparatus constructs an input vector, X, from the environment,
which informs the robot about which state the environment is in. For
the moment, we will assume that the mapping from states to vectors is
one-to-one, and, in fact, will use the notation X to refer to the state of
the environment as well as to the input vector. When presented with an
input vector, the robot decides which action from a set, A, of actions to
perform. Performing the action produces an effect on the environment,
moving it to a new state. The new state results in the robot perceiving
a new input vector, and the cycle repeats. We assume a discrete time model; the input vector at time t = i is Xi, and the action taken at that time is ai. The reward received at that time, ri, depends on the action taken and on the state, that is, ri = r(Xi, ai). The learner's goal is to
find a policy, π(X), that maps input vectors to actions in such a way that future rewards are maximized.
(Figure: the learner and its environment: the environment supplies the state Xi and the reward ri; the learner responds with the action ai.)
11.2 An Example

A "grid world," such as the one shown in Fig. 11.2, is often used to illustrate reinforcement learning. Imagine a robot initially in cell (2,3). The robot receives an input vector (x1, x2) telling it what cell it is in; it is capable of four actions, n, e, s, w, moving the robot one cell up, right, down, or left, respectively. It is rewarded one negative unit whenever it bumps into the wall or into the blocked cells. For example, if the input to the robot is (1,3), and the robot chooses action w, the next input to the robot is still (1,3) and it receives a reward of −1. If the robot lands in the cell marked G (for goal), it receives a reward of +10. Let's suppose that whenever the robot lands in the goal cell and gets its reward, it is immediately transported out to some random cell, and the quest for reward continues.
A policy for our robot is a specification of what action to take for every one of its inputs, that is, for every one of the cells in the grid. For example,
(Figure 11.2: a grid world with columns numbered 1 to 7 and rows numbered 1 to 8; the robot R is in cell (2,3) and the goal cell G is in row 7.)
into actions, and let ri^π(X) be the reward that will be received on the i-th time step after one begins executing policy π starting in state X. Then the total reward accumulated over all time steps by policy π beginning in state X is:

V^π(X) = Σ_{i=0}^{∞} γ^i ri^π(X)
(Figure: the grid world again, showing the goal cell G and the robot R.)
One reason for using a temporal discount factor is so that the above sum will be finite. An optimal policy is one that maximizes V^π(X) for all inputs, X.
In general, we want to consider the case in which the rewards, ri, are random variables and the state transitions occur with probabilities p[Xj | Xi, a]. Then, we will want to maximize expected future reward and define:

V^π(X) = E[ Σ_{i=0}^{∞} γ^i ri^π(X) ]
V^π(X) = r[X, π(X)] + γ Σ_{X'} p[X' | X, π(X)] V^π(X')

In other words, the value of state X under policy π is the expected value of the immediate reward plus the discounted value, under π, of the next state. For an optimal policy we have the following "optimality equation:"

V(X) = max_a [ r(X, a) + γ Σ_{X'} p[X' | X, a] V(X') ]
least one optimal policy; assuming that we know the average rewards, r(X, a), and the transition probabilities, p[X' | X, a], it is given by:

π(X) = arg max_a [ r(X, a) + γ Σ_{X'} p[X' | X, a] V(X') ]
But, of course, we are assuming that we do not know these average rewards nor the transition probabilities, so we have to find a method that effectively learns them.
If we had a model of actions, that is, if we knew for every state, X, and action, a, which state, X', resulted, then we could use a method called value iteration.
On the i-th step of the process, suppose we are at state Xi (that is, our input on the i-th step is Xi), and that the estimated value of state Xi on the i-th step is V̂i(Xi). We then select that action, a, that maximizes the estimated value of the resulting next state, and adjust our estimate as follows:

V̂i(X) = (1 − ci) V̂i−1(X) + ci [ri + γ V̂i−1(X'i)]   if X = Xi,
       = V̂i−1(X)                                     otherwise.
We see that this adjustment moves the value of V̂i(Xi) an increment closer to ri + γ V̂i−1(X'i). If the latter quantity is a good estimate for V(X'i), then the adjustment helps to make the two estimates consistent. Provided that every state is visited infinitely often, this process of value iteration will converge to the optimal values.

11.4 Q-Learning

Watkins [Watkins, 1989] has proposed a technique that he calls incremental
dynamic programming. Let (a; π) stand for the policy that chooses action a once, and thereafter chooses actions according to policy π. We define:

Q^π(X, a) = V^(a;π)(X)
Then the optimal value from state X is given by:

V(X) = max_a Q^π(X, a)

This equation holds only for an optimal policy, π. The optimal policy is given by:

π(X) = arg max_a Q^π(X, a)
Q^π(X, a) = r(X, a) + γ E[V^π(X')]

where r(X, a) is the average value of the immediate reward received when action a is taken in state X. For an optimal policy we have:

Q(X, a) = r(X, a) + γ E[ max_{a'} Q(X', a') ]
for all actions, a, and states, X. Now, if we had the optimal Q values (for all a and X), an optimal policy could be obtained directly. That is,

π(X) = arg max_a [ r(X, a) + γ E[ max_{a'} Q(X', a') ] ]
We quote (with minor notational changes) from [Watkins & Dayan, 1992, page 281]:

"In Q-learning, the agent's experience consists of a sequence of . . . [episodes; in the i-th episode, the agent adjusts its Q values according] to:

Qi(X, a) = (1 − ci) Qi−1(X, a) + ci [ri + γ Vi−1(X'i)]
   if X = Xi and a = ai,
Qi(X, a) = Qi−1(X, a)
   otherwise,
where

Vi−1(X') = max_b [Qi−1(X', b)]
With each episode, only the Q value corresponding to the state just exited and the action just taken is adjusted. And that Q value is adjusted so that it is closer (by an amount determined by ci) to the sum of the immediate reward plus the discounted maximum (over all actions) of the Q values of the state just entered. We can write the adjustment in abbreviated form as:

Q(X, a) ← r + γ V(X')

where Q(X, a) is the new Q value for input X and action a, r is the immediate reward when action a is taken in response to input X, V(X') is the maximum (over all actions) of the Q values of the state next reached when action a is taken from state X, and c is the fraction of the way toward which the new Q value, Q(X, a), is adjusted to equal r + γ V(X').
Watkins and Dayan [Watkins & Dayan, 1992] prove that, under certain conditions, the Q values computed by this learning procedure converge to optimal ones (that is, to ones on which an optimal policy can be based).
We define ni(X, a) as the index (episode number) of the i-th time that action a is tried in state X. Then we have the following result.

Theorem 11.1 (Watkins and Dayan) For Markov problems with states {X} and actions {a}, and given bounded rewards |rn| ≤ R, learning rates 0 ≤ cn < 1, and

Σ_{i=0}^{∞} c_{ni(X,a)} = ∞   and   Σ_{i=0}^{∞} [c_{ni(X,a)}]² < ∞ ,

the Qi(X, a) values converge, with probability 1, to the optimal Q values as i → ∞.
are very well described in [Barto, Bradtke, & Singh, 1994]. Q-learning is easily implemented with a table of Q values for all state-action pairs. In the grid world that we described earlier, such a table would not be excessively large. We might start with random entries in the table; a portion of such an initial table might be as follows:
X        a    Q(X, a)    r(X, a)
(2,3)    w    7          0
(2,3)    n    4          0
(2,3)    e    3          0
(2,3)    s    6          0
(1,3)    w    4          -1
(1,3)    n    5          0
(1,3)    e    2          0
(1,3)    s    4          0
Suppose the robot is in cell (2,3). The maximum Q value occurs for a = w, so the robot moves to cell (1,3) and receives the immediate reward (which was 0 in this case). With a learning rate parameter c = 0.5 and γ = 0.9, the value of Q((2,3), w) is adjusted from 7 to 5.75. No other changes are made to the table at this episode. The reader might try this learning procedure on the grid world with a simple computer program. Notice that an optimal policy might not be discovered if some cells are not visited, or some actions not tried, frequently enough.
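Such a program is only a few lines long. The sketch below (mine, not the book's) reproduces the single adjustment worked out above, with c = 0.5 and γ = 0.9:

def q_update(Q, state, action, reward, next_state, c=0.5, gamma=0.9):
    """Move Q(state, action) a fraction c of the way toward
    reward + gamma * max_a' Q(next_state, a')."""
    best_next = max(Q[next_state].values())
    Q[state][action] = (1 - c) * Q[state][action] + c * (reward + gamma * best_next)

# The fragment of the initial table shown above:
Q = {(2, 3): {"w": 7, "n": 4, "e": 3, "s": 6},
     (1, 3): {"w": 4, "n": 5, "e": 2, "s": 4}}

# In cell (2,3) the largest Q value is for action w; taking it moves the
# robot to cell (1,3) with reward 0.
q_update(Q, (2, 3), "w", reward=0, next_state=(1, 3))
print(Q[(2, 3)]["w"])      # 5.75, as in the text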
The learning problem faced by the agent is to associate specific actions with specific input patterns. Q-learning gradually reinforces those actions that contribute to reward: the effects of rewards gradually propagate back from states producing rewards toward all of the other states that the agent frequently visits. With random Q values to begin, the agent's actions amount to a random walk through its space of states. Only when this random walk happens to stumble into rewarding states does learning begin to produce Q values that are useful; and even then, the Q values have to work their way outward from these rewarding states.
We should note that the learning problem faced by our grid-world robot could be modified to have several places in the grid that give positive rewards. This possibility presents an interesting way to generalize the classical notion of a "goal" in AI planning systems, even in those that do no learning. Instead of representing a goal as a condition to be achieved, we represent a "goal structure" as a set of rewards to be given for achieving various conditions. Then, the generalized "goal" becomes maximizing
discounted future reward instead of simply achieving some particular con-
dition. This generalization can be made to encompass so-called goals of
maintenance and goals of avoidance. The example presented above in-
cluded avoiding bumping into the grid-world boundary. A goal of mainte-
nance, of a particular state, could be expressed in terms of a reward that
was earned whenever the agent was in that state and performed an action
that transitioned back to that state in one step.
example, we might first find that action prescribed by the Q values and . . . One method that suggests itself is to use a neural network. For example, consider the simple linear machine shown in Fig. 11.4.
(Figure 11.4: a linear machine for computing Q values: the input X feeds R summation units with trainable weight vectors Wi, one per action, each computing Q(ai, X) = X · Wi.)
The Q values (as a function of the input pattern X and the action ai) are computed as dot products of weight vectors (one for each action) and the input vector. Weight adjustments are made according to a TD(0) procedure to bring the Q value for the action last selected closer to the sum of the immediate reward (if any) and the (discounted) maximum Q value of the state next reached. When the required Q functions are more complex than those that can be computed by a linear machine, a layered neural network might be used. Sigmoid units in the final layer would compute Q values in the range 0 to 1. The TD(0) method for updating the Q values can still be used . . . a small discount factor makes the agent too greedy for present rewards and indifferent to the future, but using a large one slows down learning.
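A sketch of the linear machine of Fig. 11.4 with a TD(0)-style weight adjustment follows; the learning rate c and discount γ below are illustrative parameters of my own, not values from the text:

import numpy as np

def q_values(W, x):
    """One weight vector per action; Q(a, X) = X . W_a."""
    return {a: float(np.dot(w, x)) for a, w in W.items()}

def td0_update(W, x, a, reward, x_next, c=0.1, gamma=0.9):
    """Move Q(a, x) toward reward + gamma * max_a' Q(a', x_next)."""
    x = np.asarray(x, dtype=float)
    target = reward + gamma * max(q_values(W, x_next).values())
    error = target - float(np.dot(W[a], x))
    W[a] = W[a] + c * error * x          # adjust only the weights of action a

# Example with two actions and three input features:
W = {"a1": np.zeros(3), "a2": np.zeros(3)}
td0_update(W, x=[1.0, 0.0, 1.0], a="a1", reward=1.0, x_next=[0.0, 1.0, 0.0])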
Chapter 12
Explanation-Based Learning
Round(Obj5), then we could logically conclude Round(Obj5). Making this conclusion and saving it is an instance of deductive learning, a topic we study in this chapter.
Suppose that some logical proposition, φ, logically follows from some set of facts, Δ. Under what circumstances might we say that the process of deducing φ from Δ results in our learning φ? In a sense, we implicitly knew φ all along, since it was inherent in knowing Δ. Yet, φ might not be obvious given Δ, and the deduction process to establish φ might have been arduous. Rather than have to deduce φ again, we might want to save it, perhaps along with its deduction, in case it is needed later. Shouldn't that process count as learning? Dietterich [Dietterich, 1990] has called this type of learning speed-up learning.
Strictly speaking, speed-up learning does not result in a system being
able to make decisions that, in principle, could not have been made before
the learning took place. Speed-up learning simply makes it possible to make
those decisions more efficiently. But, in practice, this type of learning might
make possible certain decisions that might otherwise have been infeasible.
To take an extreme case, a chess player can be said to learn chess even
though optimal play is inherent in the rules of chess. On the surface, there
seems to be no real difference between the experience-based hypotheses
that a chess player makes about what constitutes good play and the kind
of learning we have been studying so far.
As another example, suppose we are given some theorems about geom-
etry and are asked to prove that the sum of the angles of a right triangle
is 180 degrees. Let us further suppose that the proof we constructed did
not depend on the given triangle being a right triangle; in that case we
can learn a more general fact. The learning technique that we are going to
study next is related to this example. It is called explanation-based learning
(EBL). EBL can be thought of as a process in which implicit knowledge is
converted into explicit knowledge.
In EBL, we specialize parts of a domain theory to explain a particular
example, then we generalize the explanation to produce another element of
the domain theory that will be useful on similar examples. This process is
illustrated in Fig. 12.1.
(Figure 12.1: the EBL process: a domain theory is specialized, in a possibly complex proof process, to prove that a particular example X is P; the resulting explanation (proof) is then generalized to yield the conclusion that Y is P.)
hypothesis set from which we choose functions). The learning methods are
successful only if the hypothesis set is appropriate for the problem. Typi-
cally, the smaller the hypothesis set (that is, the more a priori information
we have about the function being sought), the less dependent we are on
information being supplied by a training set (that is, fewer samples). A
priori information about a problem can be expressed in several ways. The
methods we have studied so far restrict the hypotheses in a rather direct
way. A less direct method involves making assertions in a logical language
about the property we are trying to learn. A set of such assertions is usually
called a "domain theory."
Suppose, for example, that we wanted to classify people according to
whether or not they were good credit risks. We might represent a person
by a set of properties (income, marital status, type of employment, etc.), assemble such data about people who are known to be good and bad credit risks, and train a classifier to make decisions. Or, we might go to a loan officer of a bank, ask him or her what sorts of things s/he looks for in making a decision about a loan, encode this knowledge into a set of rules for an expert system, and then use the expert system to make decisions. The knowledge used by the loan officer might have originated as a set of "policies" (the domain theory), but perhaps the application of these policies was specialized and made more efficient through experience with the special cases of loans made in his or her district.
12.3 An Example
To make our discussion more concrete, let's consider the following fanciful
example. We want to find a way to classify robots as "robust" or not. The attributes that we use to represent a robot might include some that are relevant to this decision and some that are not.
Suppose we have a domain theory of logical sentences that, taken together, help to define whether or not a robot can be classified as robust. (The same domain theory may be useful for several other purposes also, but among other things, it describes the concept "robust.")
In this example, let's suppose that our domain theory includes the sen-
tences:
Fixes(u, u) ⊃ Robust(u)
(An individual that can fix itself is robust.)

Sees(x, y) ∧ Habile(x) ⊃ Fixes(x, y)
(A habile individual that can see another entity can fix that entity.)

Robot(w) ⊃ Sees(w, w)
(All robots can see themselves.)

R2D2(x) ⊃ Habile(x)
(R2D2-class individuals are habile.)

C3PO(x) ⊃ Habile(x)
(C3PO-class individuals are habile.)
. . .
(By convention, variables are assumed to be universally quantified.) We
could use theorem-proving methods operating on this domain theory to
conclude whether certain robots are robust. These methods might be com-
putationally quite expensive because extensive search may have to be per-
formed to derive a conclusion. But after having found a proof for some
particular robot, we might be able to derive some new sentence whose use
allows a much faster conclusion.
We next show how such a new rule might be derived in this example.
Suppose we are given a number of facts about Num5, such as:
Robot(Num5)
R2D2(Num5)
Age(Num5, 5)
Manufacturer(Num5, GR)
. . .
We are also told that Robust(Num5) is true, but we nevertheless attempt
to find a proof of that assertion using these facts about Num5 and the
domain theory. The facts about Num5 correspond to the features that we
might use to represent Num5. In this example, not all of them are relevant
to a decision about Robust(Num5). The relevant ones are those used or
needed in proving Robust(Num5) using the domain theory. The proof tree
in Fig. 12.2 is one that a typical theorem-proving system might produce.
In the language of EBL, this proof is an explanation for the fact
Robust(Num5). We see from this explanation that the only facts about
Num5 that were used were Robot(Num5) and R2D2(Num5). In fact, we
could construct the following rule from this explanation:
Robot(Num5) ∧ R2D2(Num5) ⊃ Robust(Num5)
The explanation has allowed us to prune some attributes about Num5 that are irrelevant (at least for deciding Robust(Num5)). This type of pruning is the first sense in which an explanation is used to generalize the classification problem. ([DeJong & Mooney, 1986] call this aspect of explanation-based learning feature elimination.)
(Figure 12.2: the proof tree (explanation) of Robust(Num5): Robust(Num5) follows from Fixes(Num5, Num5), which follows from Sees(Num5, Num5) and Habile(Num5); these in turn follow from the facts Robot(Num5) and R2D2(Num5) via Robot(w) ⊃ Sees(w, w) and R2D2(x) ⊃ Habile(x).)
But the rule we extracted from the expla-
nation applies only to Num5. There might be little value in learning that
rule since it is so specific. Can it be generalized so that it can be applied
to other individuals as well?
Examination of the proof shows that the same proof structure, using
the same sentences from the domain theory, could be used independently
of whether we are talking about Num5 or some other individual. We can
generalize the proof by a process that replaces constants in the tip nodes of the proof tree with variables and works upward, using unification to constrain the values of variables as needed to obtain a proof.
In this example, we replace Robot(Num5) by Robot(r) and R2D2(Num5) by R2D2(s) and redo the proof, using the explanation proof as a template. Note that we use different values for the two different occurrences of Num5
at the tip nodes. Doing so sometimes results in more general, but never-
theless valid rules. We now apply the rules used in the proof in the forward
direction, keeping track of the substitutions imposed by the most general
unifiers used in the proof. (Note that we always substitute terms that are already in the tree for variables in rules.) This process results in the generalized proof tree shown in Fig. 12.3. Note that the occurrence of Sees(r, r) as a node in the tree forces the unification of x with y in the domain rule Sees(x, y) ∧ Habile(y) ⊃ Fixes(x, y). The substitutions are then applied to the variables in the tip nodes and the root node to yield the general rule: Robot(r) ∧ R2D2(r) ⊃ Robust(r).
(Figure 12.3: the generalized proof tree: Robust(r) follows from Fixes(r, r), which follows from Sees(r, r) and Habile(s); these follow from Robot(r), with substitution {r/w} in Robot(w) ⊃ Sees(w, w), and from R2D2(s), with substitution {s/x} in R2D2(x) ⊃ Habile(x).)
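The feature-elimination and variablization steps can be pictured in a few lines of code. The sketch below is my own and deliberately does not redo the proof with unification, which the full method requires; it simply takes the leaf facts of the explanation and the proved goal and replaces the example's constants with variables:

def generalize_rule(leaf_facts, goal, constants_to_vars):
    """Form a rule body from the explanation's leaves and generalize it by
    substituting variables for the example's constants."""
    def gen(literal):
        pred, args = literal
        return (pred, tuple(constants_to_vars.get(a, a) for a in args))
    return [gen(f) for f in leaf_facts], gen(goal)

body, head = generalize_rule(
    leaf_facts=[("Robot", ("Num5",)), ("R2D2", ("Num5",))],
    goal=("Robust", ("Num5",)),
    constants_to_vars={"Num5": "r"},
)
print(body, head)
# [('Robot', ('r',)), ('R2D2', ('r',))] ('Robust', ('r',))
# i.e., Robot(r) AND R2D2(r) implies Robust(r), the rule derived above.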
12.7 Applications
There have been several applications of EBL methods. We mention two
here, namely the formation of macro-operators in automatic plan generation
and learning how to control search.
12.7.1 Macro-Operators in Planning
In automatic planning systems, efficiency can sometimes be enhanced by chaining together a sequence of operators into macro-operators. We show an example of a process for creating macro-operators based on techniques explored by [Fikes, et al., 1972].
Referring to Fig. 12.4, consider the problem of finding a plan for a robot in room R1 to fetch a box, B1, by going to an adjacent room, R2, and pushing it back to R1. The goal for the robot is INROOM(B1, R1), and the facts that are true in the initial state are listed in the figure.
(Figure 12.4: rooms R1, R2, and R3, connected by doors D1 and D2, with the box B1 in room R2. Initial state: INROOM(ROBOT, R1), INROOM(B1, R2), CONNECTS(D1, R1, R2), CONNECTS(D1, R2, R1), . . .)
The preconditions of this specific plan are:

INROOM(ROBOT, R1)
CONNECTS(D1, R1, R2)
CONNECTS(D1, R2, R1)
INROOM(B1, R2)
Saving this specific plan, valid only for the specific constants it mentions, would not be as useful as would be saving a more general one. We first generalize these preconditions by substituting variables for constants. We then follow the structure of the specific plan to produce the generalized plan shown in Fig. 12.6, which achieves INROOM(b1, r4). Note that the generalized plan does not require pushing the box back to the place where the robot started. The preconditions for the generalized plan are:

INROOM(ROBOT, r1)
CONNECTS(d1, r1, r2)
CONNECTS(d2, r2, r4)
INROOM(b1, r2)
Another related technique that chains together sequences of oper-
ators to form more general ones is the chunking mechanism in Soar
[Laird, et al., 1986].
(Figure 12.5: the specific plan: GOTHRU(D1, R1, R2) followed by PUSHTHRU(B1, D1, R2, R1) achieves INROOM(B1, R1) from the preconditions INROOM(ROBOT, R1), CONNECTS(D1, R1, R2), CONNECTS(D1, R2, R1), and INROOM(B1, R2).)
(Figure 12.6: the generalized plan, whose final operator is PUSHTHRU(b1, d2, r2, r4), achieving INROOM(b1, r4); the conditions holding before and after each step of the plan are shown.)