AI Lecture Notes
(Autonomous)
Dundigal, Hyderabad - 500 043
COURSE DESCRIPTION
Program: B.Tech
Semester: Sixth
Theory / Practical: 3 0 3 3 1.5
COURSE OBJECTIVES:
COURSE OUTCOMES:
CO 4 Apply: Make use of search techniques for problem reduction and analysis.
CO 5 Understand: Demonstrate first-order logic for proving logical consequences through valid formulae.
CO 6 Understand: Explain rule-based knowledge representation for natural representation of knowledge.
CO 7 Apply: Choose a search implementation technique for obtaining an optimized solution.
CO 8 Apply: Apply Bayes' theorem for determining the probability of various data given a hypothesis.
CO 9 Remember: Relate the computation of a planning system for building example data.
CO 10 Apply: Develop expert systems for decision making and task completion.
CO 11 Apply: Apply knowledge acquisition techniques for extracting, structuring and organising knowledge.
CO 12 Understand: Compare AI with human intelligence and traditional information processing, and learn its strengths and limitations.
UNIT – I (MODULE – I)
INTRODUCTION TO AI AND KNOWLEDGE REPRESENTATION
Definition of AI
What exactly is artificial intelligence? Although most attempts to define complex and widely
used terms precisely are exercises in futility, it is useful to draw at least an approximate boundary around
the concept to provide a perspective on the discussion that follows. To do this, we propose the following
by no means universally accepted definition. Artificial intelligence (AI) is the study of how to make
computers do things which, at the moment, people do better. This definition is, of course, somewhat
ephemeral because of its reference to the current state of computer science. And it fails to include some
areas of potentially very large impact, namely problems that cannot now be solved well by either
computers or people. But it provides a good outline of what constitutes artificial intelligence, and it
avoids the philosophical issues that dominate attempts to define the meaning of either artificial or
intelligence. Interestingly, though, it suggests a similarity with philosophy at the same time it is avoiding
it. Philosophy has always been the study of those branches of knowledge that were so poorly understood
that they had not yet become separate disciplines in their own right. As fields such as mathematics or
physics became more advanced, they broke off from philosophy. Perhaps if AI succeeds it can reduce
itself to the empty set. To date this has not happened. There are signs which seem to suggest that the
newer off-shoots of AI together with their real-world applications are gradually overshadowing it. As AI
migrates to the real world we do not seem to be satisfied with just a computer playing a chess game.
Instead we wish a robot would sit opposite to us as an opponent, visualize the real board and make the
right moves in this physical world. Such notions seem to push the definitions of AI to a greater extent. As
we read on, there will be always that lurking feeling that the definitions propounded so far are not
adequate. Only what we finally achieve in the future will help us propound an apt definition for AI! The
feeling of intelligence is a mirage: if you achieve it, it ceases to make you feel so. As somebody has aptly
put it, AI is Artificial Intelligence till it is achieved; after which the acronym reduces to Already
Implemented.
One must also appreciate the fact that comprehending the concept of AI also aids us in
understanding how natural intelligence works. Though a complete comprehension of its working may
remain a mirage, the very attempt will definitely assist in unfolding mysteries one by one.
The AI Problems
What then are some of the problems contained within AI? Much of the early work in the field
focused on formal tasks, such as game playing and theorem proving. Samuel wrote a checkers-playing
program that not only played games with opponents but also used its experience at those games to
improve its later performance. Chess also received a good deal of attention. The Logic Theorist was an
early attempt to prove mathematical theorems. It was able to prove several theorems from the first chapter
of Whitehead and Russell's Principia Mathematica. Gelernter's theorem prover explored another area of
mathematics: geometry. Game playing and theorem proving share the property that people who do them
well are considered to be displaying intelligence. Despite this, it appeared initially that computers could
perform well at those tasks simply by being fast at exploring a large number of solution paths and then
selecting the best one. It was thought that this process required very little knowledge and could therefore
be programmed easily. As we will see later, this assumption turned out to be false, since no computer is
fast enough to overcome the combinatorial explosion generated by most problems.
Another early foray into AI focused on the sort of problem solving that we do every day when we
decide how to get to work in the morning, often called commonsense reasoning. It includes reasoning
about physical objects and their relationships to each other (e.g., an object can be in only one place at a
time), as well as reasoning about actions and their consequences (e.g., if you let go of something, it will
fall to the floor and maybe break). To investigate this sort of reasoning, Newell, Shaw, and Simon built
the General Problem Solver (GPS), which they applied to several commonsense tasks as well as to the
problem of performing symbolic manipulations of logical expressions. Again, no attempt was made to
create a program with a large amount of knowledge about a particular problem domain. Only simple tasks
were selected.
As AI research progressed and techniques for handling larger amounts of world knowledge were
developed, some progress was made on the tasks just described and new tasks could reasonably be
attempted. These include perception (vision and speech), natural language understanding, and problem
solving in specialized domains such as medical diagnosis and chemical analysis.
Perception of the world around us is crucial to our survival. Animals with much less intelligence
than people are capable of more sophisticated visual perception than are current machines. Perceptual
tasks are difficult because they involve analog (rather than digital) signals; the signals are typically very
noisy and usually a large number of things (some of which may be partially obscuring others) must be
perceived at once.
The ability to use language to communicate a wide variety of ideas is perhaps the most important
thing that separates humans from the other animals. The problem of understanding spoken language is a
perceptual problem and is hard to solve for the reasons just discussed. But suppose we simplify the
problem by restricting it to written language. This problem, usually referred to as natural language
understanding, is still extremely difficult. In order to understand sentences about a topic, it is necessary to
know not only a lot about the language itself (its vocabulary and grammar) but also a good deal about the
topic so that unstated assumptions can be recognized.
In addition to these mundane tasks, many people can also perform one or maybe more specialized
tasks in which carefully acquired expertise is necessary. Examples of such tasks include engineering
design, scientific discovery, medical diagnosis, and financial planning. Programs that can solve problems
in these domains also fall under the aegis of artificial intelligence.
A person who knows how to perform tasks from several of the categories listed below
learns the necessary skills in a standard order. First, perceptual, linguistic, and commonsense skills are
learned. Later (and of course for some people, never) expert skills such as engineering, medicine, or
finance are acquired. It might seem to make sense then that the earlier skills are easier and thus more
amenable to computerized duplication than are the later, more specialized ones. For this reason, much of
the initial AI work was concentrated in those early areas. But it turns out that this naive assumption is not
right. Although expert skills require knowledge that many of us do not have, they often require much less
knowledge than do the more mundane skills and that knowledge is usually easier to represent and deal
with inside programs.
Following is a list of the tasks that are the targets of work in AI:
Mundane Tasks
• Perception
o Vision
o Speech
• Natural language
o Understanding
o Generation
o Translation
• Commonsense reasoning
• Robot control
Formal Tasks
• Games
o Chess
o Backgammon
o Checkers
• Mathematics
o Geometry
o Logic
o Integral calculus
o Proving properties of programs
Expert Tasks
• Engineering
o Design
o Fault finding
o Manufacturing planning
• Scientific analysis
• Medical diagnosis
• Financial analysis
As a result, the problem areas where AI is now flourishing most as a practical discipline (as
opposed to a purely research one) are primarily the domains that require only specialized expertise
without the assistance of commonsense knowledge. There are now thousands of programs called expert
systems in day-to-day operation throughout all areas of industry and government. Each of these systems
attempts to solve part, or perhaps all, of a practical, significant problem that previously required scarce
human expertise.
At the heart of research in artificial intelligence lies what Newell and Simon [1976] call the physical
symbol system hypothesis. They define a physical symbol system as follows:
A physical symbol system consists of a set of entities, called symbols, which are physical patterns
that can occur as components of another type of entity called an expression (or symbol structure). Thus, a
symbol structure is composed of a number of instances (or tokens) of symbols related in some physical
way (such as one token being next to another). At any instant of time the system will contain a collection
of these symbol structures. Besides these structures, the system also contains a collection of processes that
operate on expressions to produce other expressions: processes of creation, modification, reproduction
and destruction. A physical symbol system is a machine that produces through time an evolving
collection of symbol structures. Such a system exists in a world of objects wider than just these symbolic
expressions themselves.
They then state the hypothesis as
The Physical Symbol System Hypothesis. A physical symbol system has the necessary and
sufficient means for general intelligent action.
This hypothesis is only a hypothesis. There appears to be no way to prove or disprove it on
logical grounds. So it must be subjected to empirical validation. We may find that it is false. We may find
that the bulk of the evidence says that it is true. But the only way to determine its truth is by
experimentation.
Computers provide the perfect medium for this experimentation since they can be programmed to
simulate any physical symbol system we like. This ability of computers to serve as arbitrary symbol
manipulators was noticed very early in the history of computing. Lady Lovelace made the following
observation about Babbage’s proposed Analytical Engine in 1842.
The operating mechanism can even be thrown into action independently of any object to operate
upon (although of course no result could then be developed). Again, it might act upon other things besides
numbers, were objects found whose mutual fundamental relations could be expressed by those of the
abstract science of operations, and which should be also susceptible of adaptations to the action of the
operating notation and mechanism of the engine. Supposing, for instance, that the fundamental relations
of pitched sounds in the science of harmony and of musical composition were susceptible of such
expression and adaptations, the engine might compose elaborate and scientific pieces of music of any
degree of complexity or extent. [Lovelace, 1961]
As it has become increasingly easy to build computing machines, so it has become increasingly
possible to conduct empirical investigations of the physical symbol system hypothesis. In each such
investigation, a particular task that might be regarded as requiring intelligence is selected. A program to
perform the task is proposed and then tested. Although we have not been completely successful at
creating programs that perform all the selected tasks, most scientists believe that many of the
problems that have been encountered will ultimately prove to be surmountable by more sophisticated
programs than we have yet produced.
Evidence in support of the physical symbol system hypothesis has come not only from areas such
as game playing, where one might most expect to find it, but also from areas such as visual perception,
where it is more tempting to suspect the influence of subsymbolic processes. However, subsymbolic
models (for example, neural networks) are beginning to challenge symbolic ones at such low-level tasks.
Such models are discussed in Chapter 18. Whether certain subsymbolic models conflict with the physical
symbol system hypothesis is a topic still under debate (e.g., Smolensky [1988]). And it is important to
note that even the success of subsymbolic systems is not necessarily evidence against the hypothesis. It is
often possible to accomplish a task in more than one way.
One interesting attempt to reduce a particularly human activity, the understanding of jokes, to a
process of symbol manipulation is provided in the book Mathematics and Humor [Paulos, 1980]. It is, of
course, possible that the hypothesis will turn out to be only partially true. Perhaps physical symbol
systems will prove able to model some aspects of human intelligence and not others. Only time and effort
will tell.
The importance of the physical symbol system hypothesis is twofold. It is a significant theory of
the nature of human intelligence and so is of great interest to psychologists. It also forms the basis of the
belief that it is possible to build programs that can perform intelligent tasks now performed by people.
Our major concern here is with the latter of these implications, although as we will soon see, the two
issues are not unrelated.
AI Techniques
Artificial intelligence problems span a very broad spectrum. They appear to have very little in common
except that they are hard. Are there any techniques that are appropriate for the solution of a variety of
these problems? The answer to this question is yes, there are. What, then, if anything, can we say about
those techniques besides the fact that they manipulate symbols? How could we tell if those techniques
might be useful in solving other problems, perhaps ones not traditionally regarded as AI tasks? The rest of
this unit is an attempt to answer those questions in detail. But before we begin examining closely the
individual techniques, it is enlightening to take a broad look at them to see what properties they ought to
possess.
One of the few hard and fast results to come out of the first three decades of AI research is that
intelligence requires knowledge. To compensate for its one overpowering asset, indispensability,
knowledge possesses some less desirable properties, including:
• It is voluminous.
• It is hard to characterize accurately.
• It is constantly changing.
• It differs from data by being organized in a way that corresponds to the ways it will be used.
So where does this leave us in our attempt to define AI techniques? We are forced to conclude that an AI
technique is a method that exploits knowledge that should be represented in such a way that:
• The knowledge captures generalizations. In other words, it is not necessary to represent separately each
individual situation. Instead, situations that share important properties are grouped together. If knowledge
does not have this property, inordinate amounts of memory and updating will be required. So we usually
call something without this property “data” rather than knowledge.
• It can be understood by people who must provide it. Although for many programs the bulk of the data
can be acquired automatically (for example, by taking readings from a variety of instruments), in many AI
domains, most of the knowledge a program has must ultimately be provided by people in terms they
understand.
• It can easily be modified to correct errors and to reflect changes in the world and in our world view.
• It can be used in a great many situations even if it is not totally accurate or complete.
• It can be used to help overcome its own sheer bulk by helping to narrow the range of possibilities that
must usually be considered.
Although AI techniques must be designed in keeping with these constraints imposed by AI problems,
there is some degree of independence between problems and problem-solving techniques. It is possible to
solve AI problems without using AI techniques (although, as we suggested above, those solutions are not
likely to be very good). And it is possible to apply AI techniques to the solution of non-AI problems. This
is likely to be a good thing to do for problems that possess many of the same characteristics as do AI
problems. In order to try to characterize AI techniques in as problem-independent a way as possible, let’s
look at two very different problems and a series of approaches for solving each of them.
1. Tic-Tac-Toe
In this section, we present a series of three programs to play tic-tac-toe. The programs in this series
increase in:
• Their complexity
• Their use of generalizations
• The clarity of their knowledge
• The extensibility of their approach. Thus, they move toward being representations of what we call AI
techniques.
Program 1
Data Structures
Board - A nine-element vector representing the board, where the elements of the vector correspond to
the board positions, numbered 1 through 9 row by row from top-left to bottom-right.
An element contains the value 0 if the corresponding square is blank, 1 if it is filled with an X, or 2 if it is
filled with an O.
Movetable - A large vector of 19,683 (3^9) elements, each element of which is a nine-element vector.
The contents of this vector are chosen specifically to allow the algorithm to work.
The Algorithm
To make a move, do the following:
1. View the vector Board as a ternary (base three) number. Convert it to a decimal number.
2. Use the number computed in step 1 as an index into Movetable and access the vector stored there.
3. The vector selected in step 2 represents the way the board will look after the move that should be
made. So set Board equal to that vector.
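The three steps above can be made concrete in a short Python sketch. This is only an illustration of Program 1's table-lookup idea: the movetable below is a hypothetical, almost empty dictionary with a single hand-filled entry, whereas a real movetable would need an entry for every reachable position.

# Sketch of Program 1: tic-tac-toe by pure table lookup (illustrative only).
# Board: nine-element list, 0 = blank, 1 = X, 2 = O.

def board_to_index(board):
    """Step 1: view the board as a ternary (base three) number and convert it to decimal."""
    index = 0
    for square in board:
        index = index * 3 + square
    return index

# Hypothetical movetable: decimal board index -> successor board after our move.
# Only one entry is filled in here; a complete table would cover 19,683 indices.
MOVETABLE = {
    0: [1, 0, 0, 0, 0, 0, 0, 0, 0],   # empty board -> X takes the first square
}

def make_move(board):
    """Steps 2 and 3: index the movetable and adopt the stored successor board."""
    return MOVETABLE[board_to_index(board)]

print(make_move([0] * 9))   # [1, 0, 0, 0, 0, 0, 0, 0, 0]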
Comments
This program is very efficient in terms of time. And, in theory, it could play an optimal game of
tic-tac-toe. But it has several disadvantages:
• It takes a lot of space to store the table that specifies the correct move to make from each board position.
• Someone will have to do a lot of work specifying all the entries in the movetable.
• It is very unlikely that all the required movetable entries can be determined and entered without any
errors.
• If we want to extend the game, say to three dimensions, we would have to start from scratch, and in fact
this technique would no longer work at all, since 3^27 board positions would have to be stored, thus
overwhelming present computer memories.
The technique embodied in this program does not appear to meet any of our requirements for a good AI
technique. Let's see if we can do better.
The second class of programs that attempt to model human performance comprises those that do things
that fall more clearly within our definition of AI tasks; they do things that are not trivial for the computer. There
are several reasons one might want to model human performance at these sorts of tasks:
1. To test psychological theories of human performance. One example of a program that was written for
this reason is PARRY [Colby, 1975], which exploited a model of human paranoid behavior to simulate
the conversational behavior of a paranoid person. The model was good enough that when several
psychologists were given the opportunity to converse with the program via a terminal, they diagnosed its
behavior as paranoid.
2. To enable computers to understand human reasoning. For example, for a computer to be able to read a
newspaper story and then answer a question, such as “Why did the terrorists kill the hostages?” its
program must be able to simulate the reasoning processes of people.
3. To enable people to understand computer reasoning. In many circumstances, people are reluctant to
rely on the output of a computer unless they can understand how the machine arrived at its result. If the
computer’s reasoning process is similar to that of people, then producing an acceptable explanation is
much easier.
4. To exploit what knowledge we can glean from people. Since people are the best-known performers of
most of the tasks with which we are dealing, it makes a lot of sense to look to them for clues as to how to
proceed.
This last motivation is probably the most pervasive of the four. It motivated several very early
systems that attempted to produce intelligent behavior by imitating people at the level of individual
neurons. For examples of this, see the early theoretical work of McCulloch and Pitts [1943], the work
on perceptrons, originally developed by Frank Rosenblatt but best described in Perceptrons [Minsky and
Papert, 1969] and Design for a Brain [Ashby, 1952]. It proved impossible, however, to produce even
minimally intelligent behavior with such simple devices. One reason was that there were severe
theoretical limitations to the particular neural-net architecture that was being used. More recently, several
new neural net architectures have been proposed. These structures are not subject to the same theoretical
limitations as were perceptrons. These new architectures are loosely called connectionist, and they have
been used as a basis for several learning and problem-solving programs. We have more to say about them
in Chapter 18. Also, we must consider that while human brains are highly parallel devices, most current
computing systems are essentially serial engines. A highly successful parallel technique may be
computationally intractable on a serial computer. But recently, partly because of the existence of the new
family of parallel cognitive models, as well as because of the general promise of parallel computing, there
is now substantial interest in the design of massively parallel machines to support AI programs.
Human cognitive theories have also influenced AI to look for higher-level (i.e., far above the
neuron level) theories that do not require massive parallelism for their implementation. An early example
of this approach can be seen in GPS, which is discussed in more detail in Section 3.6. This same
approach can also be seen in much current work in natural language understanding. The failure of
straightforward syntactic parsing mechanisms to make much of a dent in the problem of interpreting
English sentences has led many people who are interested in natural language understanding by
machine to look seriously for inspiration at what little we know about how people interpret language.
And when people who are trying to build programs to analyze pictures discover that a filter function they
have developed is very similar to what we think people use, they take heart that perhaps they are on the
right track.
As you can see, this last motivation pervades a great many areas of AI research. In fact it, in
conjunction with the other motivations we mentioned, tends to make the distinction between the goal of
simulating human performance and the goal of building an intelligent program any way we can seem
much less different than it at first appeared. In either case, what we really need is a good model of
the processes involved in intelligent reasoning. The field of cognitive science, in which psychologists,
linguists, and computer scientists all work together, has as its goal the discovery of such a model. For a
good survey of the variety of approaches contained within the field, see Norman [1981], Anderson
[1985], and Gardner [1985].
Criteria for Success
One of the most important questions to answer in any scientific or engineering research project is
“How will we know if we have succeeded?” Artificial intelligence is no exception. How will we know if
we have constructed a machine that is intelligent? That question is at least as hard as the unanswerable
question “What is intelligence?” But can we do anything to measure our progress?
In 1950, Alan Turing proposed the following method for determining whether a machine can
think. His method has since become known as the Turing Test. To conduct this test, we need two people
and the machine to be evaluated. One person plays the role of the interrogator, who is in a separate room
from the computer and the other person. The interrogator can ask questions of either the person or the
computer by typing questions and receiving typed responses. However, the interrogator knows them only
as A and B and aims to determine which is the person and which is the machine. The goal of the
machine is to fool the interrogator into believing that it is the person. If the machine succeeds at this, then
we will conclude that the machine can think. The machine is allowed to do whatever it can to fool the
interrogator. So, for example, if asked the question “How much is 12,324 times 73,981?” it could wait
several minutes and then respond with the wrong answer [Turing, 1963].
The more serious issue, though, is the amount of knowledge that a machine would need to pass
the Turing test. Turing gives the following example of the sort of dialogue a machine would have to be
capable of:
Interrogator: In the first line of your sonnet which reads “Shall I compare thee to a
summer’s day,” would not “a spring day” do as well or better?
A: It wouldn’t scan.
Interrogator: How about “a winter’s day.” That would scan all right.
A: Yes, but nobody wants to be compared to a winter’s day.
Interrogator: Would you say Mr. Pickwick reminded you of Christmas?
A: In a way.
Interrogator: Yet Christmas is a winter’s day, and I do not think Mr. Pickwick would mind
the comparison.
A: I don’t think you're serious. By a winter’s day one means a typical winter’s day,
rather than a special one like Christmas.
It will be a long time before a computer passes the Turing test. Some people believe none ever
will. But suppose we are willing to settle for less than a complete imitation of a person. Can we measure
the achievement of AI in more restricted domains?
Often the answer to this question is yes. Sometimes it is possible to get a fairly precise measure of
the achievement of a program. For example, a program can acquire a chess rating in the same way as a
human player. The rating is based on the ratings of players whom the program can beat. Already
programs have acquired chess ratings higher than the vast majority of human players. For other problem
domains, a less precise measure of a program’s achievement is possible. For example, DENDRAL is a
program that analyzes organic compounds to determine their structure. It is hard to get a precise measure
of DENDRAL’s level of achievement compared to human chemists, but it has produced analyses that
have been published as original research results. Thus it is certainly performing competently.
In other technical domains, it is possible to compare the time it takes for a program to complete a
task to the time required by a person to do the same thing. For example, there are several programs in use
by computer companies to configure particular systems to customers’ needs (of which the pioneer was a
program called R1). These programs typically require minutes to perform tasks that previously required
hours of a skilled engineer’s time. Such programs are usually evaluated by looking at the bottom line—
whether they save (or make) money.
For many everyday tasks, though, it may be even harder to measure a program’s performance.
Suppose, for example, we ask a program to paraphrase a newspaper story. For problems such as this, the
best test is usually just whether the program responded in a way that a person could have.
If our goal in writing a program is to simulate human performance at a task, then the measure of
success is the extent to which the program’s behavior corresponds to that performance, as measured by
various kinds of experiments and protocol analyses. In this we do not simply want a program that does as
well as possible. We want one that fails when people do. Various techniques developed by psychologists
for comparing individuals and for testing models can be used to do this analysis.
We are forced to conclude that the question of whether a machine has intelligence or can think is
too nebulous to answer precisely. But it is often possible to construct a computer program that meets
some performance standard for a particular task. That does not mean that the program does the task in the
best possible way. It means only that we understand at least one way of doing at least part of a task. When
we set out to design an AI program, we should attempt to specify as well as possible the criteria for
success for that particular program functioning in its restricted domain. For the moment, that is the best
we can do.
The importance of AI
Game Playing
You can buy machines that can play master level chess for a few hundred dollars. There is
some AI in them, but they play well against people mainly through brute force computation,
looking at hundreds of thousands of positions. To beat a world champion by brute force and
known reliable heuristics requires being able to look at 200 million positions per second.
Speech Recognition
In the 1990s, computer speech recognition reached a practical level for limited purposes. Thus
United Airlines has replaced its keyboard tree for flight information by a system using speech
recognition of flight numbers and city names. It is quite convenient. On the other hand, while it is
possible to instruct some computers using speech, most users have gone back to the keyboard and the
mouse as still more convenient.
Understanding Natural Language
Just getting a sequence of words into a computer is not enough. Parsing sentences is not enough either.
The computer has to be provided with an understanding of the domain the text is about, and this is
presently possible only for very limited domains.
Computer Vision
The world is composed of three-dimensional objects, but the inputs to the human eye and computers'
TV cameras are two dimensional. Some useful programs can work solely in two dimensions, but full
computer vision requires partial three-dimensional information that is not just a set of two-dimensional
views. At present there are only limited ways of representing three-dimensional information directly,
and they are not as good as what humans evidently use.
Expert Systems
A “knowledge engineer” interviews experts in a certain domain and tries to embody their knowledge in
a computer program for carrying out some task. How well this works depends on whether the
intellectual mechanisms required for the task are within the present state of AI. When this turned out not
to be so, there were many disappointing results. One of the first expert systems was MYCIN in 1974,
which diagnosed bacterial infections of the blood and suggested treatments. It did better than medical
students or practicing doctors, provided its limitations were observed. Namely, its ontology
included bacteria, symptoms, and treatments and did not include patients, doctors, hospitals, death,
recovery, and events occurring in time. Its interactions depended on a single patient being considered.
Since the experts consulted by the knowledge engineers knew about patients, doctors, death, recovery,
etc., it is clear that the knowledge engineers forced what the experts told them into a predetermined
framework. The usefulness of current expert systems depends on their users having common sense.
Heuristic Classification
One of the most feasible kinds of expert system given the present knowledge of AI is to put some
information in one of a fixed set of categories using several sources of information. An example is
advising whether to accept a proposed credit card purchase. Information is available about the
owner of the credit card, his record of payment and also about the item he is buying and about the
establishment from which he is buying it (e.g., about whether there have been previous credit card
frauds at this establishment).
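A minimal Python sketch of heuristic classification in this setting follows. All feature names, rules, and thresholds are invented for illustration; the point is only that several sources of information are combined to place the purchase in one of a fixed set of categories.

# Heuristic classification sketch: combine several information sources and
# assign one of a fixed set of categories. Rules and thresholds are invented.

def classify_purchase(payment_record, item_price, merchant_fraud_count):
    score = 0
    if payment_record == "good":
        score += 2
    elif payment_record == "delinquent":
        score -= 3
    if item_price > 1000:              # large purchases demand stronger evidence
        score -= 1
    if merchant_fraud_count > 5:       # merchant with a history of card fraud
        score -= 2

    if score >= 2:
        return "accept"
    if score >= 0:
        return "refer to a human"
    return "reject"

print(classify_purchase("good", 250, 0))          # accept
print(classify_purchase("delinquent", 1500, 7))   # reject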
The applications of AI
Consumer Marketing
o Have you ever used any kind of credit/ATM/store card while shopping?
o If so, you have very likely been “input” to an AI algorithm
o All of this information is recorded digitally
o Companies like Nielsen gather this information weekly and search for patterns
– general changes in consumer behavior
– tracking responses to new products
– identifying customer segments: targeted marketing, e.g., they find out that consumers with
sports cars who buy textbooks respond well to offers of new credit cards.
o Algorithms (“data mining”) search data for patterns based on mathematical theories of learning
Identification Technologies
o ID cards e.g., ATM cards
o can be a nuisance and security risk: cards can be lost, stolen, passwords forgotten, etc
o Biometric Identification, walk up to a locked door
– Camera
– Fingerprint device
– Microphone
– Computer uses biometric signature for identification
– Face, eyes, fingerprints, voice pattern
– This works by comparing data from person at door with stored library
– Learning algorithms can learn the matching process by analyzing a large library database
off-line, can improve its performance.
Intrusion Detection
o Computer security - we each have specific patterns of computer use: times of day, lengths of
sessions, commands used, sequence of commands, etc.
– would like to learn the “signature” of each authorized user
– can identify non-authorized users
o How can the program automatically identify users?
– record user’s commands and time intervals
– characterize the patterns for each user
– model the variability in these patterns
– classify (online) any new user by similarity to stored patterns
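The steps listed above might be sketched as follows. The "signature" here is simply the set of commands a user has been observed to type, and similarity is measured by plain overlap; both choices, and the 0.5 threshold, are simplifying assumptions made only for illustration.

# Intrusion-detection sketch: learn a per-user signature from recorded
# sessions, then classify a new session by similarity to stored signatures.

def learn_signature(sessions):
    """Characterize a user's pattern as the set of commands they have used."""
    commands = set()
    for session in sessions:
        commands.update(session)
    return commands

def similarity(signature, new_session):
    """Fraction of the new session's commands that fit the signature."""
    new_commands = set(new_session)
    return len(signature & new_commands) / max(len(new_commands), 1)

stored = {"alice": learn_signature([["ls", "vim", "gcc"], ["vim", "make"]]),
          "bob":   learn_signature([["mail", "top"]])}

session = ["ls", "vim", "make"]
best = max(stored, key=lambda user: similarity(stored[user], session))
if similarity(stored[best], session) < 0.5:
    print("possible intruder")
else:
    print("session looks like", best)   # session looks like alice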
Machine Translation
o Language problems in international business
– e.g., at a meeting of Japanese, Korean, Vietnamese and Swedish investors, no common
language
– If you are shipping your software manuals to 127 countries, the
solution is to hire translators to translate
– would be much cheaper if a machine could do this!
o How hard is automated translation?
– very difficult!
– e.g., English to Russian
– not only must the words be translated, but their meaning also!
Early works in AI
“Artificial Intelligence (AI) is the part of computer science concerned with designing intelligent
computer systems, that is, systems that exhibit characteristics we associate with intelligence in human
behaviour – understanding language, learning, reasoning, solving problems, and so on.”
Scientific Goal To determine which ideas about knowledge representation, learning, rule systems,
search, and so on, explain various sorts of real intelligence.
Engineering Goal To solve real world problems using AI techniques such as knowledge representation,
learning, rule systems, search, and so on.
Traditionally, computer scientists and engineers have been more interested in the engineering goal,
while psychologists, philosophers and cognitive scientists have been more interested in the scientific
goal.
The Roots - Artificial Intelligence has identifiable roots in a number of older disciplines, particularly:
Philosophy
Logic/Mathematics
Computation
Psychology/Cognitive Science
Biology/Neuroscience
Evolution
There is inevitably much overlap, e.g. between philosophy and logic, or between mathematics and
computation. By looking at each of these in turn, we can gain a better understanding of their role in AI,
and how these underlying disciplines have developed to play that role.
Philosophy
~400 BC Socrates asks for an algorithm to distinguish piety from non-piety.
~350 BC Aristotle formulated different styles of deductive reasoning, which could mechanically
generate conclusions from initial premises, e.g. Modus Ponens: if A → B and A, then B
(if A implies B and A is true, then B is true; for example, if “when it’s raining you get wet” and
“it’s raining”, then “you get wet”). A small code sketch of this rule appears at the end of this subsection.
1596 – 1650 Rene Descartes idea of mind-body dualism – part of the mind is exempt from
physical laws.
1646 – 1716 Gottfried Wilhelm Leibniz was one of the first to take the materialist position which holds
that the mind operates by ordinary physical processes – this has the implication that mental
processes can potentially be carried out by machines.
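The mechanical character of Modus Ponens noted above can be illustrated in a few lines: the sketch below repeatedly applies "if A implies B and A holds, conclude B" to a tiny set of facts and rules. The second rule is an invented extension of the rain example, added only to show chaining.

# Modus Ponens sketch: repeatedly apply "if A -> B and A, then B".
facts = {"it is raining"}
rules = [("it is raining", "you get wet"),
         ("you get wet", "your clothes need drying")]   # invented extra rule

changed = True
while changed:                 # keep applying the rule until nothing new follows
    changed = False
    for antecedent, consequent in rules:
        if antecedent in facts and consequent not in facts:
            facts.add(consequent)
            changed = True

print(facts)   # includes "you get wet" and "your clothes need drying"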
Logic/Mathematics
Earl Stanhope’s Logic Demonstrator was a machine that was able to solve syllogisms, numerical
problems in a logical form, and elementary questions of probability.
1815 – 1864 George Boole introduced his formal language for making logical inference in 1847 –
Boolean algebra.
1848 – 1925 Gottlob Frege produced a logic that is essentially the first-order logic that today
forms the most basic knowledge representation system.
1906 – 1978 Kurt Gödel showed in 1931 that there are limits to what logic can do. His
Incompleteness Theorem showed that in any formal logic powerful enough to describe the
properties of natural numbers, there are true statements whose truth cannot be established by any
algorithm.
1995 Roger Penrose tries to prove the human mind has non-computable capabilities.
Computation
1869 William Jevons’ Logic Machine could handle Boolean Algebra and Venn Diagrams, and
was able to solve logical problems faster than human beings.
1912 – 1954 Alan Turing tried to characterise exactly which functions are capable of being
computed. Unfortunately it is difficult to give the notion of computation a formal definition.
However, the Church-Turing thesis, which states that a Turing machine is capable of computing
any computable function, is generally accepted as providing a sufficient definition. Turing also
showed that there were some functions which no Turing machine can compute (e.g. Halting
Problem).
1903 – 1957 John von Neumann proposed the von Neumann architecture which allows a
description of computation that is independent of the particular realisation of the computer.
1960s Two important concepts emerged: Intractability (when solution time grows at least
exponentially) and Reduction (to ‘easier’ problems).
Psychology / Cognitive Science
Modern Psychology / Cognitive Psychology / Cognitive Science is the science which studies how
the mind operates, how we behave, and how our brains process information.
Language is an important part of human intelligence. Much of the early work on knowledge
representation was tied to language and informed by research into linguistics.
It is natural for us to try to use our understanding of how human (and other animal) brains lead to
intelligent behavior in our quest to build artificial intelligent systems. Conversely, it makes sense
to explore the properties of artificial systems (computer models/simulations) to test our
hypotheses concerning human systems.
Many sub-fields of AI are simultaneously building models of how the human system operates,
and artificial systems for solving real world problems, and are allowing useful ideas to transfer
between them.
Biology / Neuroscience
Our brains (which give rise to our intelligence) are made up of tens of billions of neurons, each
connected to hundreds or thousands of other neurons.
Each neuron is a simple processing device (e.g. just firing or not firing depending on the total
amount of activity feeding into it). However, large networks of neurons are extremely powerful
computational devices that can learn how best to operate.
The field of Connectionism or Neural Networks attempts to build artificial systems based
on simplified networks of simplified artificial neurons.
The aim is to build powerful AI systems, as well as models of various human abilities.
Neural networks work at a sub-symbolic level, whereas much of conscious human reasoning
appears to operate at a symbolic level.
Artificial neural networks perform well at many simple tasks, and provide good models of many
human abilities. However, there are many tasks that they are not so good at, and other
approaches seem more promising in those areas.
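A single artificial neuron of the kind just described (firing or not depending on the total activity feeding into it) can be written in a few lines. The weights and threshold below are chosen by hand so that the neuron computes logical AND; this is purely an illustration, not a trained network.

# A single threshold neuron: output 1 when the weighted sum of the inputs
# reaches the threshold. Weights/threshold are hand-picked to compute AND.

def neuron(inputs, weights, threshold):
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total >= threshold else 0

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", neuron([a, b], weights=[1.0, 1.0], threshold=1.5))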
Evolution
One advantage humans have over current machines/computers is that they have a long
evolutionary history.
Charles Darwin (1809 – 1882) is famous for his work on evolution by natural selection. The idea
is that fitter individuals will naturally tend to live longer and produce more children, and hence
after many generations a population will automatically emerge with good innate properties.
This has resulted in brains that have much structure, or even knowledge, built in at birth.
This gives them an advantage over simple artificial neural network systems that have to learn
everything.
Computers are finally becoming powerful enough that we can simulate evolution and evolve good
AI systems.
We can now even evolve systems (e.g. neural networks) so that they are good at learning.
A related field called genetic programming has had some success in evolving programs, rather
than programming them by hand.
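A very small genetic-algorithm sketch of the "simulate evolution" idea: a population of bit strings evolves toward a target by selection and mutation. The target string, population size, mutation rate, and number of generations are all arbitrary choices made for illustration, not parameters from any particular system.

import random

# Tiny genetic algorithm: evolve bit strings toward an arbitrary target string.
TARGET = [1, 0, 1, 1, 0, 1, 0, 1]

def fitness(individual):
    """Number of positions that already match the target."""
    return sum(1 for a, b in zip(individual, TARGET) if a == b)

def mutate(individual, rate=0.1):
    """Flip each bit with a small probability."""
    return [1 - bit if random.random() < rate else bit for bit in individual]

population = [[random.randint(0, 1) for _ in TARGET] for _ in range(20)]
for generation in range(50):
    population.sort(key=fitness, reverse=True)   # fitter individuals first
    parents = population[:10]                    # selection
    children = [mutate(random.choice(parents)) for _ in range(10)]
    population = parents + children              # next generation

best = max(population, key=fitness)
print(best, "fitness", fitness(best))            # usually reaches the target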
Sub-fields of Artificial Intelligence
Neural Networks – e.g. brain modelling, time series prediction, classification
Evolutionary Computation – e.g. genetic algorithms, genetic programming
Vision – e.g. object recognition, image understanding
Robotics – e.g. intelligent control, autonomous exploration
Expert Systems – e.g. decision support systems, teaching systems
Speech Processing– e.g. speech recognition and production
Natural Language Processing – e.g. machine translation
Planning – e.g. scheduling, game playing
Machine Learning – e.g. decision tree learning, version space learning
Speech Processing
As well as trying to understand human systems, there are also numerous real world applications:
speech recognition for dictation systems and voice activated control; speech production for
automated announcements and computer interfaces.
How do we get from sound waves to text streams and vice-versa?
AI and Related fields
Logical AI
What a program knows about the world in general, the facts of the specific situation in which it must
act, and its goals are all represented by sentences of some mathematical logical language. The
program decides what to do by inferring that certain actions are appropriate for achieving its goals.
Search
AI programs often examine large numbers of possibilities, e.g. moves in a chess game or inferences
by a theorem proving program. Discoveries are continually made about how to do this more
efficiently in various domains.
Pattern Recognition
When a program makes observations of some kind, it is often programmed to compare what
it sees with a pattern. For example, a vision program may try to match a pattern of eyes and a nose in
a scene in order to find a face. More complex patterns, e.g. in a natural language text, in a chess
position, or in the history of some event are also studied.
Representation
Facts about the world have to be represented in some way. Usually languages of mathematical logic
are used.
Inference
From some facts, others can be inferred. Mathematical logical deduction is adequate for some
purposes, but new methods of non-monotonic inference have been added to logic since the 1970s.
The simplest kind of non-monotonic reasoning is default reasoning in which a conclusion is to be
inferred by default, but the conclusion can be withdrawn if there is evidence to the contrary. For
example, when we hear of a bird, we may infer that it can fly, but this conclusion can be reversed
when we hear that it is a penguin. It is the possibility that a conclusion may have to be withdrawn
that constitutes the non-monotonic character of the reasoning. Ordinary logical reasoning is
monotonic in that the set of conclusions that can be drawn from a set of premises is a monotonically
increasing function of the premises.
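The bird/penguin example can be sketched as follows: the default conclusion "it can fly" is drawn unless contrary evidence has been recorded, and adding the fact that the bird is a penguin withdraws it. This deliberately simple representation only illustrates the non-monotonic behaviour; it is not a full non-monotonic logic.

# Default reasoning sketch: "birds fly" by default, withdrawn for exceptions.

def can_fly(known_facts):
    if "bird" not in known_facts:
        return None                    # nothing to conclude either way
    if "penguin" in known_facts:       # contrary evidence defeats the default
        return False
    return True                        # default conclusion

print(can_fly({"bird"}))               # True  (drawn by default)
print(can_fly({"bird", "penguin"}))    # False (conclusion withdrawn)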
Common sense knowledge and reasoning
This is the area in which AI is farthest from human-level, in spite of the fact that it has been an active
research area since the 1950s. While there has been considerable progress, e.g. in developing systems
of non-monotonic reasoning and theories of action, yet more new ideas are needed.
Learning from experience
Programs do that. The approaches to AI based on connectionism and neural nets specialize in that.
There is also learning of laws expressed in logic. Programs can only learn what facts or behaviors
their formalisms can represent, and unfortunately learning systems are almost all based on
very limited abilities to represent information.
Planning
Planning programs start with general facts about the world (especially facts about the effects of
actions), facts about the particular situation, and a statement of a goal. From these they generate a
strategy for achieving the goal.
Epistemology
This is a study of the kinds of knowledge that are required for solving problems in the world.
Ontology
Ontology is the study of the kinds of things that exist. In AI, the programs and sentences deal with
various kinds of objects, and we study what these kinds are and what their basic properties are.
Emphasis on ontology begins in the 1990s.
Heuristics
A heuristic is a way of trying to discover something or an idea embedded in a program. The term is
used variously in AI. Heuristic functions are used in some approaches to search to measure how far a
node in a search tree seems to be from a goal. Heuristic predicates that compare two nodes in a
search tree to see if one is better than the other, i.e. constitutes an advance toward the goal, may be
more useful.
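A standard concrete example is the misplaced-tiles heuristic for the 8-puzzle: count how many tiles are out of place, so that lower values suggest a node is closer to the goal. The sketch below also shows a heuristic predicate built from it; the goal layout is the conventional one, taken here as an assumption.

# A heuristic function for the 8-puzzle: number of misplaced tiles.
# Boards are 9-element tuples read row by row, with 0 denoting the blank.

GOAL = (1, 2, 3, 4, 5, 6, 7, 8, 0)

def misplaced_tiles(board):
    return sum(1 for tile, goal_tile in zip(board, GOAL)
               if tile != 0 and tile != goal_tile)

def better_node(a, b):
    """Heuristic predicate: is node a a greater advance toward the goal than b?"""
    return misplaced_tiles(a) < misplaced_tiles(b)

almost_done = (1, 2, 3, 4, 5, 6, 7, 0, 8)
scrambled   = (8, 7, 6, 5, 4, 3, 2, 1, 0)
print(misplaced_tiles(almost_done))           # 1
print(better_node(almost_done, scrambled))    # True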
Genetic Programming
Genetic programming is a technique for getting programs to solve a task by mating random Lisp
programs and selecting the fittest over millions of generations.
It’s one thing to say that the mind operates, at least in part, according to logical rules, and to build
physical systems that emulate some of those rules; it’s another to say that the mind itself is such a
physical system. René Descartes (1596–1650) gave the first clear discussion of the distinction between
mind and matter and of the problems that arise. One problem with a purely physical conception of the
mind is that it seems to leave little room for free will: if the mind is governed entirely by physical laws,
then it has no more free will than a rock “deciding” to fall toward the center of the earth. Descartes was a
strong advocate of the power of reasoning in understanding the world, a philosophy now called
rationalism, and one that counts Aristotle and Leibniz as members. But Descartes was also a proponent
of dualism. He held that there is a part of the human mind (or soul or spirit) that is outside of nature,
exempt from physical laws. Animals, on the other hand, did not possess this dual quality; they could be
treated as machines. An alternative to dualism is materialism, which holds that the brain’s operation
according to the laws of physics constitutes the mind. Free will is simply the way that the perception of
available choices appears to the choosing entity.
Given a physical mind that manipulates knowledge, the next problem is to establish the source of
knowledge. The empiricism movement, starting with Francis Bacon’s (1561–1626) Novum Organum, is
characterized by a dictum of John Locke (1632–1704): “Nothing is in the understanding, which was not
first in the senses.” David Hume’s (1711–1776) A Treatise of Human Nature (Hume, 1739) proposed
what is now known as the principle of induction: that general rules are acquired by exposure to repeated
associations between their elements. Building on the work of Ludwig Wittgenstein (1889–1951) and
Bertrand Russell (1872–1970), the famous Vienna Circle, led by Rudolf Carnap (1891–1970), developed
the doctrine of logical positivism. This doctrine holds that all knowledge can be characterized by logical
theories connected, ultimately, to observation sentences that correspond to sensory inputs; thus logical
positivism combines rationalism and empiricism. The confirmation theory of Carnap and Carl Hempel
(1905–1997) attempted to analyze the acquisition of knowledge from experience. Carnap’s book The
Logical Structure of the World (1928) defined an explicit computational procedure for extracting
knowledge from elementary experiences. It was probably the first theory of mind as a computational
process.
The final element in the philosophical picture of the mind is the connection between knowledge
and action. This question is vital to AI because intelligence requires action as well as reasoning.
Moreover, only by understanding how actions are justified can we understand how to build an agent
whose actions are justifiable (or rational). Aristotle argued (in De Motu Animalium) that actions are
justified by a logical connection between goals and knowledge of the action’s outcome:
But how does it happen that thinking is sometimes accompanied by action and sometimes not,
sometimes by motion, and sometimes not? It looks as if almost the same thing happens as in the case of
reasoning and making inferences about unchanging objects. But in that case the end is a speculative
proposition ... whereas here the conclusion which results from the two premises is an action. ... I need
covering; a cloak is a covering. I need a cloak. What I need, I have to make; I need a cloak. I have to
make a cloak. And the conclusion, the “I have to make a cloak,” is an action.
In the Nicomachean Ethics (Book III. 3, 1112b), Aristotle further elaborates on this topic, suggesting an
algorithm:
We deliberate not about ends, but about means. For a doctor does not deliberate whether he shall
heal, nor an orator whether he shall persuade, ... They assume the end and consider how and by what
means it is attained, and if it seems easily and best produced thereby; while if it is achieved by one means
only they consider how it will be achieved by this and by what means this will be achieved, till they come
to the first cause, ... and what is last in the order of analysis seems to be first in the order of becoming.
And if we come on an impossibility, we give up the search, e.g., if we need money and this cannot be got;
but if a thing appears possible we try to do it.
Aristotle’s algorithm was implemented 2300 years later by Newell and Simon in their GPS
program. We would now call it a regression planning system.
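A minimal sketch of goal regression in the spirit of this procedure: work backwards from the goal, pick an action whose effect achieves it, and recurse on that action's preconditions, giving up if some precondition cannot be achieved. The two action descriptions are invented solely to mirror the cloak example; this is not the actual GPS program.

# Goal-regression sketch: reason backwards from the goal to the first action.
# Each action is (name, preconditions, effect); the content is illustrative.
ACTIONS = [
    ("make cloak", ["have cloth"], "have covering"),
    ("get cloth",  [],             "have cloth"),
]

def plan(goal, have):
    """Return a list of actions achieving goal, or None if it is impossible."""
    if goal in have:
        return []
    for name, preconditions, effect in ACTIONS:
        if effect == goal:
            steps = []
            for p in preconditions:
                sub = plan(p, have)
                if sub is None:
                    break              # "if we come on an impossibility, we give up"
                steps += sub
            else:
                return steps + [name]
    return None

print(plan("have covering", have=set()))   # ['get cloth', 'make cloak']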
Goal-based analysis is useful, but does not say what to do when several actions will achieve the
goal or when no action will achieve it completely. Antoine Arnauld (1612–1694) correctly described a
quantitative formula for deciding what action to take in cases like this (see Chapter 16). John Stuart Mill’s
(1806–1873) book Utilitarianism (Mill, 1863) promoted the idea of rational decision criteria in all spheres
of human activity.
2. Mathematics
• What are the formal rules to draw valid conclusions?
• What can be computed?
• How do we reason with uncertain information?
Philosophers staked out some of the fundamental ideas of AI, but the leap to a formal science required a
level of mathematical formalization in three fundamental areas: logic, computation, and probability.
The idea of formal logic can be traced back to the philosophers of ancient Greece, but its
mathematical development really began with the work of George Boole (1815–1864), who worked out
the details of propositional, or Boolean, logic (Boole, 1847). In 1879, Gottlob Frege (1848–1925)
extended Boole’s logic to include objects and relations, creating the first-order logic that is used today.
Alfred Tarski (1902–1983) introduced a theory of reference that shows how to relate the objects in a logic
to objects in the real world.
The next step was to determine the limits of what could be done with logic and computation.
The first nontrivial algorithm is thought to be Euclid’s algorithm for computing greatest common divisors.
The word algorithm (and the idea of studying them) comes from al-Khowarazmi, a Persian mathematician
of the 9th century, whose writings also introduced Arabic numerals and algebra to Europe. Boole and
others discussed algorithms for logical deduction, and, by the late 19th century, efforts were under way to
formalize general mathematical reasoning as logical deduction. In 1930, Kurt Gödel (1906–1978)
showed that there exists an effective procedure to prove any true statement in the first-order logic of Frege
and Russell, but that first-order logic could not capture the principle of mathematical induction needed to
characterize the natural numbers. In 1931, Gödel showed that limits on deduction do exist. His
incompleteness theorem showed that in any formal theory as strong as Peano arithmetic (the elementary
theory of natural numbers), there are true statements that are undecidable in the sense that they have no
proof within the theory.
This fundamental result can also be interpreted as showing that some functions on the integers
cannot be represented by an algorithm—that is, they cannot be computed. This motivated Alan Turing
(1912–1954) to try to characterize exactly which functions are computable, that is, capable of being computed.
This notion is actually slightly problematic because the notion of a computation or effective procedure
really cannot be given a formal definition. However, the Church–Turing thesis, which states that the
Turing machine (Turing, 1936) is capable of computing any computable function, is generally accepted as
providing a sufficient definition. Turing also showed that there were some functions that no Turing
machine can compute. For example, no machine can tell in general whether a given program will return
an answer on a given input or run forever.
Although decidability and computability are important to an understanding of computation, the
notion of tractability has had an even greater impact. Roughly speaking, a problem is called intractable if
the time required to solve instances of the problem grows exponentially with the size of the instances. The
distinction between polynomial and exponential growth in complexity was first emphasized in the
mid-1960s (Cobham, 1964; Edmonds, 1965). It is important because exponential growth means that even
moderately large instances cannot be solved in any reasonable time. Therefore, one should strive to divide
the overall problem of generating intelligent behavior into tractable subproblems rather than intractable
ones.
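As a rough illustration of why this distinction matters (a minimal sketch with hypothetical operation counts, not tied to any particular problem discussed above):

```python
# Polynomial vs. exponential growth in the cost of solving larger instances.
for n in (10, 20, 40, 80):
    polynomial = n ** 3       # grows manageably with instance size
    exponential = 2 ** n      # quickly becomes astronomically large
    print(f"n={n:3d}  n^3={polynomial:,}  2^n={exponential:,}")
```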
How can one recognize an intractable problem? The theory of NP-completeness, pioneered by Steven Cook (1971) and Richard Karp (1972), provides a method. Cook and Karp showed the existence of large classes of canonical combinatorial search and reasoning problems that are NP-complete. Any problem class to which the class of NP-complete problems can be reduced is likely to be intractable. (Although it has not been proved that NP-complete problems are necessarily intractable, most theoreticians believe it.) These results contrast with the optimism with which the popular press greeted the first computers—“Electronic Super-Brains” that were “Faster than Einstein!” Despite the increasing speed of computers, careful use of resources will characterize intelligent systems. Put crudely, the world is an extremely large problem instance! Work in AI has helped explain why some instances of NP-complete problems are hard, yet others are easy (Cheeseman et al., 1991).
Besides logic and computation, the third great contribution of mathematics to AI is the theory of probability. The Italian Gerolamo Cardano (1501–1576) first framed the idea of probability, describing it in terms of the possible outcomes of gambling events. In 1654, Blaise Pascal (1623–1662), in a letter to Pierre Fermat (1601–1665), showed how to predict the future of an unfinished gambling game and assign average payoffs to the gamblers. Probability quickly became an invaluable part of all the quantitative sciences, helping to deal with uncertain measurements and incomplete theories. James Bernoulli (1654–1705), Pierre Laplace (1749–1827), and others advanced the theory and introduced new statistical methods. Thomas Bayes (1702–1761) proposed a rule for updating probabilities in the light of new evidence. Bayes’ rule underlies most modern approaches to uncertain reasoning in AI systems.
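As a reminder of the rule itself, Bayes’ rule states that P(H | E) = P(E | H) P(H) / P(E). The snippet below applies it to a hypothetical diagnostic test; all the numbers are illustrative assumptions, not data from these notes:

```python
# Bayes' rule: P(H | E) = P(E | H) * P(H) / P(E)
p_disease = 0.01                # prior P(H): 1% of patients have the disease
p_pos_given_disease = 0.95      # likelihood P(E | H): test sensitivity
p_pos_given_healthy = 0.05      # false-positive rate P(E | not H)

# Total probability of a positive test, P(E)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior probability of disease given a positive test
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))   # 0.161 -- still low, because the prior is low
```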
3. Economics
• How should we make decisions so as to maximize payoff?
• How should we do this when others may not go along?
• How should we do this when the payoff may be far in the future?
The science of economics got its start in 1776, when Scottish philosopher Adam Smith (1723–1790) published An Inquiry into the Nature and Causes of the Wealth of Nations. While the ancient Greeks and others had made contributions to economic thought, Smith was the first to treat it as a science, using the idea that economies can be thought of as consisting of individual agents maximizing their own economic well-being. Most people think of economics as being about money, but economists will say that they are really studying how people make choices that lead to preferred outcomes. When McDonald’s offers a hamburger for a dollar, they are asserting that they would prefer the dollar and hoping that customers will prefer the hamburger. The mathematical treatment of “preferred outcomes” or utility was first formalized by Léon Walras (pronounced “Valrasse”) (1834–1910) and was improved by Frank Ramsey (1931) and later by John von Neumann and Oskar Morgenstern in their book The Theory of Games and Economic Behavior (1944).
Decision theory, which combines probability theory with utility theory, provides a formal and complete framework for decisions (economic or otherwise) made under uncertainty—that is, in cases where probabilistic descriptions appropriately capture the decision maker’s environment. This is suitable for “large” economies where each agent need pay no attention to the actions of other agents as individuals. For “small” economies, the situation is much more like a game: the actions of one player can significantly affect the utility of another (either positively or negatively). Von Neumann and Morgenstern’s development of game theory (see also Luce and Raiffa, 1957) included the surprising result that, for some games, a rational agent should adopt policies that are (or at least appear to be) randomized. Unlike decision theory, game theory does not offer an unambiguous prescription for selecting actions.
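To make the decision-theoretic idea concrete, here is a minimal sketch of choosing the action with maximum expected utility; the actions, probabilities, and utilities are invented for illustration:

```python
# Expected utility: EU(action) = sum over outcomes of P(outcome) * U(outcome).
actions = {
    "take_umbrella":  [(0.3, 60), (0.7, 80)],   # (P(outcome), utility) pairs: rain, no rain
    "leave_umbrella": [(0.3, 0),  (0.7, 100)],
}

def expected_utility(outcomes):
    return sum(p * u for p, u in outcomes)

best = max(actions, key=lambda a: expected_utility(actions[a]))
print(best, round(expected_utility(actions[best]), 2))   # take_umbrella 74.0
```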
For the most part, economists did not address the third question listed above, namely, how to make rational decisions when payoffs from actions are not immediate but instead result from several actions taken in sequence. This topic was pursued in the field of operations research, which emerged in World War II from efforts in Britain to optimize radar installations, and later found civilian applications in complex management decisions. The work of Richard Bellman (1957) formalized a class of sequential decision problems called Markov decision processes.
Work in economics and operations research has contributed much to our notion of rational agents, yet for many years AI research developed along entirely separate paths. One reason was the apparent complexity of making rational decisions. The pioneering AI researcher Herbert Simon (1916–2001) won the Nobel Prize in economics in 1978 for his early work showing that models based on satisficing—making decisions that are “good enough,” rather than laboriously calculating an optimal decision—gave a better description of actual human behavior (Simon, 1947). Since the 1990s, there has been a resurgence of interest in decision-theoretic techniques for agent systems (Wellman, 1995).
4. Neuroscience
• How do brains process information?
Neuroscience is the study of the nervous system, particularly the brain. Although the exact way in which the brain enables thought is one of the great mysteries of science, the fact that it does enable thought has been appreciated for thousands of years because of the evidence that strong blows to the head can lead to mental incapacitation. It has also long been known that human brains are somehow different; in about 335 B.C. Aristotle wrote, “Of all the animals, man has the largest brain in proportion to his size.” Still, it was not until the middle of the 18th century that the brain was widely recognized as the seat of consciousness. Before then, candidate locations included the heart and the spleen.
Paul Broca’s (1824–1880) study of aphasia (speech deficit) in brain-damaged patients in 1861 demonstrated the existence of localized areas of the brain responsible for specific cognitive functions. In particular, he showed that speech production was localized to the portion of the left hemisphere now called Broca’s area. By that time, it was known that the brain consisted of nerve cells, or neurons, but it was not until 1873 that Camillo Golgi (1843–1926) developed a staining technique allowing the observation of individual neurons in the brain (see Figure 1.2). This technique was used by Santiago Ramon y Cajal (1852–1934) in his pioneering studies of the brain’s neuronal structures. Nicolas Rashevsky (1936, 1938) was the first to apply mathematical models to the study of the nervous system.
Fig: The parts of a nerve cell or neuron. Each neuron consists of a cell body, or soma, that contains a cell
nucleus. Branching out from the cell body are a number of fibers called dendrites and a single long fiber
called the axon. The axon stretches out for a long distance, much longer than the scale in this diagram
indicates. Typically, an axon is 1 cm long (100 times the diameter of the cell body), but can reach up to 1
meter. A neuron makes connections with 10 to 100,000 other neurons at junctions called synapses.
Signals are propagated from neuron to neuron by a complicated electrochemical reaction. The signals
control brain activity in the short term and also enable long-term changes in the connectivity of neurons.
These mechanisms are thought to form the basis for learning in the brain. Most information processing goes on in the cerebral cortex, the outer layer of the brain. The basic organizational unit appears to be a column of tissue about 0.5 mm in diameter, containing about 20,000 neurons and extending the full depth of the cortex (about 4 mm in humans).
We now have some data on the mapping between areas of the brain and the parts of the body that
they control or from which they receive sensory input. Such mappings are able to change radically over
the course of a few weeks, and some animals seem to have multiple maps. Moreover, we do not fully
understand how other areas can take over functions when one area is damaged. There is almost no theory
on how an individual memory is stored.
The measurement of intact brain activity began in 1929 with the invention by Hans Berger of the electroencephalograph (EEG). The recent development of functional magnetic resonance imaging (fMRI) (Ogawa et al., 1990; Cabeza and Nyberg, 2001) is giving neuroscientists unprecedentedly detailed images of brain activity, enabling measurements that correspond in interesting ways to ongoing cognitive processes. These are augmented by advances in single-cell recording of neuron activity. Individual neurons can be stimulated electrically, chemically, or even optically (Han and Boyden, 2007), allowing neuronal input–output relationships to be mapped. Despite these advances, we are still a long way from understanding how cognitive processes actually work.
The truly amazing conclusion is that a collection of simple cells can lead to thought, action, and consciousness or, in the pithy words of John Searle (1992), brains cause minds.
Fig: A crude comparison of the raw computational resources available to the IBM BLUE GENE supercomputer, a typical personal computer of 2008, and the human brain. The brain’s numbers are essentially fixed, whereas the supercomputer’s numbers have been increasing by a factor of 10 every 5 years or so, allowing it to achieve rough parity with the brain. The personal computer lags behind on all metrics except cycle time.
The only real alternative theory is mysticism: that minds operate in some mystical realm that is beyond
physical science.
Brains and digital computers have somewhat different properties. Figure 1.3 shows that
computers have a cycle time that is a million times faster than a brain. The brain makes up for that with
far more storage and interconnection than even a high-end personal computer, although the largest
supercomputers have a capacity that is similar to the brain’s. (It should be noted, however, that the brain
does not seem to use all of its neurons simultaneously.) Futurists make much of these numbers, pointing
to an approaching singularity at which computers reach a superhuman level of performance (Vinge, 1993;
Kurzweil, 2005), but the raw comparisons are not especially informative. Even with a computer of
virtually unlimited capacity, we still would not know how to achieve the brain’s level of intelligence.
5. Psychology
• How do humans and animals think and act?
The origins of scientific psychology are usually traced to the work of the German physicist Hermann von Helmholtz (1821–1894) and his student Wilhelm Wundt (1832–1920). Helmholtz applied the scientific method to the study of human vision, and his Handbook of Physiological Optics is even now described as “the single most important treatise on the physics and physiology of human vision” (Nalwa, 1993, p. 15). In 1879, Wundt opened the first laboratory of experimental psychology, at the University of Leipzig. Wundt insisted on carefully controlled experiments in which his workers would perform a perceptual or associative task while introspecting on their thought processes. The careful controls went a long way toward making psychology a science, but the subjective nature of the data made it unlikely that an experimenter would ever disconfirm his or her own theories. Biologists studying animal behavior, on
the other hand, lacked introspective data and developed an objective methodology, as described by H. S. Jennings (1906) in his influential work Behavior of the Lower Organisms. Applying this viewpoint to humans, the behaviorism movement, led by John Watson (1878–1958), rejected any theory involving mental processes on the grounds that introspection could not provide reliable evidence. Behaviorists insisted on studying only objective measures of the percepts (or stimulus) given to an animal and its resulting actions (or response). Behaviorism discovered a lot about rats and pigeons but had less success at understanding humans.
Cognitive psychology, which views the brain as an information-processing device, can be traced back at least to the works of William James (1842–1910). Helmholtz also insisted that perception involved a form of unconscious logical inference. The cognitive viewpoint was largely eclipsed by behaviorism in the United States, but at Cambridge’s Applied Psychology Unit, directed by Frederic Bartlett (1886–1969), cognitive modeling was able to flourish. The Nature of Explanation, by Bartlett’s student and successor Kenneth Craik (1943), forcefully reestablished the legitimacy of such “mental” terms as beliefs and goals, arguing that they are just as scientific as, say, using pressure and temperature to talk about gases, despite their being made of molecules that have neither. Craik specified the three key steps of a knowledge-based agent: (1) the stimulus must be translated into an internal representation, (2) the representation is manipulated by cognitive processes to derive new internal representations, and (3) these are in turn retranslated back into action. He clearly explained why this was a good design for an agent:
If the organism carries a “small-scale model” of external reality and of its own possible actions within its
head, it is able to try out various alternatives, conclude which is the best of them, react to future situations
before they arise, utilize the knowledge of past events in dealing with the present and future, and in every
way to react in a much fuller, safer, and more competent manner to the emergencies which face it.
After Craik’s death in a bicycle accident in 1945, his work was continued by Donald Broadbent, whose book Perception and Communication (1958) was one of the first works to model psychological phenomena as information processing. Meanwhile, in the United States, the development of computer modeling led to the creation of the field of cognitive science. The field can be said to have started at a workshop in September 1956 at MIT. (We shall see that this is just two months after the conference at which AI itself was “born.”) At the workshop, George Miller presented The Magic Number Seven, Noam Chomsky presented Three Models of Language, and Allen Newell and Herbert Simon presented The Logic Theory Machine. These three influential papers showed how computer models could be used to address the psychology of memory, language, and logical thinking, respectively. It is now a common (although far from universal) view among psychologists that “a cognitive theory should be like a computer program” (Anderson, 1980); that is, it should describe a detailed information-processing mechanism whereby some cognitive function might be implemented.
6. Computer engineering
• How can we build an efficient computer?
For artificial intelligence to succeed, we need two things: intelligence and an artifact. The computer has been the artifact of choice. The modern digital electronic computer was invented independently and almost simultaneously by scientists in three countries embattled in World War II. The first operational computer was the electromechanical Heath Robinson, built in 1940 by Alan Turing’s team for a single purpose: deciphering German messages. In 1943, the same group developed the Colossus, a powerful general-purpose machine based on vacuum tubes. The first operational programmable computer was the Z-3, the invention of Konrad Zuse in Germany in 1941. Zuse also invented floating-point numbers and the first high-level programming language, Plankalkül. The first electronic computer, the ABC, was assembled by John Atanasoff and his student Clifford Berry between 1940 and 1942 at Iowa State University. Atanasoff’s research received little support or recognition; it was the ENIAC, developed as part of a secret military project at the University of Pennsylvania by a team including John Mauchly and John Eckert, that proved to be the most influential forerunner of modern computers.
Since that time, each generation of computer hardware has brought an increase in speed and
capacity and a decrease in price. Performance doubled every 18 months or so until around 2005, when
power dissipation problems led manufacturers to start multiplying the number of CPU cores rather than
the clock speed. Current expectations are that future increases in power will come from massive
parallelism—a curious convergence with the properties of the brain.
Of course, there were calculating devices before the electronic computer. The earliest automated machines date from the 17th century. The first programmable machine was a loom, devised in 1805 by Joseph Marie Jacquard (1752–1834), that used punched cards to store instructions for the pattern to be woven. In the mid-19th century, Charles Babbage (1792–1871) designed two machines, neither of which he completed. The Difference Engine was intended to compute mathematical tables for engineering and scientific projects. It was finally built and shown to work in 1991 at the Science Museum in London (Swade, 2000). Babbage’s Analytical Engine was far more ambitious: it included addressable memory, stored programs, and conditional jumps and was the first artifact capable of universal computation. Babbage’s colleague Ada Lovelace, daughter of the poet Lord Byron, was perhaps the world’s first programmer. (The programming language Ada is named after her.) She wrote programs for the unfinished Analytical Engine and even speculated that the machine could play chess or compose music.
AI also owes a debt to the software side of computer science, which has supplied the operating systems, programming languages, and tools needed to write modern programs (and papers about them). But this is one area where the debt has been repaid: work in AI has pioneered many ideas that have made their way back to mainstream computer science, including time sharing, interactive interpreters, personal computers with windows and mice, rapid development environments, the linked list data type, automatic storage management, and key concepts of symbolic, functional, declarative, and object-oriented programming.
7. Control theory and cybernetics
• How can artifacts operate under their own control?
Ktesibios of Alexandria (c. 250 B.C.) built the first self-controlling machine: a water clock with a
regulator that maintained a constant flow rate. This invention changed the definition of what an artifact
could do. Previously, only living things could modify their behavior in response to changes in the
environment. Other examples of self-regulating feedback control systems include the steam engine
governor, created by James Watt (1736–1819), and the thermostat, invented by Cornelis Drebbel (1572–
1633), who also invented the submarine. The mathematical theory of stable feedback systems was
developed in the 19th century.
The central figure in the creation of what is now called control theory was Norbert Wiener (1894–1964). Wiener was a brilliant mathematician who worked with Bertrand Russell, among others, before developing an interest in biological and mechanical control systems and their connection to cognition. Like Craik (who also used control systems as psychological models), Wiener and his colleagues Arturo Rosenblueth and Julian Bigelow challenged the behaviorist orthodoxy (Rosenblueth et al., 1943). They viewed purposive behavior as arising from a regulatory mechanism trying to minimize “error”—the difference between current state and goal state. In the late 1940s, Wiener, along with Warren McCulloch, Walter Pitts, and John von Neumann, organized a series of influential conferences that explored the new mathematical and computational models of cognition. Wiener’s book Cybernetics (1948) became a bestseller and awoke the public to the possibility of artificially intelligent machines. Meanwhile, in Britain, W. Ross Ashby (Ashby, 1940) pioneered similar ideas. Ashby, Alan Turing, Grey Walter, and others formed the Ratio Club for “those who had Wiener’s ideas before Wiener’s book appeared.” Ashby’s Design for a Brain (1948, 1952) elaborated on his idea that intelligence could be created by the use of homeostatic devices containing appropriate feedback loops to achieve stable adaptive behavior.
8. Linguistics
• How does language relate to thought?
In 1957, B. F. Skinner published Verbal Behavior. This was a comprehensive, detailed account of the behaviorist approach to language learning, written by the foremost expert in the field. But curiously, a review of the book became as well known as the book itself, and served to almost kill off interest in behaviorism. The author of the review was the linguist Noam Chomsky, who had just published a book on his own theory, Syntactic Structures. Chomsky pointed out that the behaviorist theory did not address the notion of creativity in language—it did not explain how a child could understand and make up sentences that he or she had never heard before. Chomsky’s theory—based on syntactic models going back to the Indian linguist Panini (c. 350 B.C.)—could explain this, and unlike previous theories, it was formal enough that it could in principle be programmed.
Modern linguistics and AI, then, were “born” at about the same time, and grew up together, intersecting
in a hybrid field called computational linguistics or natural language processing. The problem of
understanding language soon turned out to be considerably more complex than it seemed in 1957.
Understanding language requires an understanding of the subject matter and context, not just an
understanding of the structure of sentences. This might seem obvious, but it was not widely appreciated
until the 1960s. Much of the early work in knowledge representation (the study of how to put knowledge
into a form that a computer can reason with) was tied to language and informed by research in linguistics,
which was connected in turn to decades of work on the philosophical analysis of language.
The History of Artificial Intelligence
1. The gestation of artificial intelligence (1943–1955)
The first work that is now generally recognized as AI was done by Warren McCulloch and Walter Pitts
(1943). They drew on three sources: knowledge of the basic physiology and function of neurons in the
brain; a formal analysis of propositional logic due to Russell and Whitehead; and Turing’s theory of
computation. They proposed a model of artificial neurons in which each neuron is characterized as being
“on” or “off,” with a switch to “on” occurring in response to stimulation by a sufficient number of
neighboring neurons. The state of a neuron was conceived of as “factually equivalent to a proposition which proposed its adequate stimulus.” They showed, for example, that any computable function could be
computed by some network of connected neurons, and that all the logical connectives (and, or, not, etc.)
could be implemented by simple net structures. McCulloch and Pitts also suggested that suitably defined
networks could learn. Donald Hebb (1949) demonstrated a simple updating rule for modifying the
connection strengths between neurons. His rule, now called Hebbian learning, remains an influential
model to this day.
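Hebb’s rule is often summarized as strengthening a connection whenever the neurons on both sides of it are active together. A minimal sketch of the update (the learning rate and activity values are illustrative assumptions):

```python
# Hebbian learning: delta_w = eta * x_pre * y_post
def hebbian_update(weight, x_pre, y_post, eta=0.1):
    """Return the updated connection strength after one co-activation step."""
    return weight + eta * x_pre * y_post

w = 0.0
for x_pre, y_post in [(1, 1), (1, 1), (0, 1), (1, 0)]:   # illustrative activity pairs
    w = hebbian_update(w, x_pre, y_post)
print(w)   # 0.2 -- only the two co-active steps strengthened the connection
```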
Two undergraduate students at Harvard, Marvin Minsky and Dean Edmonds, built the first neural
network computer in 1950. The SNARC, as it was called, used 3000 vacuum tubes and a surplus
automatic pilot mechanism from a B-24 bomber to simulate a network of 40 neurons. Later, at Princeton,
Minsky studied universal computation in neural networks. His Ph.D. committee was skeptical about
whether this kind of work should be considered mathematics, but von Neumann reportedly said, “If it
isn’t now, it will be someday.” Minsky was later to prove influential theorems showing the limitations of
neural network research.
There were a number of early examples of work that can be characterized as AI, but Alan
Turing’s vision was perhaps the most influential. He gave lectures on the topic as early as 1947 at the
London Mathematical Society and articulated a persuasive agenda in his 1950 article “Computing
Machinery and Intelligence.” Therein, he introduced the Turing Test, machine learning, genetic
algorithms, and reinforcement learning. He proposed the Child Programme idea, explaining “Instead of
trying to produce a programme to simulate the adult mind, why not rather try to produce one which
simulated the child’s?”
2. The birth of artificial intelligence (1956)
Princeton was home to another influential figure in AI, John McCarthy. After receiving his PhD there in 1951 and working for two years as an instructor, McCarthy moved to Stanford and then to Dartmouth College, which was to become the official birthplace of the field. McCarthy convinced Minsky, Claude Shannon, and Nathaniel Rochester to help him bring together U.S. researchers interested in automata theory, neural nets, and the study of intelligence. They organized a two-month workshop at Dartmouth in the summer of 1956. The proposal states:
We propose that a 2 month, 10 man study of artificial intelligence be carried out during the summer of 1956 at Dartmouth College in Hanover, New Hampshire. The study is to proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it. An attempt will be made to find how to make machines use language, form abstractions and concepts, solve kinds of problems now reserved for humans, and improve themselves. We think that a significant advance can be made in one or more of these problems if a carefully selected group of scientists work on it together for a summer.
There were 10 attendees in all, including Trenchard More from Princeton, Arthur Samuel from
IBM, and Ray Solomonoff and Oliver Selfridge from MIT.
Two researchers from Carnegie Tech, Allen Newell and Herbert Simon, rather stole the show. Although the others had ideas and in some cases programs for particular applications such as checkers, Newell and Simon already had a reasoning program, the Logic Theorist (LT), about which Simon claimed, “We have invented a computer program capable of thinking non-numerically, and thereby solved the venerable mind–body problem.” Russell was reportedly delighted when Simon showed him that the program had come up with a proof for one theorem that was shorter than the one in Principia. The editors of the Journal of Symbolic Logic were less impressed; they rejected a paper coauthored by Newell, Simon, and Logic Theorist.
The Dartmouth workshop did not lead to any new breakthroughs, but it did introduce all the
major figures to each other. For the next 20 years, the field would be dominated by these people and their
students and colleagues at MIT, CMU, Stanford, and IBM.
Looking at the proposal for the Dartmouth workshop (McCarthy et al., 1955), we can see why it
was necessary for AI to become a separate field. Why couldn’t all the work done in AI have taken place
under the name of control theory or operations research or decision theory, which, after all, have
objectives similar to those of AI? Or why isn’t AI a branch of mathematics? The first answer is that AI
from the start embraced the idea of duplicating human faculties such as creativity, self-improvement, and
language use. None of the other fields were addressing these issues. The second answer is methodology.
AI is the only one of these fields that is clearly a branch of computer science (although operations
research does share an emphasis on computer simulations), and AI is the only field to attempt to build
machines that will function autonomously in complex, changing environments.
3. Early enthusiasm, great expectations (1952–1969)
The early years of AI were full of successes—in a limited way. Given the primitive computers and programming tools of the time and the fact that only a few years earlier computers were seen as things that could do arithmetic and no more, it was astonishing whenever a computer did anything remotely clever. The intellectual establishment, by and large, preferred to believe that “a machine can never do X.” AI researchers naturally responded by demonstrating one X after another. John McCarthy referred to this period as the “Look, Ma, no hands!” era.
Newell and Simon’s early success was followed up with the General Problem Solver, or GPS. Unlike Logic Theorist, this program was designed from the start to imitate human problem-solving protocols. Within the limited class of puzzles it could handle, it turned out that the order in which the program considered subgoals and possible actions was similar to that in which humans approached the same problems. Thus, GPS was probably the first program to embody the “thinking humanly” approach. The success of GPS and subsequent programs as models of cognition led Newell and Simon (1976) to formulate the famous physical symbol system hypothesis, which states that “a physical symbol system has the necessary and sufficient means for general intelligent action.” What they meant is that any system (human or machine) exhibiting intelligence must operate by manipulating data structures composed of symbols. We will see later that this hypothesis has been challenged from many directions.
At IBM, Nathaniel Rochester and his colleagues produced some of the first AI programs. Herbert Gelernter (1959) constructed the Geometry Theorem Prover, which was able to prove theorems that many students of mathematics would find quite tricky. Starting in 1952, Arthur Samuel wrote a series of programs for checkers (draughts) that eventually learned to play at a strong amateur level. Along the way, he disproved the idea that computers can do only what they are told to: his program quickly learned to play a better game than its creator. The program was demonstrated on television in February 1956, creating a strong impression. Like Turing, Samuel had trouble finding computer time. Working at night, he used machines that were still on the testing floor at IBM’s manufacturing plant.
John McCarthy moved from Dartmouth to MIT and there made three crucial contributions in one historic year: 1958. In MIT AI Lab Memo No. 1, McCarthy defined the high-level language Lisp, which was to become the dominant AI programming language for the next 30 years. With Lisp, McCarthy had the tool he needed, but access to scarce and expensive computing resources was also a serious problem. In response, he and others at MIT invented time sharing. Also in 1958, McCarthy published a paper entitled Programs with Common Sense, in which he described the Advice Taker, a hypothetical program that can be seen as the first complete AI system. Like the Logic Theorist and Geometry Theorem Prover, McCarthy’s program was designed to use knowledge to search for solutions to problems. But unlike the others, it was to embody general knowledge of the world. For example, he showed how some simple axioms would enable the program to generate a plan to drive to the airport. The program was also designed to accept new axioms in the normal course of operation, thereby allowing it to achieve competence in new areas without being reprogrammed. The Advice Taker thus embodied the central principles of knowledge representation and reasoning: that it is useful to have a formal, explicit representation of the world and its workings and to be able to manipulate that representation with deductive processes. It is remarkable how much of the 1958 paper remains relevant today.
1958 also marked the year that Marvin Minsky moved to MIT. His initial collaboration with McCarthy did not last, however. McCarthy stressed representation and reasoning in formal logic, whereas Minsky was more interested in getting programs to work and eventually developed an anti-logic outlook. In 1963, McCarthy started the AI lab at Stanford. His plan to use logic to build the ultimate Advice Taker was advanced by J. A. Robinson’s discovery in 1965 of the resolution method. Work at Stanford emphasized general-purpose methods for logical reasoning. Applications of logic included Cordell Green’s question-answering and planning systems (Green, 1969b) and the Shakey robotics project at the Stanford Research Institute (SRI).
Minsky supervised a series of students who chose limited problems that appeared to require intelligence to solve. These limited domains became known as microworlds. James Slagle’s SAINT program (1963) was able to solve closed-form calculus integration problems typical of first-year college courses. Tom Evans’s ANALOGY program (1968) solved geometric analogy problems that appear in IQ tests. Daniel Bobrow’s STUDENT program (1967) solved algebra story problems, such as the following:
If the number of customers Tom gets is twice the square of 20 percent of the number of advertisements he runs, and the number of advertisements he runs is 45, what is the number of customers Tom gets?
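For reference, the arithmetic behind this example (a direct calculation, not STUDENT’s symbolic procedure) is:

```python
advertisements = 45
customers = 2 * (0.2 * advertisements) ** 2   # twice the square of 20 percent of 45
print(int(customers))   # 162
```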
Fig: A scene from the blocks world. SHRDLU (Winograd, 1972) has just completed the command “Find
a block which is taller than the one you are holding and put it in the box.”
The most famous microworld was the blocks world, which consists of a set of solid blocks placed on a tabletop (or more often, a simulation of a tabletop), as shown in Figure 1.4. A typical task in this world is to rearrange the blocks in a certain way, using a robot hand that can pick up one block at a time. The blocks world was home to the vision project of David Huffman (1971), the vision and constraint-propagation work of David Waltz (1975), the learning theory of Patrick Winston (1970), the natural-language-understanding program of Terry Winograd (1972), and the planner of Scott Fahlman (1974).
Early work building on the neural networks of McCulloch and Pitts also flourished. The work of Winograd and Cowan (1963) showed how a large number of elements could collectively represent an individual concept, with a corresponding increase in robustness and parallelism. Hebb’s learning methods were enhanced by Bernie Widrow (Widrow and Hoff, 1960; Widrow, 1962), who called his networks adalines, and by Frank Rosenblatt (1962) with his perceptrons. The perceptron convergence theorem (Block et al., 1962) says that the learning algorithm can adjust the connection strengths of a perceptron to match any input data, provided such a match exists.
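A minimal sketch of the perceptron learning rule referred to here, on a tiny linearly separable problem (the data set and integer-weight update scheme are illustrative choices, not taken from the notes above):

```python
# Perceptron learning on the logical AND function.
# Each misclassified example nudges the weights toward the correct side.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]   # (inputs, target)
w = [0, 0]
b = 0

for _ in range(10):                  # more than enough passes for this tiny data set
    for (x1, x2), target in data:
        prediction = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
        error = target - prediction  # -1, 0, or +1
        w[0] += error * x1
        w[1] += error * x2
        b += error

print(w, b)   # [2, 1] -2 : a separating line for AND
print([1 if w[0] * x1 + w[1] * x2 + b > 0 else 0 for (x1, x2), _ in data])   # [0, 0, 0, 1]
```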
4. A dose of reality (1966–1973)
From the beginning, AI researchers were not shy about making predictions of their coming successes.
The following statement by Herbert Simon in 1957 is often quoted:
It is not my aim to surprise or shock you—but the simplest way I can summarize is to say that
there are now in the world machines that think, that learn and that create. Moreover, their ability to do
these things is going to increase rapidly until—in a visible future—the range of problems they can
handle will be coextensive with the range to which the human mind has been applied.
Terms such as “visible future” can be interpreted in various ways, but Simon also made more concrete predictions: that within 10 years a computer would be chess champion, and a significant mathematical theorem would be proved by machine. These predictions came true (or approximately true) within 40 years rather than 10. Simon’s overconfidence was due to the promising performance of early AI systems on simple examples. In almost all cases, however, these early systems turned out to fail miserably when tried out on wider selections of problems and on more difficult problems.
The first kind of difficulty arose because most early programs knew nothing of their subject matter; they succeeded by means of simple syntactic manipulations. A typical story occurred in early machine translation efforts, which were generously funded by the U.S. National Research Council in an attempt to speed up the translation of Russian scientific papers in the wake of the Sputnik launch in 1957. It was thought initially that simple syntactic transformations based on the grammars of Russian and English, and word replacement from an electronic dictionary, would suffice to preserve the exact meanings of sentences. The fact is that accurate translation requires background knowledge in order to resolve ambiguity and establish the content of the sentence. The famous retranslation of “the spirit is willing but the flesh is weak” as “the vodka is good but the meat is rotten” illustrates the difficulties encountered. In 1966, a report by an advisory committee found that “there has been no machine translation of general scientific text, and none is in immediate prospect.” All U.S. government funding for academic translation projects was canceled. Today, machine translation is an imperfect but widely used tool for technical, commercial, government, and Internet documents.
The second kind of difficulty was the intractability of many of the problems that AI was
attempting to solve. Most of the early AI programs solved problems by trying out different
combinations of steps until the solution was found. This strategy worked initially because microworlds
contained very few objects and hence very few possible actions and very short solution sequences.
Before the theory of computational complexity was developed, it was widely thought that “scaling up”
to larger problems was simply a matter of faster hardware and larger memories. The optimism that
accompanied the development of resolution theorem proving, for example, was soon dampened when
researchers failed to prove theorems involving more than a few dozen facts. The fact that a program
can find a solution in principle does not mean that the program contains any of the mechanisms
needed to find it in practice.
The illusion of unlimited computational power was not confined to problem-solving programs. Early experiments in machine evolution (now called genetic algorithms) (Friedberg, 1958; Friedberg et al., 1959) were based on the undoubtedly correct belief that by making an appropriate series of small mutations to a machine-code program, one can generate a program with good performance for any particular task. The idea, then, was to try random mutations with a selection process to preserve mutations that seemed useful. Despite thousands of hours of CPU time, almost no progress was demonstrated. Modern genetic algorithms use better representations and have shown more success.
Failure to come to grips with the “combinatorial explosion” was one of the main criticisms of AI contained in the Lighthill report (Lighthill, 1973), which formed the basis for the decision by the British government to end support for AI research in all but two universities. (Oral tradition paints a somewhat different and more colorful picture, with political ambitions and personal animosities whose description is beside the point.)
A third difficulty arose because of some fundamental limitations on the basic structures being used to generate intelligent behavior. For example, Minsky and Papert’s book Perceptrons (1969) proved that, although perceptrons (a simple form of neural network) could be shown to learn anything they were capable of representing, they could represent very little. In particular, a two-input perceptron (restricted to be simpler than the form Rosenblatt originally studied) could not be trained to recognize when its two inputs were different. Although their results did not apply to more complex, multilayer networks, research funding for neural-net research soon dwindled to almost nothing. Ironically, the new back-propagation learning algorithms for multilayer networks that were to cause an enormous resurgence in neural-net research in the late 1980s were actually discovered first in 1969 (Bryson and Ho, 1969).
5. Knowledge-based systems: The key to power? (1969–1979)
The picture of problem solving that had arisen during the first decade of AI research was of a general-
purpose search mechanism trying to string together elementary reasoning steps to find complete
solutions. Such approaches have been called weak methods because, although general, they do not scale
up to large or difficult problem instances. The alternative to weak methods is to use more powerful,
domain-specific knowledge that allows larger reasoning steps and can more easily handle typically
occurring cases in narrow areas of expertise. One might say that to solve a hard problem, you have to
almost know the answer already.
The DENDRAL program (Buchanan et al., 1969) was an early example of this approach. It was developed at Stanford, where Ed Feigenbaum (a former student of Herbert Simon), Bruce Buchanan (a philosopher turned computer scientist), and Joshua Lederberg (a Nobel laureate geneticist) teamed up to solve the problem of inferring molecular structure from the information provided by a mass spectrometer. The input to the program consists of the elementary formula of the molecule (e.g., C6H13NO2) and the mass spectrum giving the masses of the various fragments of the molecule generated when it is bombarded by an electron beam. For example, the mass spectrum might contain a peak at m = 15, corresponding to the mass of a methyl (CH3) fragment.
The naive version of the program generated all possible structures consistent with the formula,
and then predicted what mass spectrum would be observed for each, comparing this with the actual
spectrum. As one might expect, this is intractable for even moderate-sized molecules. The DENDRAL
researchers consulted analytical chemists and found that they worked by looking for well-known
patterns of peaks in the spectrum that suggested common substructures in the molecule. For example,
the following rule is used to recognize a ketone (C=O) subgroup (which weighs 28):
If there are two peaks at x1 and x2 such that
(a) x1 + x2 = M + 28 (M is the mass of the whole molecule);
(b) x1 − 28 is a high peak;
(c) x2 − 28 is a high peak;
(d) at least one of x1 and x2 is high,
then there is a ketone subgroup.
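A minimal sketch of how such a peak-pattern rule could be checked against a spectrum (the data structure and the "high peak" threshold are illustrative assumptions, not DENDRAL's actual implementation):

```python
# Illustrative check for the ketone rule above.
# spectrum: mapping from fragment mass to peak intensity (hypothetical units).

def is_high(spectrum, mass, threshold=10.0):
    """A peak counts as 'high' if its intensity exceeds an assumed threshold."""
    return spectrum.get(mass, 0.0) > threshold

def suggests_ketone(spectrum, molecule_mass):
    """Return True if some pair of peaks x1, x2 satisfies conditions (a)-(d)."""
    peaks = list(spectrum)
    for x1 in peaks:
        for x2 in peaks:
            if (x1 + x2 == molecule_mass + 28                 # condition (a)
                    and is_high(spectrum, x1 - 28)            # condition (b)
                    and is_high(spectrum, x2 - 28)            # condition (c)
                    and (is_high(spectrum, x1) or is_high(spectrum, x2))):   # condition (d)
                return True
    return False

# Hypothetical spectrum for a molecule of mass 100: the peaks at 58 and 70
# satisfy (a) 58 + 70 = 128, (b) 30 is high, (c) 42 is high, (d) 58 is high.
spectrum = {58: 25.0, 70: 8.0, 30: 15.0, 42: 12.0}
print(suggests_ketone(spectrum, molecule_mass=100))   # True
```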
Recognizing that the molecule contains a particular substructure reduces the number of possible
candidates enormously. DENDRAL was powerful because
All the relevant theoretical knowledge to solve these problems has been mapped over from its general
form in the [spectrum prediction component] (“first principles”) to efficient special forms (“cookbook
recipes”). (Feigenbaum et al., 1971)
The significance of DENDRAL was that it was the first successful knowledge-intensive system: its expertise derived from large numbers of special-purpose rules. Later systems also incorporated the main theme of McCarthy’s Advice Taker approach—the clean separation of the knowledge (in the form of rules) from the reasoning component.
With this lesson in mind, Feigenbaum and others at Stanford began the Heuristic Programming Project (HPP) to investigate the extent to which the new methodology of expert systems could be applied to other areas of human expertise. The next major effort was in the area of medical diagnosis. Feigenbaum, Buchanan, and Dr. Edward Shortliffe developed MYCIN to diagnose blood infections. With about 450 rules, MYCIN was able to perform as well as some experts, and considerably better than junior doctors. It also contained two major differences from DENDRAL. First, unlike the DENDRAL rules, no general theoretical model existed from which the MYCIN rules could be deduced. They had to be acquired from extensive interviewing of experts, who in turn acquired them from textbooks, other experts, and direct experience of cases. Second, the rules had to reflect the uncertainty associated with medical knowledge. MYCIN incorporated a calculus of uncertainty called certainty factors, which seemed (at the time) to fit well with how doctors assessed the impact of evidence on the diagnosis.
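To give the flavor of certainty-factor reasoning, here is a minimal sketch of the commonly cited rule for combining two positive certainty factors for the same conclusion (MYCIN's full calculus also handles negative and mixed evidence):

```python
# Combining two positive certainty factors supporting the same hypothesis:
#   cf = cf1 + cf2 * (1 - cf1)
def combine_positive_cfs(cf1, cf2):
    """Combine two certainty factors in (0, 1] that support the same conclusion."""
    return cf1 + cf2 * (1 - cf1)

# Two pieces of evidence each lend partial support to a diagnosis.
print(combine_positive_cfs(0.4, 0.6))   # 0.76 -- more certain than either factor alone
```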
The importance of domain knowledge was also apparent in the area of understanding natural
language. Although Winograd’s SHRDLU system for understanding natural language had engendered a
good deal of excitement, its dependence on syntactic analysis caused some of the same problems as
occurred in the early machine translation work. It was able to overcome ambiguity and understand
pronoun references, but this was mainly because it was designed specifically for one area—the blocks
world. Several researchers, including Eugene Charniak, a fellow graduate student of Winograd’s at MIT,
suggested that robust language understanding would require general knowledge about the world and a
general method for using that knowledge.
At Yale, linguist-turned-AI-researcher Roger Schank emphasized this point, claiming, “There is no such thing as syntax,” which upset a lot of linguists but did serve to start a useful discussion. Schank and his students built a series of programs (Schank and Abelson, 1977; Wilensky, 1978; Schank and Riesbeck, 1981; Dyer, 1983) that all had the task of understanding natural language. The emphasis, however, was less on language per se and more on the problems of representing and reasoning with the knowledge required for language understanding. The problems included representing stereotypical situations (Cullingford, 1981), describing human memory organization (Rieger, 1976; Kolodner, 1983), and understanding plans and goals (Wilensky, 1983).
The widespread growth of applications to real-world problems caused a concurrent increase in the demands for workable knowledge representation schemes. A large number of different representation and reasoning languages were developed. Some were based on logic—for example, the Prolog language became popular in Europe, and the PLANNER family in the United States. Others, following Minsky’s idea of frames (1975), adopted a more structured approach, assembling facts about particular object and event types and arranging the types into a large taxonomic hierarchy analogous to a biological taxonomy.
6. AI becomes an industry (1980–present)
The first successful commercial expert system, R1, began operation at the Digital Equipment Corporation (McDermott, 1982). The program helped configure orders for new computer systems; by 1986, it was saving the company an estimated $40 million a year. By 1988, DEC’s AI group had 40 expert systems deployed, with more on the way. DuPont had 100 in use and 500 in development, saving an estimated $10 million a year. Nearly every major U.S. corporation had its own AI group and was either using or investigating expert systems.
In 1981, the Japanese announced the “Fifth Generation” project, a 10-year plan to build intelligent computers running Prolog. In response, the United States formed the Microelectronics and Computer Technology Corporation (MCC) as a research consortium designed to assure national competitiveness. In both cases, AI was part of a broad effort, including chip design and human-interface research. In Britain, the Alvey report reinstated the funding that was cut by the Lighthill report. In all three countries, however, the projects never met their ambitious goals.
Overall, the AI industry boomed from a few million dollars in 1980 to billions of dollars in 1988,
including hundreds of companies building expert systems, vision systems, robots, and software and
hardware specialized for these purposes. Soon after that came a period called the “AI Winter,” in which
many companies fell by the wayside as they failed to deliver on extravagant promises.
8. AI adopts the scientific method (1987–present)
AI was founded in part as a rebellion against the limitations of existing fields like control theory
and statistics, but now it is embracing those fields. As David McAllester (1998) put it:
In the early period of AI it seemed plausible that new forms of symbolic computation, e.g., frames and
semantic networks, made much of classical theory obsolete. This led to a form of isolationism in which
AI became largely separated from the rest of computer science. This isolationism is currently being
abandoned. There is a recognition that machine learning should not be isolated from information theory,
that uncertain reasoning should not be isolated from stochastic modeling, that search should not be
isolated from classical optimization and control, and that automated reasoning should not be isolated from
formal methods and static analysis.
In terms of methodology, AI has finally come firmly under the scientific method. To be accepted, hypotheses must be subjected to rigorous empirical experiments, and the results must be analyzed statistically for their importance (Cohen, 1995). It is now possible to replicate experiments by using shared repositories of test data and code.
The field of speech recognition illustrates the pattern. In the 1970s, a wide variety of different
architectures and approaches were tried. Many of these were rather ad hoc and fragile, and were
demonstrated on only a few specially selected examples. In recent years, approaches based on hidden
Markov models (HMMs) have come to dominate the area. Two aspects of HMMs are relevant. First, they
are based on a rigorous mathematical theory. This has allowed speech researchers to build on several
decades of mathematical results developed in other fields. Second, they are generated by a process of
training on a large corpus of real speech data. This ensures that the performance is robust, and in rigorous
blind tests the HMMs have been improving their scores steadily. Speech technology and the related field
of handwritten character recognition are already making the transition to widespread industrial and
consumer applications. Note that there is no scientific claim that humans use HMMs to recognize speech;
rather, HMMs provide a mathematical framework for understanding the problem and support the
engineering claim that they work well in practice.
Machine translation follows the same course as speech recognition. In the 1950s there was initial
enthusiasm for an approach based on sequences of words, with models learned according to the principles
of information theory. That approach fell out of favor in the 1960s, but returned in the late 1990s and now
dominates the field.
Neural networks also fit this trend. Much of the work on neural nets in the 1980s was done in an
attempt to scope out what could be done and to learn how neural nets differ from “traditional” techniques.
Using improved methodology and theoretical frameworks, the field arrived at an understanding in which
neural nets can now be compared with corresponding techniques from statistics, pattern recognition, and
machine learning, and the most promising technique can be applied to each application. As a result of
these developments, so-called data mining technology has spawned a vigorous new industry.
Judea Pearl’s (1988) Probabilistic Reasoning in Intelligent Systems led to a new acceptance of probability and decision theory in AI, following a resurgence of interest epitomized by Peter Cheeseman’s (1985) article “In Defense of Probability.” The Bayesian network formalism was invented to allow efficient representation of, and rigorous reasoning with, uncertain knowledge. This approach largely overcomes many problems of the probabilistic reasoning systems of the 1960s and 1970s; it now dominates AI research on uncertain reasoning and expert systems. The approach allows for learning from experience, and it combines the best of classical AI and neural nets. Work by Judea Pearl (1982a) and by Eric Horvitz and David Heckerman (Horvitz and Heckerman, 1986; Horvitz et al., 1986) promoted the idea of normative expert systems: ones that act rationally according to the laws of decision theory and do not try to imitate the thought steps of human experts. The Windows™ operating system includes several normative diagnostic expert systems for correcting problems.
Similar gentle revolutions have occurred in robotics, computer vision, and knowledge representation. A better understanding of the problems and their complexity properties, combined with increased mathematical sophistication, has led to workable research agendas and robust methods. Although increased formalization and specialization led fields such as vision and robotics to become somewhat isolated from “mainstream” AI in the 1990s, this trend has reversed in recent years as tools from machine learning in particular have proved effective for many problems. The process of reintegration is already yielding significant benefits.
9. The emergence of intelligent agents (1995–present)
Perhaps encouraged by the progress in solving the subproblems of AI, researchers have also started to
look at the “whole agent” problem again. The work of Allen Newell, John Laird, and Paul Rosenbloom
on SOAR (Newell, 1990; Laird et al., 1987) is the best-known example of a complete agent architecture.
One of the most important environments for intelligent agents is the Internet. AI systems have become so
common in Web-based applications that the “-bot” suffix has entered everyday language. Moreover, AI
technologies underlie many Internet tools, such as search engines, recommender systems, and Web site
aggregators.
One consequence of trying to build complete agents is the realization that the previously isolated
subfields of AI might need to be reorganized somewhat when their results are to be tied together. In
particular, it is now widely appreciated that sensory systems (vision, sonar, speech recognition, etc.)
cannot deliver perfectly reliable information about the environment. Hence, reasoning and planning
systems must be able to handle uncertainty. A second major consequence of the agent perspective is that
AI has been drawn into much closer contact with other fields, such as control theory and economics, that
also deal with agents. Recent progress in the control of robotic cars has derived from a mixture of
approaches ranging from better sensors, control-theoretic integration of sensing, localization and
mapping, as well as a degree of high-level planning.
Despite these successes, some influential founders of AI, including John McCarthy (2007), Marvin Minsky (2007), Nils Nilsson (1995, 2005) and Patrick Winston (Beal and Winston, 2009), have expressed discontent with the progress of AI. They think that AI should put less emphasis on creating ever-improved versions of applications that are good at a specific task, such as driving a car, playing chess, or recognizing speech. Instead, they believe AI should return to its roots of striving for, in Simon’s words, “Machines that think, that learn and that create.” They call the effort human-level AI or HLAI; their first symposium was in 2004 (Minsky et al., 2004). The effort will require very large knowledge bases; Hendler et al. (1995) discuss where these knowledge bases might come from.
A related idea is the subfield of Artificial General Intelligence or AGI (Goertzel and Pennachin, 2007), which held its first conference and organized the Journal of Artificial General Intelligence in 2008. AGI looks for a universal algorithm for learning and acting in any environment, and has its roots in the work of Ray Solomonoff (1964), one of the attendees of the original 1956 Dartmouth conference. Guaranteeing that what we create is really Friendly AI is also a concern (Yudkowsky, 2008; Omohundro, 2008).
10. The availability of very large data sets (2001–present)
Throughout the 60-year history of computer science, the emphasis has been on the algorithm as the main
subject of study. But some recent work in AI suggests that for many problems, it makes more sense to
worry about the data and be less picky about what algorithm to apply. This is true because of the
increasing availability of very large data sources: for example, trillions of words of English and billions of
images from the Web (Kilgarriff and Grefenstette, 2006); or billions of base pairs of genomic sequences
(Collins et al., 2003).
One influential paper in this line was Yarowsky’s (1995) work on word-sense disam- biguation:
given the use of the word “plant” in a sentence, does that refer to flora or factory? Previous approaches to
the problem had relied on human-labeled examples combined with machine learning algorithms.
Yarowsky showed that the task can be done, with accuracy above 96%, with no labeled examples at all.
Instead, given a very large corpus of unannotated text and just the dictionary definitions of the two
senses—“works, industrial plant” and “flora, plant life”—one can label examples in the corpus, and from
there bootstrap to learn new patterns that help label new examples. Banko and Brill (2001) show that
techniques like this perform even better as the amount of available text goes from a million words to a
billion and that the increase in performance from using more data exceeds any difference in algorithm
choice; a mediocre algorithm with 100 million words of unlabeled training data outperforms the best
known algorithm with 1 million words.
As another example, Hays and Efros (2007) discuss the problem of filling in holes in a
photograph. Suppose you use Photoshop to mask out an ex-friend from a group photo, but now you need
to fill in the masked area with something that matches the background. Hays and Efros defined an
algorithm that searches through a collection of photos to find something that will match. They found the
performance of their algorithm was poor when they used a collection of only ten thousand photos, but
crossed a threshold into excellent performance when they grew the collection to two million photos.
Work like this suggests that the “knowledge bottleneck” in AI—the problem of how to express all
the knowledge that a system needs—may be solved in many applications by learning methods rather
than hand-coded knowledge engineering, provided the learning algorithms have enough data to go on
(Halevy et al., 2009). Reporters have noticed the surge of new applications and have written that “AI
Winter” may be yielding to a new Spring (Havenstein, 2005). As Kurzweil (2005) writes, “today, many
thousands of AI applications are deeply embedded in the infrastructure of every industry.”
One way to think of structuring these entities is as two levels:
• The knowledge level, at which facts (including each agent’s behaviors and current goals) are described.
• The symbol level, at which representations of objects at the knowledge level are defined in terms of
symbols that can be manipulated by programs.
See Newell [1982] for a detailed exposition of this view in the context of agents and their goals and
behaviors. In the rest of our discussion here, we will follow a model more like the one shown in Fig.
Rather than thinking of one level on top of another, we will focus on facts, on representations, and on the
two-way mappings that must exist between them. We will call these links representation mappings. The
forward representation mapping maps from facts to representations. The backward representation
mapping goes the other way, from representations to facts.
One representation of facts is so common that it deserves special mention: natural language
(particularly English) sentences. Regardless of the representation for facts that we use in a program, we
may also need to be concerned with an English representation of those facts in order to facilitate getting
information into and out of the system. In this case, we must also have mapping functions from English
sentences to the representation we are actually going to use and from it back to sentences. Figure 4.1
shows how these three kinds of objects relate to each other.
Spot has a tail.
Or we could make use of this representation of a new fact to cause us to take some appropriate action or
to derive representations of additional facts.
It is important to keep in mind that usually the available mapping functions are not one-to-one. In
fact, they are often not even functions but rather many-to-many relations. (In other words, each object in
the domain may map to several elements in the range, and several elements in the domain may map to the
same element of the range.) This is particularly true of the mappings involving English representations of
facts. For example, the two sentences “All dogs have tails” and “Every dog has a tail” could both
represent the same fact, namely, that every dog has at least one tail. On the other hand, the former could
represent either the fact that every dog has at least one tail or the fact that each dog has several tails. The
latter may represent either the fact that every dog has at least one tail or the fact that there is a tail that
every dog has. As we will see shortly, when we try to convert English sentences into some other
representation, such as logical propositions, we must first decide what facts the sentences represent and
then convert those facts into the new representation.
The starred links of Fig. are key components of the design of any knowledge-based program. To
see why, we need to understand the role that the internal representation of a fact plays in a program. What
an Al program does is to manipulate the internal representations of the facts it is given. This manipulation
should result in new structures that can also be interpreted as internal representations of facts. More
precisely, these structures should be the internal representations of facts that correspond to the answer to
the problem described by the starting set of facts.
The Mutilated Checker board Problem. Consider a normal checker board from which two squares, in
opposite corners, have been removed. The task is to cover all the remaining squares exactly with
dominoes, each of which covers two squares. No overlapping, either of dominoes on top of each other or
of dominoes over the boundary of the mutilated board is allowed. Can this task be done?
One way to solve this problem is to try to enumerate, exhaustively, all possible tilings to see if
one works. But suppose one wants to be cleverer. Figure shows three ways in which the mutilated checker
board could be represented (to a person). The first representation does not directly suggest the answer to
the problem. The second may; the third does, when combined with the single additional fact that each
domino must cover exactly one white square and one black square. Even for human problem solvers a
representation shift may make an enormous difference in problem-solving effectiveness. Recall that we
saw a slightly less dramatic version of this phenomenon with respect to a problem-solving program,
where we considered two different ways of representing a tic-tac-toe board, one of which was a magic
square.
Fig: Three Representations of a Mutilated Checker board
Fig shows an expanded view of the starred part of Fig. 4.1 The dotted line across the top
represents the abstract reasoning process that a program is intended to model. The solid line across the
bottom represents the concrete reasoning process that a particular program performs. This program
successfully models the abstract process to the extent that, when the backward representation mapping
is applied to the program’s output, the appropriate final facts are actually generated. If either the
program’s operation or one of the representation mappings is not faithful to the problem that is being
modeled, then the final facts will probably not be the desired ones. The key role that is played by the
nature of the representation mapping is apparent from this figure. If no good mapping can be defined for a
problem, then no matter how good the program to solve the problem is, it will not be able to produce
answers that correspond to real answers to the problem.
• Inferential Adequacy — the ability to manipulate the representational structures in such a way as to
derive new structures corresponding to new knowledge inferred from old.
• Inferential Efficiency — the ability to incorporate into the knowledge structure additional information
that can be used to focus the attention of the inference mechanisms in the most promising directions.
• Acquisitional Efficiency — the ability to acquire new information easily. The simplest case involves
direct insertion, by a person, of new knowledge into the database. Ideally, the program itself would be
able to control knowledge acquisition.
Unfortunately, no single system that optimizes all of the capabilities for all kinds of knowledge
has yet been found. As a result, multiple techniques for knowledge representation exist. Many programs
rely on more than one technique. In the chapters that follow, the most important of these techniques are
described in detail. But in this section, we provide a simple, example-based introduction to the important
ideas.
Simple Relational Knowledge
The simplest way to represent declarative facts is as a set of relations of the same sort used in
database systems. Following fig. shows an example of such a relational system.
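As a rough illustration (the figure itself is not reproduced here, and the player facts below are only sample rows), such relational knowledge can be viewed as a table of object-attribute-value triples, queried much as a database table would be:

# Sketch: simple relational knowledge stored as (object, attribute, value) rows.
# The rows are illustrative data, not the contents of the figure.
facts = [
    ("Pee-Wee-Reese", "team", "Brooklyn-Dodgers"),
    ("Pee-Wee-Reese", "bats", "Right"),
    ("Three-Finger-Brown", "team", "Chicago-Cubs"),
]

def lookup(obj, attr):
    # return every stored value of attr for obj (empty list if none is stored)
    return [v for (o, a, v) in facts if o == obj and a == attr]

print(lookup("Pee-Wee-Reese", "team"))   # ['Brooklyn-Dodgers']

By itself this representation supplies no inference; anything not stored explicitly is simply unknown.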
need not be as simple as that shown in our example. In particular, it is possible to augment the basic
representation with inference mechanisms that operate on the structure of the representation. For this to
be effective, the structure must be designed to correspond to the inference mechanisms that are desired.
One of the most useful forms of inference is property inheritance, in which elements of specific classes
inherit attributes and values from more general classes in which they are included.
In order to support property inheritance, objects must be organized into classes and classes must
be arranged in a generalization hierarchy. Following Fig shows some additional baseball knowledge
inserted into a structure that is so arranged. Lines represent attributes. Boxed nodes represent objects and
values of attributes of objects. These values can also be viewed as objects with attributes and values, and
so on. The arrows on the lines point from an object to its value along the corresponding attribute line; the
structure shown in the figure is a slot-and-filler structure. It may also be called a semantic network or a
collection of frames. In the latter case, each individual frame represents the collection of attributes and
values associated with a particular node. Figure 4.6 shows the node for baseball player displayed as a
frame.
term frame system implies somewhat more structure on the attributes and the inference mechanisms that
are available to apply to them than does the term semantic network.
We discuss structures such as these in substantial detail. But to get an idea of how these structures
support inference using the knowledge they contain, we discuss them briefly here. All of the objects and
most of the attributes shown in this example have been chosen to correspond to the baseball domain, and
they have no general significance. The two exceptions to this are the attribute isa, which is being used to
show class inclusion, and the attribute instance, which is being used to show class membership. These
two specific (and generally useful) attributes provide the basis for property inheritance as an inference
technique. Using this technique, the knowledge base can support retrieval both of facts that have been
explicitly stored and of facts that can be derived from those that are explicitly stored.
An idealized form of the property inheritance algorithm can be stated as follows:
Algorithm: Property Inheritance
To retrieve a value V for attribute A of an instance object O:
1. Find O in the knowledge base.
2. If there is a value there for the attribute A, report that value.
3. Otherwise, see if there is a value for the attribute instance. If not, then fail.
4. Otherwise, move to the node corresponding to that value and look for a value for the attribute A. If
one is found, report it.
5. Otherwise, do until there is no value for the isa attribute or until an answer is found:
(a) Get the value of the isa attribute and move to that node.
(b) See if there is a value for the attribute A. If there is, report it.
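A minimal Python sketch of this idealized algorithm follows. The knowledge base is assumed to be a dictionary of nodes whose slots include the special attributes instance and isa; the node contents are a plausible reconstruction of the baseball example for illustration, not the original figure.

# Sketch of the idealized property-inheritance algorithm.
# kb maps node names to dictionaries of attribute/value pairs.
kb = {
    "Person":             {"handed": "Right"},
    "Adult-Male":         {"isa": "Person", "height": "5-10"},
    "Baseball-Player":    {"isa": "Adult-Male", "height": "6-1"},
    "Pitcher":            {"isa": "Baseball-Player", "batting-average": 0.106},
    "Fielder":            {"isa": "Baseball-Player", "batting-average": 0.262},
    "Three-Finger-Brown": {"instance": "Pitcher"},
    "Pee-Wee-Reese":      {"instance": "Fielder", "team": "Brooklyn-Dodgers"},
}

def get_value(obj, attr):
    node = kb[obj]                            # step 1: find O in the knowledge base
    if attr in node:                          # step 2: a value stored directly on O
        return node[attr]
    if "instance" not in node:                # step 3: nothing to inherit from
        return None
    current = node["instance"]                # step 4: move to the class node
    while current is not None:
        if attr in kb[current]:               # steps 4/5b: report a value if found
            return kb[current][attr]
        current = kb[current].get("isa")      # step 5a: climb the isa hierarchy
    return None

print(get_value("Three-Finger-Brown", "batting-average"))  # 0.106, inherited from Pitcher
print(get_value("Pee-Wee-Reese", "team"))                  # Brooklyn-Dodgers, stored explicitly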
This procedure is simplistic. It does not say what we should do if there is more than one value of
the instance or isa attribute. But it does describe the basic mechanism of inheritance. We can apply this
procedure to our example knowledge base to derive answers to the following queries:
• team(Pee-Wee-Reese) = Brooklyn-Dodgers. This attribute had a value stored explicitly in the
knowledge base.
• batting-average(Three-Finger Brown) = .106. Since there is no value for batting average stored
explicitly for Three Finger Brown, we follow the instance attribute to Pitcher and extract the value stored
there. Now we observe one of the critical characteristics of property inheritance, namely that it may
produce default values that are not guaranteed to be correct but that represent “best guesses” in the face of
a lack of more precise information. In fact, in 1906, Brown’s batting average was .204.
• height (Pee-Wee-Reese) = 6-1. This represents another default inference. Notice here that because we
get to it first, the more specific fact about the height of baseball players overrides a more general fact
about the height of adult males.
• bats(Three-Finger-Brown) = Right. To get a value for the attribute bats required going up the isa
hierarchy to the class Baseball-Player. But what we found there was not a value but a rule for computing
a value. This rule required another value (that for handed) as input. So the entire process must be begun
again recursively to find a value for handed. This time, it is necessary to go all the way up to Person to
discover that the default value for handedness for people is Right. Now the rule for bats can be applied,
producing the result Right. In this case, that turns out to be wrong, since Brown is a switch hitter (i.e., he
can hit both left-and right-handed).
Inferential Knowledge
Property inheritance is a powerful form of inference, but it is not the only useful form. Sometimes all the
power of traditional logic (and sometimes even more than that) is necessary to describe the inferences that
are needed. Fig shows two examples of the use of first-order predicate logic to represent additional
knowledge about baseball.
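The figure is not reproduced here, but as a purely illustrative example of the kind of rule it might contain (the predicate names are assumptions), inferential knowledge of this sort can be written in first-order logic as
∀x, y : Batter(x) ∧ batted(x, y) ∧ Infield-Fly(y) → Out(x)
read as “any batter who hits a ball ruled an infield fly is out.” A theorem prover can then combine such rules with stored facts to derive new facts.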
straightforward way as the representation of Fig. does. The LISP representation is slightly more powerful
since it makes explicit use of the name of the node whose value for handed is to be found. But if this
matters, the simpler representation can be augmented to do this as well.
• Are any attributes of objects so basic that they occur in almost every problem domain? If there
are, we need to make sure that they are handled appropriately in each of the mechanisms we
propose. If such attributes exist, what are they?
• Are there any important relationships that exist among attributes of objects?
• At what level should knowledge be represented? Is there a good set of primitives into which
all knowledge can be broken down? Is it helpful to use such primitives?
• How should sets of objects be represented?
• Given a large amount of knowledge stored in a database, how can relevant parts be accessed
when they are needed?
We will talk about each of these questions briefly in the next five sections.
1. Important Attributes
There are two attributes that are of very general significance, and we have already seen their use: instance
and Isa. These attributes are important because they support property inheritance. They are called a
variety of things in Al systems, but the names do not matter. What does matter is that they represent class
membership and class inclusion and that class inclusion is transitive. In slot-and-filler systems, these
attributes are usually represented explicitly in a way much like that shown in Figures. In logic-based
systems, these relationships may be represented this way or they may be represented implicitly by a set of
predicates describing particular classes.
2. Relationships among Attributes
The attributes that we use to describe objects are themselves entities that we represent. What properties do
they have independent of the specific knowledge they encode? There are four such properties that deserve
mention here:
• Inverses
• Existence in an isa hierarchy
• Techniques for reasoning about values
• Single-valued attributes
Inverses
Entities in the world are related to each other in many different ways. But as soon as we decide to
describe those relationships as attributes, we commit to a perspective in which we focus on one object and
look for binary relationships between it and others. Attributes are those relationships. So, for example, in
Fig., we used the attributes instance, isa and team. Each of these was shown in the figure with a directed
arrow, originating at the object that was being described and terminating at the object representing the
value of the specified attribute. But we could equally well have focused on the object representing the
value. If we do that, then there is still a relationship between the two entities, although it is a different one
since the original relationship was not symmetric (although some relationships, such as sibling, are). In
many cases, it is important to represent this other view of relationships. There are two good ways to do
this.
The first is to represent both relationships in a single representation that ignores focus. Logical
representations are usually interpreted as doing this. For example, the assertion:
team(Pee-Wee-Reese, Brooklyn-Dodgers)
can equally easily be interpreted as a statement about Pee Wee Reese or about the Brooklyn Dodgers.
How it is actually used depends on the other assertions that a system contains.
The second approach is to use attributes that focus on a single entity but to use them in pairs, one
the inverse of the other. In this approach, we would represent the team information with two attributes:
• One associated with Pee Wee Reese:
team = Brooklyn-Dodgers
• One associated with Brooklyn Dodgers:
team-members = Pee-Wee-Reese,...
This is the approach that is taken in semantic net and frame-based systems. When it is used, it is usually
accompanied by a knowledge acquisition tool that guarantees the consistency of inverse slots by forcing
them to be declared and then checking each time a value is added to one attribute that the corresponding
value is added to the inverse.
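A minimal sketch of how such a tool might enforce inverse-slot consistency; the add_value helper and the inverses table are illustrative assumptions, not part of any particular system.

# Sketch: keeping declared inverse slots consistent when a value is added.
inverses = {"team": "team-members", "team-members": "team"}

kb = {}  # maps object name -> {attribute: set of values}

def add_value(obj, attr, value):
    kb.setdefault(obj, {}).setdefault(attr, set()).add(value)
    inv = inverses.get(attr)
    if inv is not None:
        # automatically record the corresponding inverse fact on the value object
        kb.setdefault(value, {}).setdefault(inv, set()).add(obj)

add_value("Pee-Wee-Reese", "team", "Brooklyn-Dodgers")
print(kb["Brooklyn-Dodgers"]["team-members"])   # {'Pee-Wee-Reese'}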
An Isa Hierarchy of Attributes
Just as there are classes of objects and specialized subsets of those classes, there are attributes and
specializations of attributes. Consider, for example, the attribute height. It is actually a specialization of
the more general attribute physical-size which is, in turn, a specialization of physical-attribute. These
generalization-specialization relationships are important for attributes for the same reason that they are
important for other concepts—they support inheritance. In the case of attributes, they support inheriting
information about such things as constraints on the values that the attribute can have and mechanisms for
computing those values.
Techniques for Reasoning about Values
Sometimes values of attributes are specified explicitly when a knowledge base is created. We saw several
examples of that in the baseball example of Fig. But often the reasoning system must reason about values
it has not been given explicitly. Several kinds of information can play a role in this reasoning, including:
• Information about the type of the value. For example, the value of height must be a number measured
in a unit of length.
• Constraints on the value often stated in terms of related entities. For example, the age of a person
cannot be greater than the age of either of that person’s parents.
• Rules for computing the value when it is needed. We showed an example of such a rule in Fig. for the
bats attribute. These rules are called backward rules. Such rules have also been called if-needed rules.
• Rules that describe actions that should be taken if a value ever becomes known. These rules are called
forward rules, or sometimes if-added rules.
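A minimal sketch of the last two kinds of information, using hypothetical callbacks: an if-needed (backward) rule computes a value only when it is requested, and an if-added (forward) rule fires as soon as a value is asserted. The rule contents below are illustrative only.

# Sketch: if-needed (backward) and if-added (forward) rules as callbacks.
values = {}        # (object, attribute) -> value
if_needed = {}     # attribute -> function(object) that computes a value on demand
if_added = {}      # attribute -> list of functions(object, value) run on assertion

def get(obj, attr):
    if (obj, attr) in values:
        return values[(obj, attr)]
    rule = if_needed.get(attr)            # backward rule: compute only when needed
    return rule(obj) if rule else None

def put(obj, attr, value):
    values[(obj, attr)] = value
    for rule in if_added.get(attr, []):   # forward rules: react to newly added knowledge
        rule(obj, value)

# Illustrative rules: bats defaults to handed; asserting handed triggers a note.
if_needed["bats"] = lambda obj: get(obj, "handed")
if_added["handed"] = [lambda obj, v: print("noted:", obj, "is", v, "handed")]

put("Three-Finger-Brown", "handed", "Right")
print(get("Three-Finger-Brown", "bats"))   # Right, computed when needed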
Single-Valued Attributes
A specific but very useful kind of attribute is one that is guaranteed to take a unique value. For example, a
baseball player can, at any one time, have only a single height and be a member of only one team. If there
is already a value present for one of these attributes and a different value is asserted, then one of two
things has happened. Either a change has occurred in the world or there is now a contradiction in the
knowledge base that needs to be resolved. Knowledge-representation systems have taken several different
approaches to providing support for single-valued attributes, including:
• Introduce an explicit notation for temporal interval. If two different values are ever asserted for the
same temporal interval, signal a contradiction automatically.
• Assume that the only temporal interval that is of interest is now. So if a new value is asserted, replace
the old value.
• Provide no explicit support. Logic-based systems are in this category. But in these systems, knowledge
base builders can add axioms that state that if an attribute has one value then it is known not to have all
other values.
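A minimal sketch of handling a single-valued attribute, combining the second and third ideas above: when a different value is asserted, either replace the old value (treating “now” as the only interval of interest) or signal the contradiction. The helper name and the sample assertions are assumptions.

# Sketch: supporting a single-valued attribute such as team or height.
kb = {}

def assert_single(obj, attr, value, on_conflict="signal"):
    old = kb.get((obj, attr))
    if old is not None and old != value:
        if on_conflict == "replace":      # assume the world has changed
            kb[(obj, attr)] = value
        else:                             # flag the contradiction for resolution
            raise ValueError(f"contradiction: {obj}.{attr} is {old}, asserted {value}")
    else:
        kb[(obj, attr)] = value

assert_single("Pee-Wee-Reese", "team", "Brooklyn-Dodgers")
assert_single("Pee-Wee-Reese", "team", "Boston-Red-Sox", on_conflict="replace")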
3. Choosing the Granularity of Representation
Regardless of the particular representation formalism we choose, it is necessary to answer the question
“At what level of detail should the world be represented?” Another way this question is often phrased is
“What should be our primitives?” Should there be a small number of low-level ones or should there be a
larger number covering a range of granularities? A brief example illustrates the problem. Suppose we are
interested in the following fact:
John spotted Sue.
We could represent this as
spotted(agent(John),
        object(Sue))
Such a representation would make it easy to answer questions such as:
Who spotted Sue?
But now suppose we want to know:
Did John see Sue?
The obvious answer is “yes,” but given only the one fact we have, we cannot discover that answer. We
could, of course, add other facts, such as
spotted(x, y) → saw(x, y)
We could then infer the answer to the question.
An alternative solution to this problem is to represent the fact that spotting is really a special type of
seeing explicitly in the representation of the fact. We might write something such as
saw(agent(John),
    object(Sue),
    timespan(briefly))
In this representation, we have broken the idea of spotting apart into more primitive concepts of
seeing and timespan. Using this representation, the fact that John saw Sue is immediately accessible. But
the fact that he spotted her is more difficult to get to.
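A small sketch contrasting the two options just described, with illustrative names: keep the high-level fact together with a rule relating it to the primitive, or decompose the fact into the primitive when it is stored.

# Option 1: store the high-level fact plus a rule relating it to the primitive.
facts = {("spotted", "John", "Sue")}

def saw(x, y):
    # spotted(x, y) -> saw(x, y)
    return ("saw", x, y) in facts or ("spotted", x, y) in facts

# Option 2: decompose into the primitive (seeing plus a short timespan) at storage time.
facts_primitive = {("saw", "John", "Sue", "briefly")}

print(saw("John", "Sue"))   # True under option 1; a direct lookup under option 2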
The major advantage of converting all statements into a representation in terms of a small set of
primitives is that the rules that are used to derive inferences from that knowledge need be written only in
terms of the primitives rather than in terms of the many ways in which the knowledge may originally
have appeared. Thus what is really being argued for is simply some sort of canonical form. Several Al
programs, including those described by Schank and Abelson [1977] and Wilks [1972], are based on
knowledge bases described in terms of a small number of low-level primitives.
first went to where Mary was. But suppose we also know that Mary punched John. Then we must also
store the structure shown in Fig. (b). If, however, punching were represented simply as punching, then most
of the detail of both structures could be omitted from the structures themselves. It could instead be stored
just once in a common representation of the concept of punching.
A second but related problem is that if knowledge is initially presented to the system in a
relatively high- level form, such as English, then substantial work must be done to reduce the knowledge
into primitive form.
Yet, for many purposes, this detailed primitive representation may be unnecessary. Both in
understanding language and in interpreting the world that we see, many things appear that later turn out to
be irrelevant. For the sake of efficiency, it may be desirable to store these things at a very high level and
then to analyze in detail only those inputs that appear to be important.
A third problem with the use of low-level primitives is that in many domains, it is not at all clear
what the primitives should be. And even in domains in which there may be an obvious set of primitives,
there may not be enough information present in each use of the high-level constructs to enable them to be
converted into their primitive components. When this is true, there is no way to avoid representing facts at
a variety of granularities.
The classical example of this sort of situation is provided by kinship terminology [Lindsay,
1963]. There exists at least one obvious set of primitives: mother, father, son, daughter, and possibly
brother and sister. But now suppose we are told that Mary is Sue’s cousin. An attempt to describe the
cousin relationship in terms of the primitives could produce any of the following interpretations:
• Mary = daughter(brother(mother(Sue)))
• Mary = daughter(sister(mother(Sue)))
• Mary = daughter(brother(father(Sue)))
• Mary = daughter(sister(father(Sue)))
If we do not already know that Mary is female, then of course there are four more possibilities as
well. Since in general we may have no way of choosing among these representations, we have no choice
but to represent the fact using the nonprimitive relation cousin.
The other way to solve this problem is to change our primitives. We could use the set:
parent, child, sibling, male, and female. Then the fact that Mary is Sue’s cousin could be represented as
Mary = child(sibling(parent(Sue)))
But now the primitives incorporate some generalizations that may or may not be appropriate. The
main point to be learned from this example is that even in very simple domains, the correct set of
primitives is not obvious.
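A minimal sketch of the second choice of primitives, with made-up family data: once parent, sibling, and child relations are available, cousin can be defined a single time in terms of them.

# Sketch: kinship from the primitives parent / sibling / child (illustrative data).
parents  = {"Sue": {"Ann"}, "Mary": {"Beth"}}    # child -> set of parents
siblings = {"Ann": {"Beth"}, "Beth": {"Ann"}}    # person -> set of siblings
children = {"Ann": {"Sue"}, "Beth": {"Mary"}}    # parent -> set of children

def cousins(x):
    # child(sibling(parent(x)))
    result = set()
    for p in parents.get(x, ()):
        for s in siblings.get(p, ()):
            result |= children.get(s, set())
    return result

print(cousins("Sue"))   # {'Mary'}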
In less well-structured domains, even more problems arise. For example, given just the fact
John broke the window.
a program would not be able to decide if John’s actions consisted of the primitive sequence:
1. Pick up a hard object.
2. Hurl the object through the window.
or the sequence:
1. Pick up a hard object.
2. Hold onto the object while causing it to crash into the window.
or the single action:
1. Cause hand (or foot) to move fast and crash into the window.
or the single action:
1. Shut the window so hard that the glass breaks.
As these examples have shown, the problem of choosing the correct granularity of representation for a
particular body of knowledge is not easy. Clearly, the lower the level we choose, the less inference
required to reason with it in some cases, but the more inference required to create the representation
from English and the more room it takes to store, since many inferences will be represented many times.
The answer for any particular task domain must come to a large extent from the domain itself—to what
use is the knowledge to be put?
One way of looking at the question of whether there exists a good set of low-level primitives is
that it is a question of the existence of a unique representation. Does there exist a single, canonical way in
which large bodies of knowledge can be represented independently of how they were originally stated?
Another, closely related, uniqueness question asks whether individual objects can be represented uniquely
and independently of how they are described. This issue is raised in the following quotation from Quine
[1961] and discussed in Woods [1975]:
The phrase Evening Star names a certain large physical object of spherical form, which is
hurtling through space some scores of millions of miles from here. The phrase Morning Star names the
same thing, as was probably first established by some observant Babylonian. But the two phrases cannot
be regarded as having the same meaning; otherwise that Babylonian could have dispensed with his
observations and contented himself with reflecting on the meaning of his words. The meanings, then,
being different from one another, must be other than the named object, which is one and the same in both
cases.
In order for a program to be able to reason as did the Babylonian, it must be able to handle
several distinct representations that turn out to stand for the same object.
We discuss the question of the correct granularity of representation, as well as issues involving
redundant storage of information, throughout the next several chapters, particularly in the section on
conceptual dependency, since that theory explicitly proposes that a small set of low-level primitives
should be used for representing actions.
4. Representing Sets of Objects
It is important to be able to represent sets of objects for several reasons. One is that there are some
properties that are true of sets that are not true of the individual members of a set. As examples, consider
the assertions that are being made in the sentences “There are more sheep than people in Australia” and
“English speakers can be found all over the world.” The only way to represent the facts described in these
sentences is to attach assertions to the sets representing people, sheep, and English speakers, since, for
example, no single English speaker can be found all over the world. The other reason that it is important
to be able to represent sets of objects is that if a property is true of all (or even most) elements of a set,
then it is more efficient to associate it once with the set rather than to associate it explicitly with every
element of the set. We have already looked at ways of doing that, both in logical representations through
the use of the universal quantifier and in slot- and-filler structures, where we used nodes to represent sets
and inheritance to propagate set-level assertions down to individuals. As we consider ways to represent
sets, we will want to consider both of these uses of set level representations. We will also need to
remember that the two uses must be kept distinct. Thus if we assert something like large (Elephant), it
must be clear whether we are asserting some property of the set itself (i.e., that the set of elephants is
large) or some property that holds for individual elements of the set (i.e., that anything that is an elephant
is large).
There are three obvious ways in which sets may be represented. The simplest is just by a name.
This is essentially what we did when we used the node named Baseball-Player in our semantic net and
when we used predicates such as Ball and Batter in our logical representation. This simple representation
does make it possible to associate predicates with sets. But it does not, by itself, provide any information
about the set it represents. It does not, for example, tell how to determine whether a particular object is a
member of the set or not.
There are two ways to state a definition of a set and its elements. The first is to list the members.
Such a specification is called an extensional definition. The second is to provide a rule that, when a
particular object is evaluated, returns true or false depending on whether the object is in the set or not.
Such a rule is called an intensional definition. For example, an extensional description of the set of our
sun’s planets on which people live is {Earth}. An intensional description is
{x : sun-planet(x) ∧ human-inhabited(x)}
For simple sets, it may not matter, except possibly with respect to efficiency concerns, which
representation is used. But the two kinds of representations can function differently in some cases.
One way in which extensional and intensional representations differ is that they do not
necessarily correspond one-to-one with each other. For example, the extensionally defined set (Earth) has
many intensional definitions in addition to the one we just gave. Others include:
{x : sun-planet(x) ∧ nth-farthest-from-sun(x, 3)}
{x : sun-planet(x) ∧ nth-biggest(x, 5)}
Thus, while it is trivial to determine whether two sets are identical if extensional descriptions are
used, it may be very difficult to do so using intensional descriptions.
Intensional representations have two important properties that extensional ones lack, however.
The first is that they can be used to describe infinite sets and sets not all of whose elements are explicitly
known. Thus we can describe intensionally such sets as prime numbers (of which there are infinitely
many) or kings of England (even though we do not know who all of them are or even how many of them
there have been). The second thing we can do with intensional descriptions is to allow them to depend on
parameters that can change, such as time or spatial location. If we do that, then the actual set that is
represented by the description will change as a function of the value of those parameters. To see the effect
of this, consider the sentence, “The president of the United States used to be a Democrat,” uttered when
the current president is a Republican. This sentence can mean two things. The first is that the specific
person who is now president was once a Democrat. This meaning can be captured straightforwardly with
an extensional representation of “the president of the United States.” We just specify the individual. But
there is a second meaning, namely that there was once someone who was the president and who was a
Democrat. To represent the meaning of “the president of the United States” given this interpretation
requires an intensional description that depends on time. Thus we might write president(t), where president
is some function that maps instances of time onto instances of people, namely U.S. presidents.
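A minimal sketch of the distinction, with illustrative predicates and an assumed president table: an extensional set is an explicit enumeration, an intensional set is a membership test, and an intensional description can also be parameterized by time.

# Extensional definition: enumerate the members.
inhabited_planets_ext = {"Earth"}

# Intensional definition: a rule that decides membership.
def sun_planet(x): return x in {"Mercury", "Venus", "Earth", "Mars", "Jupiter",
                                "Saturn", "Uranus", "Neptune"}
def human_inhabited(x): return x == "Earth"
def inhabited_planets_int(x): return sun_planet(x) and human_inhabited(x)

# A time-dependent intensional description: president(t), with an illustrative table.
presidents = {1980: "Carter", 1990: "Bush", 2000: "Clinton"}
def president(t): return presidents[t]

print(inhabited_planets_int("Earth"), "Earth" in inhabited_planets_ext)   # True True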
5. Finding the Right Structures as Needed
Recall that in Chapter 2, we briefly touched on the problem of matching rules against state descriptions
during the problem-solving process. This same issue now rears its head with respect to locating
appropriate knowledge structures that have been stored in memory.
For example, suppose we have a script (a description of a class of events in terms of contexts,
participants, and subevents) that describes the typical sequence of events in a restaurant. This script would
enable us to take a text such as
John went to Steak and Ale last night. He ordered a large rare steak, paid his bill, and left.
and answer “yes” to the question
Did John eat dinner last night?
Notice that nowhere in the story was John’s eating anything mentioned explicitly. But the fact
that when one goes to a restaurant one eats will be contained in the restaurant script. If we know in
advance to use the restaurant script, then we can answer the question easily. But in order to be able to
reason about a variety of things, a system must have many scripts for everything from going to work to
sailing around the world. How will it select the appropriate one each time? For example, nowhere in our
story was the word “restaurant” mentioned.
In fact, in order to have access to the right structure for describing a particular situation, it is
necessary to solve all of the following problems.
• How to perform an initial selection of the most appropriate structure.
• How to fill in appropriate details from the current situation.
• How to find a better structure if the one chosen initially turns out not to be appropriate.
• What to do if none of the available structures is appropriate.
• When to create and remember a new structure.
There is no good, general purpose method for solving all these problems. Some
knowledge-representation techniques solve some of them. In this section we survey some solutions to two of these
problems: how to select an initial structure to consider and how to find a better structure if that one turns
out not to be a good match.
Selecting an initial Structure
Selecting candidate knowledge structures to match a particular problem-solving situation is a hard
problem; there are several ways in which it can be done. Three important approaches are the following:
• Index the structures directly by the significant English words that can be used to describe them. For
example, let each verb have associated with it a structure that describes its meaning. This is the approach
taken in conceptual dependency theory, discussed in Chapter 10. Even for selecting simple structures,
such as those representing the meanings of individual words, though, this approach may not be adequate,
since many words may have several distinct meanings. For example, the word “fly” has a different
meaning in each of the following sentences:
- John flew to New York. (He rode in a plane from one place to another.)
- John flew a kite. (He held a kite that was up in the air.)
- John flew down the street. (He moved very rapidly.)
- John flew into a rage. (An idiom)
Another problem with this approach is that it is only useful when there is an English description of the
problem to be solved.
• Consider each major concept as a pointer to all of the structures (such as scripts) in which it might be
involved. This may produce several sets of prospective structures. For example, the concept Steak might
point to two scripts, one for restaurant and one for supermarket. The concept Bill might point to a
restaurant and a shopping script. Take the intersection of those sets to get the structure(s), preferably
precisely one, that involves all the content words. Given the pointers just described and the story about
John’s trip to Steak and Ale, the restaurant script would be evoked. One important problem with this
method is that if the problem description contains any even slightly extraneous concepts, then the
intersection of their associated structures will be empty. This might occur if we had said, for example,
“John rode his bicycle to Steak and Ale last night.” Another problem is that it may require a great deal of
computation to compute all of the possibility sets and then to intersect them. However, if computing such
sets and intersecting them could be done in parallel, then the time required to produce an answer would
be reasonable even if the total number of computations is large. For an exploration of this
parallel approach to clue intersection, see Fahlman [1979]. (A small sketch of the intersection step appears at the end of this section.)
• Locate one major clue in the problem description and use it to select an initial structure. As other clues
appear, use them to refine the initial selection or to make a completely new one if necessary. For a
discussion of this approach, see Charniak [1978]. The major problem with this method is that in some
situations there is not an easily identifiable major clue. A second problem is that it is necessary to
anticipate which clues are going to be important and which are not. But the relative importance of clues
can change dramatically from one situation to another. For example, in many contexts, the color of the
objects involved is not important. But if we are told “The light turned red,” then the color of the light is
the most important feature to consider.
None of these proposals seems to be the complete answer to the problem. It often turns out,
unfortunately, that the more complex the knowledge structures are, the harder it is to tell when a
particular one is appropriate.
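A minimal sketch of the intersection method discussed above; the mapping from concepts to candidate scripts is an illustrative assumption.

# Sketch: selecting a structure by intersecting the scripts each concept points to.
scripts_for = {
    "steak":   {"restaurant", "supermarket"},
    "bill":    {"restaurant", "shopping"},
    "bicycle": {"bike-ride"},
}

def select_structure(concepts):
    candidates = None
    for c in concepts:
        pointed_to = scripts_for.get(c)
        if pointed_to:
            candidates = pointed_to if candidates is None else candidates & pointed_to
    return candidates or set()

print(select_structure(["steak", "bill"]))              # {'restaurant'}
print(select_structure(["bicycle", "steak", "bill"]))   # empty: an extraneous concept kills the intersection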
Revising the Choice When Necessary
Once we find a candidate knowledge structure, we must attempt to do a detailed match of it to the
problem at hand. Depending on the representation we are using, the details of the matching process will
vary. It may require variables to be bound to objects. It may require attributes to have their values
compared. In any case, if values that satisfy the required restrictions as imposed by the knowledge
structure can be found, they are put into the appropriate places in the structure. If no appropriate values
can be found, then a new structure must be selected. The way in which the attempt to instantiate this first
structure failed may provide useful cues as to which one to try next. If, on the other hand, appropriate
values can be found, then the current structure can be taken to be appropriate for describing the current
situation. But, of course, that situation may change. Then information about what happened (for example,
we walked around the room we were looking at) may be useful in selecting a new structure to describe the
revised situation.
As was suggested above, the process of instantiating a structure in a particular situation often
does not proceed smoothly. When the process runs into a snag, though, it is often not necessary to
abandon the effort and start over. Rather, there are a variety of things that can be done:
• Select the fragments of the current structure that do correspond to the situation and match them against
candidate alternatives. Choose the best match. If the current structure was at all close to being
appropriate, much of the work that has been done to build substructures to fit into it will be preserved.
• Make an excuse for the current structure’s failure and continue to use it. For example, a proposed chair
with only three legs might simply be broken. Or there might be another object in front of it which
occludes one leg. Part of the structure should contain information about the features for which it is
acceptable to make excuses. Also, there are general heuristics, such as the fact that a structure is more
likely to be appropriate if a desired feature is missing than if an inappropriate feature is present. For
example, a person with one leg is more plausible than a person with a tail.
• Refer to specific stored links between structures to suggest new directions in which to explore. An
example of this sort of linking among a set of frames is shown in the similarity network shown in Fig.
• If the knowledge structures are stored in an isa hierarchy, traverse upward in it until a structure is found
that is sufficiently general that it does not conflict with the evidence. Either use this structure if it is
specific enough to provide the required knowledge or consider creating a new structure just below the
matching one.
Fig: A Similarity Net
MODULE – II
SEARCH TECHNIQUES
SEARCH TECHNIQUES
The State of the Art of AI
What can AI do today? A concise answer is difficult because there are so many activities in so many
subfields. Here we sample a few applications:
Robotic vehicles: A driverless robotic car named STANLEY sped through the rough terrain of the
Mojave desert at 22 mph, finishing the 132-mile course first to win the 2005 DARPA Grand Challenge.
STANLEY is a Volkswagen Touareg outfitted with cameras, radar, and laser rangefinders to sense the
environment and onboard software to command the steering, braking, and acceleration (Thrun, 2006).
The following year CMU’s BOSS won the Urban Challenge, safely driving in traffic through the streets
of a closed Air Force base, obeying traffic rules and avoiding pedestrians and other vehicles.
Speech recognition: A traveler calling United Airlines to book a flight can have the entire conversation
guided by an automated speech recognition and dialog management system.
Autonomous planning and scheduling: A hundred million miles from Earth, NASA’s Remote Agent
program became the first on-board autonomous planning program to control the scheduling of operations
for a spacecraft (Jonsson et al., 2000). REMOTE AGENT generated plans from high-level goals
specified from the ground and monitored the execution of those plans—detecting, diagnosing, and
recovering from problems as they occurred. Successor program MAPGEN (Al-Chang et al., 2004) plans
the daily operations for NASA’s Mars Exploration Rovers, and MEXAR2 (Cesta et al., 2007) did mission
planning—both logistics and science planning—for the European Space Agency’s Mars Express mission
in 2008.
Game playing: IBM’s DEEP BLUE became the first computer program to defeat the world champion in a
chess match when it bested Garry Kasparov by a score of 3.5 to 2.5 in an exhibition match (Goodman and
Keene, 1997). Kasparov said that he felt a “new kind of intelligence” across the board from him.
Newsweek magazine described the match as “The brain’s last stand.” The value of IBM’s stock increased
by $18 billion. Human champions studied Kasparov’s loss and were able to draw a few matches in
subsequent years, but the most recent human-computer matches have been won convincingly by the
computer.
Spam fighting: Each day, learning algorithms classify over a billion messages as spam, saving the
recipient from having to waste time deleting what, for many users, could comprise 80% or 90% of all
messages, if not classified away by algorithms. Because the spammers are continually updating their
tactics, it is difficult for a static programmed approach to keep up, and learning algorithms work best
(Sahami et al., 1998; Goodman and Heckerman, 2004).
Logistics planning: During the Persian Gulf crisis of 1991, U.S. forces deployed a Dynamic Analysis
and Replanning Tool, DART (Cross and Walker, 1994), to do automated logistics planning and
scheduling for transportation. This involved up to 50,000 vehicles, cargo, and people at a time, and had to
account for starting points, destinations, routes, and conflict resolution among all parameters. The AI
planning techniques generated in hours a plan that would have taken weeks with older methods. The
Defense Advanced Research Project Agency (DARPA) stated that this single application more than paid
back DARPA’s 30-year investment in AI.
Robotics: The iRobot Corporation has sold over two million Roomba robotic vacuum cleaners for home
use. The company also deploys the more rugged PackBot to Iraq and Afghanistan, where it is used to
handle hazardous materials, clear explosives, and identify the location of snipers.
Machine Translation: A computer program automatically translates from Arabic to English, allowing an
English speaker to see the headline “Ardogan Confirms That Turkey Would Not Accept Any Pressure,
Urging Them to Recognize Cyprus.” The program uses a statistical model built from examples of Arabic-
to-English translations and from examples of English text totaling two trillion words (Brants et al., 2007).
None of the computer scientists on the team speak Arabic, but they do understand statistics and machine
learning algorithms.
These are just a few examples of artificial intelligence systems that exist today.
move. There are several ways in which these rules can be written. For example, we could write a rule such
as that shown in the figure.
MODULE – IV
HANDLING UNCERTAINTY
MODULE-IV: HANDLING UNCERTAINTY
Symbolic Reasoning under Uncertainty: Introduction to Non monotonic Reasoning
This module considers some of the problems posed by uncertain, fuzzy, and often changing knowledge. A variety of logical
frameworks and computational methods have been proposed for handling such problems. In this
module, we discuss two approaches:
i. Nonmonotonic reasoning, in which the axioms and/or the rules of inference are extended to make it
possible to reason with incomplete information. These systems preserve, however, the property that, at any
given moment, a statement is either believed to be true, believed to be false, or not believed to be
either.
ii. Statistical reasoning, in which the representation is extended to allow some kind of
numeric measure of certainty (rather than simply TRUE or FALSE) to be associated with each statement.
Other approaches to these issues have also been proposed and used in systems. For example, it is
sometimes the case that there is not a single knowledge base that captures the beliefs of all the agents
involved in solving a problem. This would happen in our murder scenario if we were to attempt to model
the reasoning of Abbott, Babbitt, and Cabot, as well as that of the police investigator. To be able to do
this reasoning, we would require a technique for maintaining several parallel belief spaces, each of
which would correspond to the beliefs of one agent. Such techniques are complicated by the fact that
the belief spaces of the various agents, although not identical, are sufficiently similar that it is
unacceptably inefficient to represent them as completely separate knowledge bases.
Conventional reasoning systems, such as first-order predicate logic, are designed to work with
information that has three important properties:
i. It is complete with respect to the domain of interest. In other words, all the facts that are necessary
to solve a problem are present in the system or can be derived from those that are by the
conventional rules of first-order logic.
ii. It is consistent.
iii. The only way it can change is that new facts can be added as they become available. If these new facts
are consistent with all the other facts that have already been asserted, then nothing will ever be
retracted from the set of facts that are known to be true. This property is called monotonicity.
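Stated formally, monotonicity says that adding new facts can never invalidate an old conclusion: if KB ⊢ P, then KB ∪ {Q} ⊢ P for every additional fact Q. It is exactly this property that nonmonotonic systems give up, so that a conclusion drawn by default can later be withdrawn when more information arrives.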
LOGICS FOR NONMONOTONIC REASONING
Because monotonicity is fundamental to the definition of first-order predicate logic, we are forced to
find some alternative to support nonmonotonic reasoning. In this section, we look at several formal
approaches to doing this. We examine several because no single formalism with all the desired
properties has yet emerged (although there are some attempts, e.g., Shoham and Konolige, to present
a unifying framework for these several theories). In particular, we would like to find a formalism that
does all of the following things:
• Defines the set of possible worlds that could exist given the facts that we do have. More
precisely, we will define an interpretation of a set of wff’s to be a domain (a set of objects)
D, together with a function that assigns: to each predicate, a relation (of corresponding arity); to
each n-ary function, an operator that maps from Dⁿ into D; and to each constant, an element of D.
A model of a set of wff’s is an interpretation that satisfies them. Now we can be more precise about
this requirement. We require a mechanism for defining the set of models of any set of wff’s we are
given.
• Provides a way to say that we prefer to believe in some models rather than others.
• Provides the basis for a practical implementation of this kind of reasoning.
• Corresponds to our intuitions about how this kind of reasoning works. In other words, we
do not want vagaries of syntax to have a significant impact on the conclusions that can be
drawn within our system.
Before we go into specific theories in detail, let’s consider Fig., which shows one way of visualizing
how nonmonotonic reasoning works in all of them. The box labeled A corresponds to an original set of
wff’s. The large circle contains all the models of A. When we add some nonmonotonic reasoning
capabilities to A, we get a new set of wff’s, which we’ve labeled B. B (usually) contains more
information than A does. As a result, fewer models satisfy B than satisfy A. The set of models corresponding
to B is shown at the lower right of the large circle. Now suppose we add some new wff’s (representing
new information) to A. We represent A with these additions as the box C. A difficulty may arise,
however, if the set of models corresponding to C is as shown in the smaller, interior circle, since it is
disjoint with the models for B. In order to find a new set of models that satisfy C, we need to accept
models that had previously been rejected. To do that, we need to eliminate the wff’s that were
responsible for those models being thrown away. This is the essence of nonmonotonic reasoning.
Default Reasoning
We want to use nonmonotonic reasoning to perform what is commonly called default
reasoning. We want to draw conclusions based on what is most likely to be true. In this section, we
discuss two approaches to doing this.
• Nonmonotonic Logic
• Default Logic
Nonmonotonic Logic
One system that provides a basis for default reasoning is Nonmonotonic Logic (NML) [McDermott
and Doyle, 1980], in which the language of first-order predicate logic is augmented with a modal
operator M, which can be read as “is consistent.” For example, the formula
∀x, y : Related(x, y) ∧ M GetAlong(x, y) → WillDefend(x, y)
should be read as, “For all x and y, if x and y are related and if the fact that x gets along with y is
consistent with everything else that is believed, then conclude that x will defend y.”
Once we augment our theory to allow statements of this form, one important issue must be resolved if
we want our theory to be even semidecidable. (Recall that even in a standard first-order theory, the
question of theoremhood is undecidable, so semidecidability is the best we can hope for.) We must
define what “is consistent” means. Because consistency in this system, as in first-order predicate logic,
is undecidable, we need some approximation.
Default Logic
An alternative logic for performing default-based reasoning is Reiter's Default Logic
(DL) [Reiter, 1980], in which a new class of inference rules is introduced. In this approach, we
allow inference rules of the form A : B / C. Such a rule should be read as, “If A is provable and it is consistent
to assume B then conclude C.” As you can see, this is very similar in intent to the nonmonotonic
expressions that we used in NML. There are some important differences between the two theories,
however. The first is that in DL the new inference rules are used as a basis for computing a set of
plausible extensions to the knowledge base. Each extension corresponds to one maximal consistent
augmentation of the knowledge base. The logic then admits as a theorem any expression that is
valid in any extension. If a decision among the extensions is necessary to support problem solving,
some other mechanism must be provided.
IMPLEMENTATION ISSUES
Although the logical frameworks that we have just discussed take us part of the way toward a
basis for implementing nonmonotonic reasoning in problem-solving programs, they are not
enough. As we have seen, they all have some weaknesses as logical systems. In addition, they
fail to deal with four important problems that arise in real systems.
The first is how to derive exactly those nonmonotonic conclusions that are relevant to
solving the problem at hand while not wasting time on those that, while they may be
licensed by the logic, are not necessary and are not worth spending time on.
The second problem is how to update our knowledge incrementally as problem-solving
progresses. The definitions of the logical systems tell us how to decide on the truth status of a
proposition with respect to a given truth status of the rest of the knowledge base.
The third problem is that in nonmonotonic reasoning systems, it often happens that more than
one interpretation of the known facts is licensed by the available inference rules.
The final problem is that, in general, these theories are not computationally
effective. None of them is decidable. Some are semidecidable, but only in their
propositional forms. And none is efficient.
AUGMENTING A PROBLEM-SOLVER
As we have already discussed several times, problem-solving can be done using either forward
or backward reasoning. Problem-solving using uncertain knowledge is no exception. As a result, there
are two basic approaches to this kind of problem-solving (as well as a variety of hybrids):
• Reason forward from what is known. Treat nonmonotonically derivable conclusions the same way
monotonically derivable ones are handled. Nonmonotonic reasoning systems that support this kind
of reasoning allow standard forward-chaining rules to be augmented with unless clauses, which
introduce a basis for reasoning by default (a minimal sketch of such a rule appears after this list).
• Reason backward to determine whether some expression P is true (or perhaps to find a set of
bindings for its variables that make it true). Nonmonotonic reasoning systems that support this kind
of reasoning may do either or both of the following two things:
- Allow default (unless) clauses in backward rules. Resolve conflicts among defaults
using the same control strategy that is used for other kinds of reasoning (usually rule ordering).
— Support a kind of debate in which an attempt is made to construct arguments both in favor of P
and opposed to it. Then some additional knowledge is applied to the arguments to determine
which side has the stronger case.
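A minimal sketch of a forward-chaining rule augmented with an unless clause (the bird/penguin rule is a standard illustration, not taken from the text): the rule fires only if its if-parts are believed and none of its unless-parts are currently believed.

# Sketch: a forward rule with an "unless" clause, interpreted as negation by failure.
facts = {"bird(tweety)"}

rule = {
    "if":     ["bird(tweety)"],
    "unless": ["penguin(tweety)"],
    "then":   "flies(tweety)",
}

def fire(rule, facts):
    # add the conclusion only if the if-parts hold and no unless-part is believed
    if all(p in facts for p in rule["if"]) and not any(u in facts for u in rule["unless"]):
        facts.add(rule["then"])

fire(rule, facts)
print("flies(tweety)" in facts)   # True; asserting penguin(tweety) first would block the rule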
Let’s look at backward reasoning first. We will begin with the simple case of backward reasoning in
which we attempt to prove (and possibly to find bindings for) an expression P. Suppose that we
have a knowledge base that consists of the backward rules shown in Fig. 7.2.
Before we go into detail on how dependency-directed backtracking works, it is worth pointing
out that although one of the big motivations for it is in handling nonmonotonic reasoning, it turns
out to be useful for conventional search programs as well. This is not too surprising when you
consider that what any depth-first search program does is to “make a guess” at something, thus
creating a branch in the search space. If that branch eventually dies out, then we know that at least
one guess that led to it must be wrong. It could be any guess along the branch. In chronological
backtracking we have to assume it was the most recent guess and back up there to try an alternative.
Sometimes, though, we have additional information that tells us which guess caused the problem.
We’d like to retract only that guess and the work that explicitly depended on it, leaving
everything else that has happened in the meantime intact. This is exactly what dependency-directed
backtracking does.
As an example, suppose we want to build a program that generates a solution to a fairly
simple problem, such as finding a time at which three busy people can all attend a meeting. One
way to solve such a problem is first to make an assumption that the meeting will be held on some
particular day, say Wednesday, add to the database an assertion to that effect, suitably tagged as
an assumption, and then proceed to find a time, checking along the way for any inconsistencies
in people’s schedules. If a conflict arises, the statement representing the assumption must be discarded
and replaced by another, hopefully noncontradictory, one. But, of course, any statements that have
been generated along the way that depend on the now-discarded assumption must also be
discarded.
Of course, this kind of situation can be handled by a straightforward tree search with
chronological backtracking. All assumptions, as well as the inferences drawn from them, are
recorded at the search node that created them. When a node is determined to represent a contradiction,
simply backtrack to the next node from which there remain unexplored paths. The assumptions and
their inferences will disappear automatically. The drawback to this approach is illustrated in Fig. 7.4,
which shows part of the search tree of a program that is trying to schedule a meeting.
In order to solve the problem, the system must try to satisfy one constraint at a time. Initially, there
is little reason to choose one alternative over another, so it decides to schedule the meeting on
Wednesday. That creates a new constraint that must be met by the rest of the solution. The assumption
that the meeting will be held on Wednesday is stored at the node it generated. Next the program tries
to select a time at which all participants are available. Among them, they have regularly scheduled
daily meetings at all times except 2:00. So 2:00 is chosen as the meeting time. But it would not have
mattered which day was chosen. Then the program discovers that on Wednesday there are no rooms
available. So it backtracks past the assumption that the day would be Wednesday and tries another
day, Tuesday. Now it must duplicate the chain of reasoning that led it to choose 2:00 as the time
because that reasoning was lost when it backtracked to redo the choice of day. This occurred even
though that reasoning did not depend in any way on the assumption that the day would be
Wednesday. By withdrawing statements based on the order in which they were generated by the
search process rather than on the basis of responsibility for inconsistency, we may waste a great deal
of effort.
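The bookkeeping that dependency-directed backtracking needs can be sketched as follows. This is a simplified illustration, not an actual truth maintenance system, and the day, time, and room facts are invented stand-ins for the meeting example.

# Every derived assertion carries the set of assumptions it rests on, so
# retracting one assumption discards only its dependents.

assertions = {}          # assertion -> frozenset of supporting assumptions

def assert_fact(fact, depends_on=frozenset()):
    assertions[fact] = frozenset(depends_on)

def retract_assumption(assumption):
    """Remove the assumption and everything that depended on it."""
    for fact in list(assertions):
        if assumption in assertions[fact] or fact == assumption:
            del assertions[fact]

# Assume a day, then derive a time that does NOT depend on the day.
assert_fact("day=Wednesday", {"day=Wednesday"})       # tagged as an assumption
assert_fact("time=2:00")                              # derived from the schedules only
assert_fact("room=none-free", {"day=Wednesday"})      # depends on the day

# A contradiction traced to the day assumption retracts only its dependents:
retract_assumption("day=Wednesday")
print(assertions)   # {'time=2:00': frozenset()} -- the 2:00 reasoning is kept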
Logic-Based Truth Maintenance Systems
A logic-based truth maintenance system (LTMS) is very similar to a JTMS. It differs in one important
way. In a JTMS, the nodes in the network are treated as atoms by the TMS, which assumes no
relationships among them except the ones that are explicitly stated in the justifications. In particular, a
JTMS has no problem simultaneously labeling both P and ¬P IN. For example, we could have
represented explicitly both Lies B-I-L and Not Lies B-I-L and labeled both of them IN. No
contradiction will be detected automatically. In an LTMS, on the other hand, a contradiction would be
asserted automatically in such a case. If we had constructed the ABC example in an LTMS system, we
would not have created an explicit contradiction corresponding to the assertion that there was no
suspect. Instead we would replace the contradiction node by one that asserted something like No
Suspect. Then we would assert Suspect. When No Suspect came IN, it would cause a contradiction to
be asserted automatically.
IMPLEMENTATION: BREADTH-FIRST SEARCH
The assumption-based truth maintenance system (ATMS) is an alternative way of implementing
nonmonotonic reasoning. In both JTMS and LTMS systems, a single line of reasoning is pursued at a
time, and dependency-directed backtracking occurs whenever it is necessary to change the system’s
assumptions. In an ATMS, alternative paths are maintained in parallel. Backtracking is avoided at the
expense of maintaining multiple contexts, each of which corresponds to a set of consistent as-
sumptions. As reasoning proceeds in an ATMS-based system, the universe of consistent contexts is
pruned as contradictions are discovered. The remaining consistent contexts are used to label assertions,
thus indicating the contexts in which each assertion has a valid justification. Assertions that do not
have a valid justification in any consistent context can be pruned from consideration by the problem
solver. As the set of consistent contexts gets smaller, so too does the set of assertions that can
consistently be believed by the problem solver. Essentially, an ATMS system works breadth-first,
considering all possible contexts at once, while both JTMS and LTMS systems operate depth-first.
The ATMS is designed to be used in conjunction with a separate problem solver. The problem
solver’s job is to:
• Create nodes that correspond to assertions (both those that are given as axioms and those
that are derived by the problem solver).
• Associate with each such node one or more justifications, each of which describes a reasoning chain
that led to the node.
• Inform the ATMS of inconsistent contexts.
The role of the ATMS system is then to:
• Propagate inconsistencies, thus ruling out contexts that include subcontexts (sets of assertions) that
are known to be inconsistent.
• Label each problem solver node with the contexts in which it has a valid justification. This is done by
combining contexts that correspond to the components of a justification. In particular, given a
justification of the form
A1 ∧ A2 ∧ ... ∧ An → C
assign as a context for the node corresponding to C the intersection of the contexts corresponding to
the nodes A1 through An.
Contexts get eliminated as a result of the problem-solver asserting inconsistencies and the ATMS
propagating them. Nodes get created by the problem-solver to represent possible components of a
problem solution. They may then get pruned from consideration if all their context labels get pruned.
Thus a choice among possible solution components gradually evolves in a process very much like the
constraint satisfaction procedure.
One problem with this approach is that given a set of n assumptions, the number of possible contexts
that may have to be considered is 2^n. Fortunately, in many problem-solving scenarios, most of them
can be pruned without ever looking at them. Further, the ATMS exploits an efficient labeling system
that makes it possible to encode a set of contexts as a single context that delimits the set. To see how
both of these things work, it is necessary to think of the set of contexts that are defined by a set of
assumptions as forming a lattice, as shown for a simple example with four assumptions in Fig. 7.13.
Lines going upward indicate a subset relationship.
The first thing this lattice does for us is to illustrate a simple mechanism by which contradictions
(inconsistent contexts) can be propagated so that large parts of the space of 2^n contexts can be
eliminated. Suppose that the context labeled {A2, A3} is asserted to be inconsistent. Then all contexts
that include it (i.e., those that are above it) must also be inconsistent.
Now consider how a node can be labeled with all the contexts in which it has a valid justification.
Suppose its justification depends on assumption A1. Then the context labeled {A1} and all the contexts
that include it are acceptable. But this can be indicated just by saying {A1}. It is not necessary to
enumerate its supersets. In general, each node will be labeled with the greatest lower bounds of the
contexts in which it should be believed.
Clearly, it is important that this lattice not be built explicitly but only used as an implicit structure as
the ATMS proceeds.
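The two lattice operations just described, ruling out every superset of a nogood and keeping only greatest-lower-bound contexts in a label, can be sketched in Python. The assumption names A1 through A4 and the nogood {A2, A3} follow the four-assumption example; the code itself is only an illustration, not an ATMS implementation.

from itertools import combinations

assumptions = ["A1", "A2", "A3", "A4"]
contexts = [frozenset(c) for r in range(len(assumptions) + 1)
            for c in combinations(assumptions, r)]          # the 2^4 contexts

nogoods = [frozenset({"A2", "A3"})]

def consistent(context):
    # A context is inconsistent if it includes (is a superset of) any nogood.
    return not any(context >= ng for ng in nogoods)

consistent_contexts = [c for c in contexts if consistent(c)]
print(len(contexts), "->", len(consistent_contexts))   # 16 -> 12

def minimal(label_contexts):
    """Keep only greatest-lower-bound contexts; their supersets are implicit."""
    return [c for c in label_contexts
            if not any(other < c for other in label_contexts)]

# A node justified under {A1} alone: every superset of {A1} is implied.
label = minimal([frozenset({"A1"}), frozenset({"A1", "A4"})])
print(label)   # [frozenset({'A1'})]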
As an example of how an ATMS-based problem-solver works, let’s return to the ABC
Murder story. Again, our goal is to find a primary suspect. We need (at least) the following
assumptions:
i. A1. Hotel register was forged.
ii. A2. Hotel register was not forged.
iii. A3. Babbitt’s brother-in-law lied.
iv. A4. Babbitt’s brother-in-law did not lie.
v. A5. Cabot lied.
vi. A6. Cabot did not lie.
vii. A7. Abbott, Babbitt, and Cabot are the only possible suspects.
viii. A8. Abbott, Babbitt, and Cabot are not the only suspects.
The problem-solver could then generate the nodes and associated justifications shown in the first
two columns of Fig. 7.14. In the figure, the justification for a node that corresponds to a decision to
make assumption N is shown as {N}. Justifications for nodes that correspond to the result of
applying reasoning rules are shown as the rule involved. Then the ATMS can assign labels to the
nodes as shown in the second two columns. The first shows the label that would be generated for
each justification taken by itself. The second shows the label (possibly containing multiple contexts)
that is actually assigned to the node given all its current justifications. These columns are identical in
simple cases, but they may differ in more complex situations as we see for nodes 12, 13, and 14 of
our example.
• Nodes may have several justifications if there are several possible reasons for believing them. This
is the case for nodes 12, 13, and 14.
• Recall that when we were using a JTMS, a node was labeled IN if it had at least one valid
justification. Using an ATMS, a node will end up being labeled with a consistent context if it
has at least one justification that can occur in a consistent context.
• The label assignment process is sometimes complicated. We describe it in more detail below.
Suppose that a problem-solving program first created nodes 1 through 14, representing the various
dependencies among them without committing to which of them it currently believes. It can indicate
known contradictions by marking as nogood the context:
• A, B, C are the only suspects; A, B, C are not the only suspects: {A7, A8}
The ATMS would then assign the labels shown in the figure. Let's consider the case of node 12. We
generate four possible labels, one for each justification. But we want to assign to the node a label that
contains just the greatest lower bounds of all the contexts in which it can occur, since they implicitly
encode the superset contexts. The label {A2} is the greatest lower bound of the first, third, and fourth
label, and {A8} is the same for the second label. Thus those two contexts are all that are required as the
label for the node. Now let's consider labeling node 8. Its label must be the union of the labels of nodes
7, 13, and 14. But nodes 13 and 14 have complex labels representing alternative justifications. So we
must consider all ways of combining the labels of all three nodes. Fortunately, some of these
combinations, namely those that contain both A7 and A8, can be eliminated because they are already
known to be contradictory. Thus we are left with a single label as shown.
STATISTICAL REASONING
PROBABILITY AND BAYES' THEOREM
An important goal for many problem-solving systems is to collect evidence as the system goes along
and to modify its behavior on the basis of the evidence. To model this behavior, we need a statistical
theory of evidence. Bayesian statistics is such a theory. The fundamental notion of Bayesian statistics
is that of conditional probability:
P(H|E)
Read this expression as the probability of hypothesis H given that we have observed evidence E. To
compute this, we need to take into account the prior probability of H (the probability that we would
assign to H if we had no evidence) and the extent to which E provides evidence of H. To do this, we
need to define a universe that contains an exhaustive, mutually exclusive set of Hi's, among which we
are trying to discriminate. Then, let
P(Hi|E) = the probability that hypothesis Hi is true given evidence E
P(E|Hi) = the probability that we will observe evidence E given that hypothesis Hi is true
P(Hi) = the a priori probability that hypothesis Hi is true in the absence of any specific evidence. These
probabilities are called prior probabilities or priors.
k = the number of possible hypotheses
Bayes' theorem then states:
P(Hi|E) = P(E|Hi) * P(Hi) / [ P(E|H1) * P(H1) + ... + P(E|Hk) * P(Hk) ]
Suppose, for example, that we are interested in examining the geological evidence at a particular
location to determine whether that would be a good place to dig to find a desired mineral. If we know
the prior probabilities of finding each of the various minerals and we know the probabilities that if a
mineral is present then certain physical characteristics will be observed, then we can use Bayes’
formula to compute, from the evidence we collect, how likely it is that the various minerals are
present. This is, in fact, what is done by the PROSPECTOR program [Duda et al., 1979], which has
been used successfully to help locate deposits of several minerals, including copper and uranium.
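A small numerical sketch of this computation is given below. The three mineral hypotheses and all of the probability values are invented for illustration; only the form of Bayes' theorem is taken from the discussion above.

# Posterior probabilities for competing hypotheses, given one piece of evidence E.

priors = {"copper": 0.2, "uranium": 0.1, "nothing": 0.7}        # P(Hi), assumed
likelihood = {"copper": 0.8, "uranium": 0.5, "nothing": 0.05}   # P(E|Hi), assumed

evidence_prob = sum(likelihood[h] * priors[h] for h in priors)  # P(E)
posterior = {h: likelihood[h] * priors[h] / evidence_prob for h in priors}

for h, p in posterior.items():
    print(f"P({h} | evidence) = {p:.3f}")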
Unfortunately, in an arbitrarily complex world, the size of the set of joint probabilities that we
require in order to compute this function grows as 2^n if there are n different propositions being
considered. This makes using Bayes’ theorem intractable for several reasons:
• The knowledge acquisition problem is insurmountable: too many probabilities have to be provided.
In addition, there is substantial empirical evidence (e.g., Tversky and Kahneman) that people are very
poor probability estimators.
• The space that would be required to store all the probabilities is too large.
• The time required to compute the probabilities is too large.
Despite these problems, though, Bayesian statistics provide an attractive basis for an uncertain
reasoning system. As a result, several mechanisms for exploiting its power while at the same time
making it tractable have been developed. In the rest of this chapter, we explore three of these:
• Attaching certainty factors to rules
• Bayesian networks
• Dempster-Shafer theory
CERTAINTY FACTORS AND RULE-BASED SYSTEMS
In this section we describe one practical way of compromising on a pure Bayesian system. The
approach we discuss was pioneered in the MYCIN system [Shortliffe, 1976; Buchanan and Shortliffe,
1984; Shortliffe and Buchanan, 1975], which attempts to recommend appropriate therapies for patients
with bacterial infections. It interacts with the physician to acquire the clinical data it needs. MYCIN is
an example of an expert system, since it performs a task normally done by a human expert. Here we
concentrate on the use of probabilistic reasoning; Chapter 20 provides a broader view of expert
systems.
MYCIN represents most of its diagnostic knowledge as a set of rules. Each rule has associated
with it a certainty factor, which is a measure of the extent to which the evidence that is described by
the antecedent of the rule supports the conclusion that is given in the rule’s consequent. A typical
MYCIN rule looks like:
MYCIN uses these rules to reason backward to the clinical data available from its goal of finding
significant disease-causing organisms. Once it finds the identities of such organisms, it then attempts
to select a therapy by which the disease(s) may be treated.
The CF's of MYCIN’s rules are provided by the experts who write the rules. They reflect the experts'
assessments of the strength of the evidence in support of the hypothesis. As MYCIN reasons, however,
these CF’s need to be combined to reflect the operation of multiple pieces of evidence and multiple
rules applied to a problem. Figure 8.1 illustrates three combination scenarios that we need to consider.
In Fig. 8.1(a), several rules all provide evidence that relates to a single hypothesis. In Fig. 8.1 (b), we
need to consider our belief in a collection of several propositions taken together. In Fig. 8.1(c), the
output of one rule provides the input to another.
What formulas should be used to perform these combinations? Before we answer that question, we
need first to describe some properties that we would like the combining functions to satisfy:
• Since the order in which evidence is collected is arbitrary, the combining functions should
be commutative and associative.
• Until certainty is reached, additional confirming evidence should increase MB (and similarly
for disconfirming evidence and MD).
• If uncertain inferences are chained together, then the result should be less certain than either of the
inferences alone.
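The following sketch shows combining functions with the three properties just listed, in the style of certainty-factor systems. The actual MYCIN formulas track MB and MD separately and handle negative factors; the simplified forms below, for positive certainty factors only, are given purely as an illustration.

def combine_parallel(cf1, cf2):
    """Two rules supporting the same hypothesis (commutative, associative,
    and monotonically approaching 1 as confirming evidence accumulates)."""
    return cf1 + cf2 * (1 - cf1)

def combine_conjunction(cfs):
    """Belief in a collection of propositions taken together."""
    return min(cfs)

def combine_chain(cf_evidence, cf_rule):
    """Output of one uncertain rule feeding another: the result is weaker
    than either step alone."""
    return cf_rule * max(0.0, cf_evidence)

print(combine_parallel(0.4, 0.3))        # 0.58 -- more than either alone
print(combine_conjunction([0.6, 0.9]))   # 0.6
print(combine_chain(0.58, 0.7))          # ~0.41 -- less certain than either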
BAYESIAN NETWORKS
In this section, we describe an alternative approach, Bayesian networks [Pearl, 1988], in which we
preserve the formalism and rely instead on the modularity of the world we are trying to model. The
main idea is that to describe the real world, it is not necessary to use a huge joint probability table in
which we list the probabilities of all conceivable combinations of events. Most events are
conditionally independent of most other ones, so their interactions need not be considered. Instead, we
can use a more local representation in which we will describe clusters of events that interact.
We used a network notation to describe the various kinds of constraints on likelihoods that
propositions can have on each other. The idea of constraint networks turns out to be very powerful.
We expand on it in this section as a way to represent interactions among events. Let’s return to the
example of the sprinkler, rain, and grass that we introduced in the last section. Figure 8.2(a) shows the
flow of constraints we described in MYCIN-style rules. But recall that the problem that we
encountered with that example was that the constraints flowed incorrectly from “sprinkler on” to
“rained last night.” The problem was that we failed to make a distinction that turned out to be critical.
There are two different ways that propositions can influence the likelihood of each other. The first is
that causes influence the likelihood of their symptoms; the second is that observing a symptom affects
the likelihood of all of its possible causes. The idea behind the Bayesian network structure is to make a
clear distinction between these two kinds of influence.
Specifically, we construct a directed acyclic graph (DAG) that represents causality relationships
among variables. The idea of a causality graph (or network) has proved to be very useful in several
systems, particularly medical diagnosis systems such as CAS- NET [Weiss et al., 1978] and
INTERNIST/CADUCEUS [Pople, 1982]. The variables in such a graph may be propositional (in
which case they can take on the values TRUE and FALSE) or they may be variables that take on
values of some other type (e.g., a specific disease, a body temperature, or a reading taken by some
other diagnostic device). In Fig. 8.2(b), we show a causality graph for the wet grass example. In
addition to the three nodes we have been talking about, the graph contains a new node corresponding
to the propositional variable that tells us whether it is currently the rainy season.
A DAG, such as the one we have just drawn, illustrates the causality relationships that occur
among the nodes it contains. In order to use it as a basis for probabilistic reasoning, however, we need
more information. In particular, we need to know, for each value of a parent node, what evidence is
provided about the values that the child node can take on. We can state this in a table in which the
conditional probabilities are provided. We show such a table for our example in Fig. 8.3. For example,
from the table we see that the prior probability of the rainy season is 0.5. Then, if it is the rainy
season, the probability of rain on a given night is 0.9; if it is not, the probability is only 0.1.
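The way the conditional probability tables attached to the DAG determine a full joint distribution, and hence answers to diagnostic queries, can be sketched as follows. The values P(rainy season) = 0.5 and P(rain | rainy season) = 0.9 versus 0.1 come from the table just described; the sprinkler and wet-grass numbers are invented here purely for illustration.

from itertools import product

p_season = {True: 0.5, False: 0.5}
p_rain = {True: 0.9, False: 0.1}          # keyed by the rainy-season value
p_sprinkler = {True: 0.1, False: 0.4}     # assumed: keyed by the rainy-season value
def p_wet(rain, sprinkler):               # assumed CPT for wet grass
    if rain and sprinkler: return 0.99
    if rain or sprinkler:  return 0.9
    return 0.05

def joint(season, rain, sprinkler, wet):
    """P(season, rain, sprinkler, wet) as a product of local CPT entries."""
    p = p_season[season]
    p *= p_rain[season] if rain else 1 - p_rain[season]
    p *= p_sprinkler[season] if sprinkler else 1 - p_sprinkler[season]
    p *= p_wet(rain, sprinkler) if wet else 1 - p_wet(rain, sprinkler)
    return p

# Diagnostic query: P(rain | wet grass), summing out the other variables.
num = sum(joint(s, True, sp, True) for s, sp in product([True, False], repeat=2))
den = sum(joint(s, r, sp, True) for s, r, sp in product([True, False], repeat=3))
print(f"P(rain | wet grass) = {num / den:.3f}")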
DEMPSTER-SHAFER THEORY
So far, we have described several techniques, all of which consider individual propositions and assign
to each of them a point estimate (i.e., a single number) of the degree of belief that is warranted given
the evidence. In this section, we consider an alternative technique, called Dempster-Shafer theory.
This new approach considers sets of propositions and assigns to each of them an interval
[Belief, Plausibility]
in which the degree of belief must lie. Belief (usually denoted Bel) measures the strength of the
evidence in favor of a set of propositions. It ranges from 0 (indicating no evidence) to 1 (denoting
certainty).
Plausibility (Pl) is defined to be
Pl(s) = 1 - Bel(¬s)
It also ranges from 0 to 1 and measures the extent to which evidence in favor of ¬s leaves room for belief
in s. In particular, if we have certain evidence in favor of ¬s, then Bel(¬s) will be 1 and Pl(s) will be 0.
This tells us that the only possible value for Bel(s) is also 0.
The belief-plausibility interval we have just defined measures not only our level of belief in some
propositions, but also the amount of information we have. Suppose that we are currently considering
three competing hypotheses: A, B, and C. If we have no information, we represent that by saying, for
each of them, that the true likelihood is in the range [0,1]. As evidence is accumulated, this interval
can be expected to shrink, representing increased confidence that we know how likely each hypothesis
is. Note that this contrasts with a pure Bayesian approach, in which we would probably begin by
distributing the prior probability equally among the hypotheses and thus assert for each that P(h) =
0.33. The interval approach makes it clear that we have no information when we start. The Bayesian
approach does not, since we could end up with the same probability values if we collected volumes of
evidence, which taken together suggest that the three values occur equally often. This difference can
matter if one of the decisions that our program needs to make is whether to collect more evidence or to
act on the basis of the evidence it already has.
So far we have talked intuitively about Bel as a measure of our belief in some hypothesis given
some evidence. Let’s now define it more precisely. To do this, we need to start, just as with Bayes’
theorem, with an exhaustive universe of mutually exclusive hypotheses. We’ll call this the frame of
discernment and we'll write it as Θ. For example, in a simplified diagnosis problem, Θ might consist of
the set {All, Flu, Cold, Pneu}:
All: allergy
Flu: flu
Cold: cold
Pneu: pneumonia
Our goal is to attach some measure of belief to elements of Θ. However, not all evidence is
directly supportive of individual elements. Often it supports sets of elements (i.e., subsets of Θ). For
example, in our diagnosis problem, fever might support {Flu, Cold, Pneu}. In addition, since the
elements of Θ are mutually exclusive, evidence in favor of some may have an effect on our belief in the
others. In a purely Bayesian system, we can handle both of these phenomena by listing all of the
combinations of conditional probabilities. But our goal is not to have to do that. Dempster-Shafer
theory lets us handle interactions by manipulating sets of hypotheses directly.
The key function we use is a probability density function, which we denote as m. The function m
is defined not just for elements of Θ but for all subsets of it (including singleton subsets, which
correspond to individual elements). The quantity m(p) measures the amount of belief that is currently
assigned to exactly the set p of hypotheses. If Θ contains n elements, then there are 2^n subsets of Θ.
We must assign m so that the sum of all the m values assigned to the subsets of Θ is 1. Although
dealing with 2^n values may appear intractable, it usually turns out that many of the subsets will never
need to be considered because they have no significance in the problem domain (and so their
associated value of m will be 0).
Let us see how m works for our diagnosis problem. Assume that we have no information about
how to choose among the four hypotheses when we start the diagnosis task. Then we define m as:
m(Θ) = 1.0
All other values of m are thus 0. Although this means that the actual value must be some one
element All, Flu, Cold, or Pneu, we do not have any information that allows us to assign belief in any
other way than to say that we are sure the answer is somewhere in the whole set. Now suppose we
acquire a piece of evidence that suggests (at a level of 0.6) that the correct diagnosis is in the set {Flu,
Cold, Pneu}. Fever might be such a piece of evidence. We update m as follows:
m({Flu, Cold, Pneu}) = 0.6
m(Θ) = 0.4
At this point, we have assigned to the set {Flu, Cold, Pneu} the appropriate belief. The remainder of
our belief still resides in the larger set Θ. Notice that we do not make the commitment that the
remainder must be assigned to the complement of {Flu, Cold, Pneu}.
Having defined m, we can now define Bel(p) for a set p as the sum of the values of m for p and for all
of its subsets. Thus Bel(p) is our overall belief that the correct answer lies somewhere in the set p.
In order to be able to use m (and thus Bel and Pl) in reasoning programs, we need to define functions
that enable us to combine m’s that arise from multiple sources of evidence.
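A small sketch of this machinery is shown below: a mass function m over subsets of the frame of discernment, Bel as a sum over subsets, and Dempster's rule of combination for merging two sources of evidence. The fever evidence (0.6 assigned to {Flu, Cold, Pneu}) follows the example above; the second piece of evidence is invented here for illustration.

THETA = frozenset({"All", "Flu", "Cold", "Pneu"})

def bel(m, p):
    """Belief in set p: the total mass committed to p and its subsets."""
    return sum(mass for subset, mass in m.items() if subset <= p)

def combine(m1, m2):
    """Dempster's rule: intersect focal elements and renormalize the conflict away."""
    combined, conflict = {}, 0.0
    for s1, v1 in m1.items():
        for s2, v2 in m2.items():
            inter = s1 & s2
            if inter:
                combined[inter] = combined.get(inter, 0.0) + v1 * v2
            else:
                conflict += v1 * v2
    return {s: v / (1 - conflict) for s, v in combined.items()}

m_fever = {frozenset({"Flu", "Cold", "Pneu"}): 0.6, THETA: 0.4}
m_other = {frozenset({"All", "Flu", "Cold"}): 0.8, THETA: 0.2}   # assumed evidence

m = combine(m_fever, m_other)
print(bel(m, frozenset({"Flu", "Cold", "Pneu"})))   # belief that it is not an allergy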
FUZZY LOGIC
In the techniques we have discussed so far, we have not modified the mathematical underpinnings
provided by set theory and logic. We have instead augmented those ideas with additional constructs
provided by probability theory. In this section, we take a different approach and briefly consider what
happens if we make fundamental changes to our idea of set membership and corresponding changes to
our definitions of logical operations.
The motivation for fuzzy sets is provided by the need to represent such propositions as:
John is very tall.
Mary is slightly ill.
Sue and Linda are close friends.
Exceptions to the rule are nearly impossible.
Most Frenchmen are not very tall.
While traditional set theory defines set membership as a boolean predicate, fuzzy set theory
allows us to represent set membership as a possibility distribution, such as the ones shown in Fig.
8.4(a) for the set of tall people and the set of very tall people. Notice how this contrasts with the
standard boolean definition for tall people shown in Fig. 8.4(b). In the latter, one is either tall or not
and there must be a specific height that defines the boundary. The same is true for very tall. In the
former, one’s tallness increases with one’s height until the value of 1 is reached.
Once set membership has been redefined in this way, it is possible to define a reasoning system
based on techniques for combining distributions (or see the papers in the journal Fuzzy Sets and
Systems). Such reasoners have been applied in control systems for devices as diverse as trains and
washing machines.
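A minimal sketch of fuzzy membership and the usual fuzzy connectives is given below. The particular membership curve for tall, the squaring convention for very, and the min/max/complement operators are common choices used here only for illustration.

def tall(height_cm):
    """Membership ramps from 0 at 165 cm to 1 at 190 cm (an assumed curve)."""
    return min(1.0, max(0.0, (height_cm - 165) / 25))

def very_tall(height_cm):
    # One common convention: "very" squares the membership value.
    return tall(height_cm) ** 2

def f_not(a):    return 1 - a
def f_and(a, b): return min(a, b)
def f_or(a, b):  return max(a, b)

h = 180
print(f"tall({h}) = {tall(h):.2f}, very tall({h}) = {very_tall(h):.2f}")
print("tall AND NOT very tall:", f_and(tall(h), f_not(very_tall(h))))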
MODULE – IV
PLANNING, LEARNING AND EXPERT
SYSTEMS
If solution steps can be ignored or undone, then the process of planning a complete solution can proceed just as would an attempt
to find a solution by actually trying particular actions. If a dead-end path is detected, then a new one
can be explored by backtracking to the last choice point. So, for example, in solving the 8-puzzle, a
computer could look for a solution plan in the same way as a person who was actually trying to solve
the problem by moving tiles on a board. If solution steps in the real world cannot be ignored or
undone, though, planning becomes extremely important. Although real world steps may be
irrevocable, computer simulation of those steps is not. So we can circumvent the constraints of the real
world by looking for a complete solution in a simulated world in which backtracking is allowed. After
we find a solution, we can execute it in the real world.
But in an unpredictable universe, we cannot know the outcome of a solution step if we are only
simulating it by computer. At best, we can consider the set of possible outcomes, possibly in some
order according to the likelihood of the outcomes occurring. But then when we produce a plan and
attempt to execute it, we must be prepared in case the actual outcome is not what we expected. If the
plan included paths for all possible outcomes of each step, then we can simply traverse the paths that
turn out to be appropriate. But often there are a great many possible outcomes, most of which are
highly unlikely. In such situations, it would be a great waste of effort to formulate plans for all
contingencies.
Instead, we have two choices. We can just take things one step at a time and not really try to plan
ahead. This is the approach that is taken in reactive systems. Our other choice is to produce a plan that
is likely to succeed. But then what should we do if it fails? One possibility is simply to throw away the
rest of the plan and start the planning process over, using the current situation as the new initial state.
Sometimes, this is a reasonable thing to do.
Hardly any aspect of the real world is completely predictable. So we must always be prepared to
have plans fail. But, as we have just seen, if we have built our plan by decomposing our problem into
as many separate (or nearly separate) subproblems as possible, then the impact on our plan of the
failure of one particular step may be quite local. Thus we have an additional argument in favor of the
problem-decomposition approach to problem-solving. In addition to reducing the combinatorial
complexity of the problem-solving process, it also reduces the complexity of the dynamic plan
revision process that may be required during the execution of a plan in an unpredictable world (such as
the one in which we live).
In order to make it easy to patch up plans if they go awry at execution time, we will find that it is
useful during the planning process not only to record the steps that are to be performed but also to
associate with each step the reasons why it must be performed. Then, if a step fails, it is easy, using
techniques for dependency- directed backtracking, to determine which of the remaining parts of the
plan were dependent on it and so may need to be changed. If the plan-generation process proceeds
backward from the desired goal state, then it is easy to record this dependency information. If, on the
other hand, it proceeded forward from the start state, determining the necessary dependencies may be
difficult. For this reason and because, for most problems, the branching factor is smaller going
backward, most planning systems work primarily in a goal-directed mode in which they search
backward from a goal state to an achievable initial state.
AN EXAMPLE DOMAIN: THE BLOCKS WORLD
The techniques we are about to discuss can be applied in a wide variety of task domains, and they
have been. But to make it easy to compare the variety of methods we consider, we will find it useful
to look at all of them in a single domain that is complex enough that the need for each of the
mechanisms is apparent yet simple enough that easy-to- follow examples can be found. The blocks
world is such a domain. There is a flat surface on which blocks can be placed. There are a number of
square blocks, all the same size. They can be stacked one upon another. There is a robot arm that can
manipulate the blocks. The actions it can perform include:
UNSTACK(A, B): Pick up block A from its current position on block B.
STACK(A, B): Place block A on block B.
PICKUP(A): Pick up block A from the table and hold it.
PUTDOWN(A): Put block A down on the table.
Notice that in the world we have described, the robot arm can hold only one block at a time. Also,
since all blocks are the same size, each block can have at most one other block directly on top of it.
In order to specify both the conditions under which an operation may be performed and the
results of performing it, we need to use the following predicates:
ON(A, B): Block A is on block B.
ONTABLE(A): Block A is on the table.
CLEAR(A): Block A has nothing on top of it.
HOLDING(A): The arm is holding block A.
ARMEMPTY: The arm is holding nothing.
Various logical statements are also true in this blocks world, for example:
[∃x : HOLDING(x)] → ¬ARMEMPTY
∀x : ONTABLE(x) → ¬∃y : ON(x, y)
∀x : [¬∃y : ON(y, x)] → CLEAR(x)
The first of these statements says simply that if the arm is holding anything, then it is not empty. The
second says that if a block is on the table, then it is not also on another block. The third says that any
block with no blocks on it is clear.
COMPONENTS OF A PLANNING SYSTEM
In problem-solving systems based on the elementary techniques discussed in Chapter 3, it was
necessary to perform each of the following functions:
• Choose the best rule to apply next, based on the best available heuristic information.
• Apply the chosen rule to compute the new problem state that arises from its application.
• Detect when a solution has been found.
• Detect dead ends so that they can be abandoned and the system's effort directed in more fruitful directions.
In the more complex systems we are about to explore, techniques for doing each of these tasks are also
required. In addition, a fifth operation is often important:
• Detect when an almost correct solution has been found and employ special techniques to make it
totally correct.
Choosing Rules to Apply
The most widely used technique for selecting appropriate rules to apply is first to isolate a set of
differences between the desired goal state and the current state and then to identify those rules that are
relevant to reducing those differences. If several rules are found, a variety of other heuristic
information can be exploited to choose among them. This technique is based on the means-ends
analysis method.
Applying Rules
Each rule simply specified the problem state that would result from its application. Now, however, we
must be able to deal with rules that specify only a small part of the complete problem state. There are
many ways of doing this.
Detecting a Solution
A planning system has succeeded in finding a solution to a problem when it has found a sequence of
operators that transforms the initial problem state into the goal state. The way it can be solved depends
on the way that state descriptions are represented. For any representational scheme that is used, it must
be possible to reason with representations to discover whether one matches another.
One representational technique has served as the basis for many of the planning systems that have
been built. It is predicate logic, which is appealing because of the deductive mechanisms that it
provides. Suppose that, as part of our goal, we have the predicate P(x). To see whether P(x) is satisfied
in some state, we ask whether we can prove P(x) given the assertions that describe that state and the
axioms that define the world model (such as the fact that if the arm is holding something, then it is not
empty). If we can construct such a proof, then the problem-solving process terminates. If we cannot,
then a sequence of operators that might solve the problem must be proposed. This sequence can then
be tested in the same way as the initial state was by asking whether P(x) can be proved from the
axioms and the state description that was derived by applying the operators.
Detecting Dead Ends
A planning system must also be able to recognize when it is exploring a path that can never lead to a solution, for example, when not all of the goals
in a given set can be satisfied at once. For example, the robot arm cannot be both empty and holding a
block. Any path that is attempting to make both of those goals true simultaneously can be pruned
immediately.
Repairing an Almost Correct Solution
The kinds of techniques we are discussing are often useful in solving nearly decomposable
problems. One good way of solving such problems is to assume that they are completely
decomposable, proceed to solve the subproblems separately, and then check that when the
subsolutions are combined, they do in fact yield a solution to the original problem. Of course, if
they do, then nothing more need be done. If they do not, however, there are a variety of things that we
can do. The simplest is just to throw out the solution, look for another one, and hope that it is better.
Although this is simple, it may lead to a great deal of wasted effort.
A slightly better approach is to look at the situation that results when the sequence of operations
corresponding to the proposed solution is executed and to compare that situation to the desired goal. In
most cases, the difference between the two will be smaller than the difference between the initial state
and the goal (assuming that the solution we found did some useful things). Now the problem-solving
system can be called again and asked to find a way of eliminating this new difference. The first
solution can then be combined with this second one to form a solution to the original problem.
GOAL STACK PLANNING
One of the earliest techniques to be developed for solving compound goals that may interact was the
use of a goal stack. This was the approach used by STRIPS. In this method, the problem solver makes
use of a single stack that contains both goals and operators that have been proposed to satisfy those
goals. The problem solver also relies on a database that describes the current situation and a set of
operators described as PRECONDITION, ADD, and DELETE lists. To see how this method works,
let us carry it through for the simple example shown in Fig. 13.4.
When we begin solving this problem, the goal stack is simply
ON(C, A) ∧ ON(B, D) ∧ ONTABLE(A) ∧ ONTABLE(D)
But we want to separate this problem into four subproblems, one for each component of the
original goal. Two of the subproblems, ONTABLE (A) and ONTABLE (D), are already true in the
initial state. So we will work on only the remaining two. Depending on the order in which we want to
tackle the subproblems, there are two goal stacks that could be created as our first step, where each
line represents one goal on the stack and OTAD is an abbreviation for ONTABLE(A) ∧ ONTABLE(D):
At each succeeding step of the problem-solving process, the top goal on the stack will be pursued.
When a sequence of operators that satisfies it is found, that sequence is applied to the state description,
yielding a new description. Next, the goal that is then at the top of the stack is explored and an attempt
is made to satisfy it, starting from the situation that was produced as a result of satisfying the first goal.
This process continues until the goal stack is empty. Then, as one last check, the original goal is
compared to the final state derived from the application of the chosen operators. If any components of
the goal are not satisfied in that state (which they might not be if they were achieved at one point and
then undone later), then those unsolved parts of the goal are reinserted onto the stack and the process
resumed.
To continue with the example we started above, let us assume that we choose first to explore
alternative 1. Alternative 2 will also lead to a solution. In fact, it finds one so trivially that it is not very
interesting. Exploring alternative 1, we first check to see whether ON(C, A) is true in the current state.
Since it is not, we check for operators that could cause it to be true. Of the four operators we are
considering, there is only one, STACK, and it would have to be called with C and A. So we place
STACK(C, A) on the stack in place of ON(C, A), yielding
STACK(C, A) replaced ON(C, A) because after performing the STACK we are guaranteed that ON(C,
A) will hold. But in order to apply STACK(C, A), its preconditions must hold, so we must establish
them as subgoals. Again we must separate a compound goal
CLEAR(A) ∧ HOLDING(C)
into its components and choose an order in which to work on them. At this point, it is useful to exploit
some heuristic knowledge. HOLDING(x) is very easy to achieve. At most, it is necessary to put down
something else and then to pick up the desired object. But HOLDING is also very easy to undo. In
order to do almost anything else, the robot will need to use the arm. So if we achieve HOLDING first
and then try to do something else, we will most likely end up with HOLDING no longer true. So we
exploit the heuristic that if HOLDING is one of several goals to be achieved at once, it should be
tackled last. This produces the new goal stack
CLEAR(A)
HOLDING(C)
CLEAR(A) ∧ HOLDING(C)
STACK(C, A)
ON(B, D)
ON(C, A) ∧ ON(B, D) ∧ OTAD
This kind of heuristic information could be contained in the precondition list itself by stating the
predicates in the order in which they should be achieved.
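The operator machinery that goal stack planning relies on can be sketched as follows: each operator carries PRECONDITION, ADD, and DELETE lists, and applying it to a state of ground predicates deletes and adds the corresponding assertions. The STACK lists follow the usual blocks-world formulation; the Python representation is an illustration, not the STRIPS implementation.

def stack(x, y):
    # STRIPS-style operator: preconditions, add list, delete list.
    return {
        "pre":    {f"CLEAR({y})", f"HOLDING({x})"},
        "add":    {"ARMEMPTY", f"ON({x},{y})"},
        "delete": {f"CLEAR({y})", f"HOLDING({x})"},
    }

def apply_op(state, op):
    # An operator applies only if every precondition is in the current state.
    if not op["pre"] <= state:
        raise ValueError("preconditions not satisfied: " + str(op["pre"] - state))
    return (state - op["delete"]) | op["add"]

state = {"CLEAR(A)", "HOLDING(C)", "ON(B,D)", "ONTABLE(A)", "ONTABLE(D)"}
state = apply_op(state, stack("C", "A"))
print(sorted(state))
# ['ARMEMPTY', 'ON(B,D)', 'ON(C,A)', 'ONTABLE(A)', 'ONTABLE(D)']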
NONLINEAR PLANNING USING CONSTRAINT POSTING
The goal-stack planning method attacks problems involving conjoined goals by solving the goals one
at a time, in order. A plan generated by this method contains a sequence of operators for attaining the
first goal, followed by a complete sequence for the second goal, etc. But as we have seen, difficult
problems cause goal interactions. The operators used to solve one subproblem may interfere with the
solution to a previous subproblem. Most problems require an intertwined plan in which multiple
subproblems are worked on simultaneously. Such a plan is called a nonlinear plan because it is not
composed of a linear sequence of complete subplans.
As an example of the need for a nonlinear plan, let us return to the Sussman anomaly described in Fig.
13.5. A good plan for the solution of this problem is the following:
1. Begin work on the goal ON(A, B) by clearing A, thus putting C on the table.
2. Achieve the goal ON(B, C) by stacking B on C.
3. Complete the goal ON(A, B) by stacking A on B.
This section explores some heuristics and algorithms for tackling nonlinear problems such as this one.
Many ideas about nonlinear planning were present in HACKER, an automatic programming
system. The first true nonlinear planner, though, was NOAH. NOAH was further improved upon by
the NONLIN program. The goal stack algorithm of STRIPS was transformed into a goal set algorithm
by Nilsson. Subsequent planning systems, such as MOLGEN and TWEAK, used constraint posting as
a central technique.
The idea of constraint posting is to build up a plan by incrementally hypothesizing operators, partial
orderings between operators, and bindings of variables within operators. At any given time in the
problem-solving process, we may have a set of useful operators but perhaps no clear idea of how those
operators should be ordered with respect to each other. A solution is a partially ordered, partially
instantiated set of operators; to generate an actual plan, we convert the partial order into any of a
number of total orders. Figure 13.7 shows the difference between the constraint posting method and
the planning methods discussed in earlier sections.
We now examine several operations for nonlinear planning in a constraint-posting environment,
although many of the operations themselves predate the use of the technique in planning. Let us
incrementally generate a nonlinear plan to solve the Sussman anomaly problem. We begin with the
null plan, i.e., a plan with no steps. Next we look at the goal state and posit steps for achieving that
goal. Means-ends analysis tells us to choose two steps with respective postconditions ON (A, B) and
ON (B, C):
Each step is written with its preconditions above it and its postconditions below it. Delete
postconditions are marked with a negation symbol (¬). Notice that, at this point, the steps are not
ordered with respect to each other. All we know is that we want to execute both of them eventually.
Neither can be executed right away because some of their preconditions are not satisfied. An
unachieved precondition is marked with a star (*). Both of the *HOLDING preconditions are
unachieved because the arm holds nothing in the initial problem state.
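The constraint-posting representation of the two steps just introduced can be sketched as follows: steps carry preconditions and postconditions, the set of ordering constraints starts out empty, and the starred preconditions are simply those not true in the initial state. The step contents follow the blocks-world example; the data structures themselves are illustrative assumptions.

initial_state = {"ON(C,A)", "ONTABLE(A)", "ONTABLE(B)",
                 "CLEAR(B)", "CLEAR(C)", "ARMEMPTY"}

steps = {
    "S1": {"pre": {"CLEAR(B)", "HOLDING(A)"}, "post": {"ON(A,B)", "ARMEMPTY"}},
    "S2": {"pre": {"CLEAR(C)", "HOLDING(B)"}, "post": {"ON(B,C)", "ARMEMPTY"}},
}
orderings = set()          # no (before, after) constraints posted yet
bindings = {}              # no variable bindings needed in this ground example

def unachieved(step):
    """Preconditions not satisfied in the initial state are the starred (*) ones."""
    return {p for p in steps[step]["pre"] if p not in initial_state}

for s in steps:
    print(s, "unachieved:", unachieved(s))
# Both *HOLDING preconditions are unachieved, as noted in the text.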
Hierarchical Planning
In order to solve hard problems, a problem solver may have to generate long plans. In order to do that
efficiently, it is important to be able to eliminate some of the details of the problem until a solution
that addresses the main issues is found. Then an attempt can be made to fill in the appropriate details.
Early attempts to do this involved the use of macro-operators, in which larger operators were built
from smaller ones. But in this approach, no details were eliminated from the actual descriptions of the
operators. A better approach was developed in the ABSTRIPS system, which actually planned in a
hierarchy of abstraction spaces, in each of which preconditions at a lower level of abstraction were
ignored.
As an example, suppose you want to visit a friend in Europe, but you have a limited amount of
cash to spend. It makes sense to check airfares first, since finding an affordable flight will be the most
difficult part of the task. You should not worry about getting out of your driveway, planning a route to
the airport, or parking your car until you are sure you have a flight.
The ABSTRIPS approach to problem-solving is as follows: First solve the problem completely,
considering only preconditions whose criticality value is the highest possible. These values reflect the
expected difficulty of satisfying the precondition. To do this, do exactly what STRIPS did, but simply
ignore preconditions of lower than peak criticality. Once this is done, use the constructed plan as the
outline of a complete plan and consider preconditions at the next-lowest criticality level. Augment the
plan with operators that satisfy those preconditions. Again, in choosing operators, ignore all
preconditions whose criticality is less than the level now being considered. Continue this process of
considering less and less critical preconditions until all of the preconditions of the original rules have
been considered. Because this process explores entire plans at one level of detail before it looks at the
lower-level details of any one of them, it has been called length-first search.
Clearly, the assignment of appropriate criticality values is crucial to the success of this
hierarchical planning method. Those preconditions that no operators can satisfy are clearly the most
critical. For example, if we are trying to solve a problem involving a robot moving around in a house
and we are considering the operator PUSH-THROUGH-DOOR, the precondition that there exist a
door big enough for the robot to get through is of high criticality since there is (in the normal situation)
nothing we can do about it if it is not true. But the precondition that the door be open is of lower
criticality if we have the operator OPEN-DOOR. In order for a hierarchical planning system to work
with STRIPS-like rules, it must be told, in addition to the rules themselves, the appropriate criticality
value for each term that may occur in a precondition. Given these values, the basic process can
function in very much the same way that nonhierarchical planning does. But effort will not be wasted
filling in the details of plans that do not even come close to solving the problem.
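The length-first control loop can be sketched as follows: plan first against only the most critical preconditions, then repeat at successively lower criticality thresholds, using each pass as the outline for the next. The criticality values and the household-robot preconditions are invented for illustration, and the planner call is only stubbed out.

criticality = {                      # assumed values for a household robot
    "EXISTS-DOOR-BIG-ENOUGH": 3,     # nothing an operator can change
    "DOOR-OPEN": 2,                  # OPEN-DOOR can achieve this
    "ROBOT-NEXT-TO-DOOR": 1,         # cheap to achieve by moving
}

def visible_preconditions(preconds, threshold):
    """Preconditions below the current criticality threshold are ignored."""
    return [p for p in preconds if criticality[p] >= threshold]

def abstrips_outline(preconds):
    outline = []
    for threshold in sorted(set(criticality.values()), reverse=True):
        level = visible_preconditions(preconds, threshold)
        # A real system would run a STRIPS-style planner here, elaborating the
        # previous outline; this sketch just records what each pass considers.
        outline.append((threshold, level))
    return outline

for threshold, preconds in abstrips_outline(list(criticality)):
    print(f"criticality >= {threshold}: plan for {preconds}")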
REACTIVE SYSTEMS
A reactive system must have access to a knowledge base of some sort that describes what actions
should be taken under what circumstances. A reactive system is very different from the other kinds of
planning systems we have discussed because it chooses actions one at a time; it does not anticipate and
select an entire action sequence before it does the first thing.
One of the very simplest reactive systems is a thermostat. The job of a thermostat is to keep the
temperature constant inside a room. One might imagine a solution to this problem that requires
significant amounts of planning, taking into account how the external temperature rises and falls
during the day, how heat flows from room to room, and so forth. But a real thermostat uses the simple
pair of situation-action rules:
1. If the temperature in the room is k degrees above the desired temperature, then turn the air
conditioner on
2. If the temperature in the room is k degrees below the desired temperature, then turn the air
conditioner off
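The two situation-action rules above amount to the following kind of controller: no model, no lookahead, just a check against the current percept. The tolerance k = 2 degrees is an arbitrary assumption.

K = 2.0

def thermostat(current_temp, desired_temp, ac_on):
    """Return the new state of the air conditioner for this single step."""
    if current_temp >= desired_temp + K:
        return True      # rule 1: too warm, turn the air conditioner on
    if current_temp <= desired_temp - K:
        return False     # rule 2: too cool, turn the air conditioner off
    return ac_on         # otherwise leave it alone

state = False
for temp in [21, 24, 25, 23, 19]:
    state = thermostat(temp, desired_temp=22, ac_on=state)
    print(temp, "->", "AC on" if state else "AC off")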
It turns out that reactive systems are capable of surprisingly complex behaviors, especially in real
world tasks such as robot navigation. The main advantage reactive systems have over traditional
planners is that they operate robustly in domains that are difficult to model completely and accurately.
Reactive systems dispense with modeling altogether and base their actions directly on their perception
of the world. In complex and unpredictable domains, the ability to plan an exact sequence of steps
ahead of time is of questionable value. Another advantage of reactive systems is that they are
extremely responsive, since they avoid the combinatorial explosion involved in deliberative planning.
This makes them attractive for real-time tasks like driving and walking.
Of course, many Al tasks do require significant deliberation, which is usually implemented as
internal search. Since reactive systems maintain no model of the world and no explicit goal structures,
their performance in these tasks is limited. For example, it seems unlikely that a purely reactive system
could ever play expert chess. It is possible to provide a reactive system with rudimentary planning
capability, but only by explicitly storing whole plans along with the situations that should trigger them.
Deliberative planners need not rely on pre-stored plans; they can construct a new plan for each new
problem.
Nevertheless, inquiry into reactive systems has served to illustrate many of the shortcomings of
traditional planners. For one thing, it is vital to interleave planning and plan execution. Planning is
important, but so is action. An intelligent system with limited resources must decide when to start
thinking, when to stop thinking, and when to act. Also, goals arise naturally when the system interacts
with the environment. Some mechanism for suspending plan execution is needed so that the system
can turn its attention to high priority goals. Finally, some situations require immediate attention and
rapid action. For this reason, some deliberative planners [Mitchell, 1990] compile out reactive
subsystems (i.e., sets of situation-action rules) based on their problem solving experiences. Such
systems learn to be reactive over time.
Learning
Definition of Learning:
One of the most often heard criticisms of Al is that machines cannot be called intelligent until
they are able to learn to do new things and to adapt to new situations, rather than simply doing as they
are told to do. There can be little question that the ability to adapt to new surroundings and to solve
new problems is an important characteristic of intelligent entities.
One simple definition is that learning denotes changes in a system that enable it to do the same task, or tasks drawn from the same population, more efficiently and more effectively the next time. As thus defined, learning covers a wide range of phenomena. At one end of the spectrum is skill
refinement. People get better at many tasks simply by practicing. The more you ride a bicycle or play
tennis, the better you get. At the other end of the spectrum lies knowledge acquisition.
Knowledge acquisition itself includes many different activities. Simple storing of computed
information, or rote learning, is the most basic learning activity. Many computer programs, e.g.,
database systems, can be said to “learn” in this sense, although most people would not call such simple
storage learning. However, many Al programs are able to improve their performance substantially
through rote-learning techniques, and we will look at one example in depth, the checker-playing
program of Samuel.
People also learn through their own problem-solving experience. After solving a complex
problem, we remember the structure of the problem and the methods we used to solve it. The next time
we see the problem, we can solve it more efficiently. Moreover, we can generalize from our
experience to solve related problems more easily. In contrast to advice taking, learning from problem-
solving experience does not usually involve gathering new knowledge that was previously unavailable
to the learning program. That is, the program remembers its experiences and generalizes from them,
but does not add to the transitive closure of its knowledge, in the sense that an advice-taking program
would, i.e., by receiving stimuli from the outside world. In large problem spaces, however, efficiency
gains are critical. Practically speaking, learning can mean the difference between solving a problem
rapidly and not solving it at all. In addition, programs that learn through problem-solving experience
may be able to come up with qualitatively better solutions in the future.
Another form of learning that does involve stimuli from the outside is learning from examples.
We often learn to classify things in the world without being given explicit rules. For example, adults
can differentiate between cats and dogs, but small children often cannot. Somewhere along the line,
we induce a method for telling cats from dogs based on seeing numerous examples of each. Learning
from examples usually involves a teacher who helps us classify things by correcting us when we are
wrong. Sometimes, however, a program can discover things without the aid of a teacher.
ROTE LEARNING
When a computer stores a piece of data, it is performing a rudimentary form of learning. After all,
this act of storage presumably allows the program to perform better in the future (otherwise, why
bother?). In the case of data caching, we store computed values so that we do not have to recompute
them later. When computation is more expensive than recall, this strategy can save a significant
amount of time. Caching has been used in Al programs to produce some surprising performance
improvements. Such caching is known as rote learning.
Samuel's checker-playing program learned to play checkers well enough to beat its creator. It exploited two kinds of
learning: rote learning, which we look at now, and parameter (or coefficient) adjustment, which is
described in Section 17.4.1. Samuel’s program used the minimax search procedure to explore checkers
game trees. As is the case with all such programs, time constraints permitted it to search only a few
levels in the tree. (The exact number varied depending on the situation.) When it could search no
deeper, it applied its static evaluation function to the board position and used that score to continue its
search of the game tree. When it finished searching the tree and propagating the values backward, it
had a score for the position represented by the root of the tree. It could then choose the best move and
make it. But it also recorded the board position at the root of the tree and the backed up score that had
just been computed for it. This situation is shown in Fig. 17.1 (a).
Now suppose that in a later game, the situation shown in Fig. 17.1 (b) were to arise. Instead of using
the static evaluation function to compute a score for position A, the stored value for A can be used.
This creates the effect of having searched an additional several ply since the stored value for A was
computed by backing up values from exactly such a search.
Rote learning of this sort is very simple. It does not appear to involve any sophisticated problem-
solving capabilities. But even it shows the need for some capabilities that will become increasingly
important in more complex learning systems. These capabilities include:
• Organized Storage of Information—In order for it to be faster to use a stored value than it would
be to recompute it, there must be a way to access the appropriate stored value quickly. In Samuel’s
program, this was done by indexing board positions by a few important characteristics, such as the
number of pieces. But as the complexity of the stored information increases, more sophisticated
techniques are necessary.
• Generalization—the number of distinct objects that might potentially be stored can be very large.
To keep the number of stored objects down to a manageable level, some kind of generalization is
necessary. In Samuel’s program, for example, the number of distinct objects that could be stored was
equal to the number of different board positions that can arise in a game. Only a few simple forms of
generalization were used in Samuel’s program to cut down that number. All positions are stored as
though White is to move. This cuts the number of stored positions in half. When possible, rotations
along the diagonal are also combined. Again, though, as the complexity of the learning process
increases, so too does the need for generalization.
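The rote-learning idea, caching a backed-up score so that a later search reaching the same position can reuse it, can be sketched as follows. The position encoding and the canonicalization step standing in for "store as though White is to move" are simplified stand-ins, not Samuel's actual indexing scheme.

score_cache = {}

def canonical(position, white_to_move):
    # Generalization: always store positions from White's point of view
    # (here crudely simulated by reversing the position tuple).
    return position if white_to_move else tuple(reversed(position))

def lookup(position, white_to_move):
    return score_cache.get(canonical(position, white_to_move))

def store(position, white_to_move, backed_up_score):
    score_cache[canonical(position, white_to_move)] = backed_up_score

# After a minimax search rooted at this position, record its backed-up value.
store(("w@12", "w@20", "b@27"), True, 0.35)

# Later, the same position appears deeper in another search tree.
print(lookup(("w@12", "w@20", "b@27"), True))   # 0.35, no further search needed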
LEARNING BY TAKING ADVICE
A computer can do very little without a program for it to run. When a programmer writes a series
of instructions into a computer, a rudimentary kind of learning is taking place: The programmer is a
sort of teacher, and the computer is a sort of student. After being programmed, the computer is now
able to do something it previously could not. Executing the program may not be such a simple matter,
however. Suppose the program is written in a high-level language like LISP. Some interpreter or
compiler must intervene to change the teacher’s instructions into code that the machine can execute
directly.
People process advice in an analogous way. In chess, the advice "Fight for control of the center of
the board” is useless unless the player can translate the advice into concrete moves and plans. A
computer program might make use of the advice by adjusting its static evaluation function to include a
factor based on the number of center squares attacked by its own pieces.
Mostow [1983] describes a program called FOO, which accepts advice for playing hearts, a card
game. A human user first translates the advice from English into a representation that FOO can
understand. For example, “Avoid taking points” becomes:
(Avoid (take-points me) (trick))
FOO must operationalize this advice by turning it into an expression that contains concepts and actions
FOO can use when playing the game of hearts. One strategy FOO can follow is to UNFOLD an
expression by replacing some term by its definition. By UNFOLDing the definition of avoid, FOO
comes up with:
(achieve (not (during (trick) (take-points me))))
FOO considers the advice to apply to the player called “me.” Next, FOO UNFOLDs the definition of
trick:
(achieve (not (during
(scenario
(each pl (players) (play-card pl))
(take-trick (trick-winner)))
(take-points me))))
In other words, the player should avoid taking points during the scenario consisting of (1) players
playing cards and (2) one player taking the trick. FOO then uses case analysis to determine which
steps could cause one to take points. It rules out step 1 on the basis that it knows of no intersection of
the concepts take-points and play-card. But step 2 could affect taking points, so FOO UNFOLDs the
definition of take-points:
(achieve (not (there-exists c1 (cards-played)
(there-exists c2 (point-cards)
(during (take (trick-winner) c1)
(take me c2))))))
This advice says that the player should avoid taking point-cards during the process of the trick-winner
taking the trick. The question for FOO now is: Under what conditions does (take me c2) occur during
(take (trick-winner) c1)? By using a technique called partial match, FOO hypothesizes that points will
be taken if me = trick-winner and c2 = c1. It transforms the advice into:
(achieve (not (and (have-points (cards-played))
(= (trick-winner) me))))
This means “Do not win a trick that has points.” We have not traveled very far conceptually from
“avoid taking points,” but it is important to note that the current vocabulary is one that FOO can
understand in terms of actually playing the game of hearts. Through a number of other
transformations, FOO eventually settles on:
(achieve (>= (and (in-suit-led (card-of me))
(possible (trick-has-points)))
(low (card-of me))))
LEARNING FROM EXAMPLES: INDUCTION
Classification is the process of assigning to a particular input the name of a class to which it
belongs. The classes from which the classification procedure can choose can be described in a variety
of ways. Their definition will depend on the use to which they will be put.
Classification is an important component of many problem-solving tasks. In its simplest form, it
is presented as a straightforward recognition task. An example of this is the question “What letter of
the alphabet is this?” But often classification is embedded inside another operation. To see how this
can happen, consider a problem solving system that contains the following production rule:
If: the current goal is to get from place A to place B, and there is a WALL separating the two
places
then: look for a DOORWAY in the WALL and go through it.
To use this rule successfully, the system’s matching routine must be able to identify an object as a
wall. Without this, the rule can never be invoked. Then, to apply the rule, the system must be able to
recognize a doorway.
Before classification can be done, the classes it will use must be defined. This can be done in a variety
of ways, including:
• Isolate a set of features that are relevant to the task domain. Define each class by a weighted sum of
values of these features. Each class is then defined by a scoring function that looks very similar to the
scoring functions often used in other situations, such as game playing. Such a function has the form
c1t1 + c2t2 + c3t3 + ... + cntn
Each t corresponds to a value of a relevant parameter, and each c represents the weight to be attached
to the corresponding t. Negative weights can be used to indicate features whose presence usually
constitutes negative evidence for a given class. (A small sketch of such a scoring classifier appears
after this list.)
• Isolate a set of features that are relevant to the task domain. Define each class as a structure
composed of those features.
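As a small illustration of the first approach, the sketch below defines each class by a weight vector over the same feature values and assigns an input to the highest-scoring class. The feature names, weights, and class names are invented for illustration.

def score(weights, features):
    # c1*t1 + c2*t2 + ... + cn*tn
    return sum(c * t for c, t in zip(weights, features))

def classify(class_weights, features):
    return max(class_weights, key=lambda name: score(class_weights[name], features))

class_weights = {
    "economy-car": [0.8, -0.5, 0.3],   # weights on (fuel economy, price, size)
    "luxury-car":  [-0.2, 0.9, 0.6],
}
print(classify(class_weights, [0.9, 0.2, 0.4]))   # -> "economy-car"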
Regardless of the way that classes are to be described, it is often difficult to construct, by hand,
good class definitions. This is particularly true in domains that are not well understood or that change
rapidly. Thus the idea of producing a classification program that can evolve its own class definitions is
appealing. This task of constructing class definitions is called concept learning, or induction. The
techniques used for this task must, of course, depend on the way that classes (concepts) are described.
If classes are described by scoring functions, then concept learning can be done using the technique of
coefficient adjustment. If, however, we want to define classes structurally, some other technique for
learning class definitions is necessary. In this section, we present three such techniques.
Algorithm: Candidate Elimination
Given: A representation language and a set of positive and negative examples expressed in that
language.
Compute: A concept description that is consistent with all the positive examples and none of the
negative examples.
1. Initialize G to contain one element: the null description (all features are variables).
2. Initialize S to contain one element: the first positive example.
3. Accept a new training example.
If it is a positive example, first remove from G any descriptions that do not cover the example.
Then, update the S set to contain the most specific set of descriptions in the version space that
cover the example and the current elements of the S set.
That is, generalize the elements of S as little as possible so that they cover the new training
example. If it is a negative example, first remove from S any descriptions that cover the example.
Then, update the G set to contain the most general set of descriptions in the version space that do
not cover the example. That is, specialize the elements of G as little as possible so that the
negative example is no longer covered by any of the elements of G.
4. If S and G are both singleton sets, then if they are identical, output their value and halt. If they are
both singleton sets but they are different, then the training cases were inconsistent. Output this
result and halt. Otherwise, go to step 3.
Let us trace the operation of the candidate elimination algorithm. Suppose we want to learn the
concept “Japanese economy car” from the examples in Fig. 17.12. G and S both start out as
singleton sets. G contains the null description, and S contains the first positive training
example. The version space contains all descriptions that are consistent with this first example.
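The sketch below is a simplified Python rendering of the algorithm for conjunctive attribute-value descriptions, where "?" stands for "any value"; it keeps a single most-specific hypothesis in S and assumes the first training example is positive. The attribute names and training data are made up, loosely following the Japanese economy car example.

def covers(h, ex):
    return all(hv == "?" or hv == ev for hv, ev in zip(h, ex))

def generalize(s, ex):
    # generalize S as little as possible so that it covers the new positive example
    return tuple(sv if sv == ev else "?" for sv, ev in zip(s, ex))

def specialize(g, ex, s):
    # specialize g minimally, using values from S, so it no longer covers the negative example
    return [g[:i] + (s[i],) + g[i + 1:]
            for i in range(len(g))
            if g[i] == "?" and s[i] != "?" and s[i] != ex[i]]

def candidate_elimination(examples):
    # examples: list of (attribute-tuple, is_positive); the first example is assumed positive
    S = examples[0][0]
    G = [("?",) * len(S)]
    for ex, positive in examples[1:]:
        if positive:
            G = [g for g in G if covers(g, ex)]        # drop G members that miss the example
            S = generalize(S, ex)                      # step 3, positive case
        else:
            G = [h for g in G                          # step 3, negative case
                   for h in (specialize(g, ex, S) if covers(g, ex) else [g])]
    return S, G

# attributes: (origin, manufacturer, colour, type) -- the data are made up
examples = [
    (("Japan", "Honda",    "Blue",  "Economy"), True),
    (("Japan", "Toyota",   "Green", "Economy"), True),
    (("Japan", "Toyota",   "Blue",  "Sports"),  False),
    (("USA",   "Chrysler", "Red",   "Economy"), False),
]
print(candidate_elimination(examples))
# (('Japan', '?', '?', 'Economy'), [('Japan', '?', '?', 'Economy')])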
EXPLANATION-BASED LEARNING
Learning complex concepts using these procedures typically requires a substantial number of training
instances.
But people seem to be able to learn quite a bit from single examples. Consider a chess player who, as
Black, has reached the position shown in Fig. 17.14. The position is called a “fork” because the white
knight attacks both the black king and the black queen. Black must move the king, thereby leaving the
queen open to capture. From this single experience, Black is able to learn quite a bit about the fork
trap: the idea is that if any piece x attacks both the opponent’s king and another piece y, then piece y
will be lost. We don’t need to see dozens of positive and negative examples of fork positions in order
to draw these conclusions. From just one experience, we can learn to avoid this trap in the future and
perhaps to use it to our own advantage.
What makes such single-example learning possible? The answer, not surprisingly, is knowledge. The
chess player has plenty of domain-specific knowledge that can be brought to bear, including the rules
of chess and any previously acquired strategies. That knowledge can be used to identify the critical
aspects of the training example. In the case of the fork, we know that the double simultaneous attack is
important while the precise position and type of the attacking piece is not.
Much of the recent work in machine learning has moved away from the empirical, data-intensive
approach described in the last section toward this more analytical, knowledge-intensive approach. A
number of independent studies led to the characterization of this approach as explanation-based
learning. An EBL system attempts to learn from a single example x by explaining why x is an example
of the target concept. The explanation is then generalized, and the system’s performance is improved
through the availability of this knowledge.
Mitchell et al. and DeJong and Mooney both describe general frameworks for EBL programs and give
general learning algorithms. We can think of EBL programs as accepting the following as input:
• A Training Example—what the learning program “sees” in the world, e.g., the car of Fig. 17.7
• A Goal Concept—A high-level description of what the program is supposed to learn
• An Operationality Criterion—A description of which concepts are usable
• A Domain Theory—A set of rules that describe relationships between objects and actions in a
domain
From this, EBL computes a generalization of the training example that is sufficient to describe the goal
concept, and also satisfies the operationality criterion.
In the fork example, the explanation of why the position is bad for Black is a structure
consisting of White’s knight, Black’s king, and Black’s queen, each in their specific positions.
Operationality is ensured: all chess-playing programs understand the basic concepts of piece and
position. Next, the explanation is generalized. Using domain knowledge, we find that moving the
pieces to a different part of the board is still bad for Black. We can also determine that other pieces
besides knights and queens can participate in fork attacks.
DISCOVERY
Learning is the process by which one entity acquires knowledge. Usually that knowledge is already
possessed by some number of other entities who may serve as teachers. Discovery is a restricted form
of learning in which one entity acquires knowledge without the help of a teacher. In this section, we
look at three types of automated discovery systems.
AM: Theory-Driven Discovery
Discovery is certainly learning. But it is also, perhaps more clearly than other kinds of learning,
problem solving. Suppose that we want to build a program to discover things, for example, in
mathematics. We expect that such a program would have to rely heavily on the problem-solving
techniques we have discussed. In fact, one such program was written by Lenat [1977; 1982]. It was
called AM, and it worked from a few basic concepts of set theory to discover a good deal of standard
number theory.
AM exploited a variety of general-purpose AI techniques. It used a frame system to represent
mathematical concepts. One of the major activities of AM is to create new concepts and fill in their
slots. An example of an AM concept is shown in Fig. 17.16. AM also uses heuristic search, guided by
a set of 250 heuristic rules representing hints about activities that are likely to lead to “interesting”
discoveries. Examples of the kind of heuristics AM used are shown in Fig. 17.17. Generate-and-test is
used to form hypotheses on the basis of a small number of examples and then to test the hypotheses on
a larger set to see if they still appear to hold. Finally, an agenda controls the entire discovery process.
When the heuristics suggest a task, it is placed on a central agenda, along with the reason that it was
suggested and the strength with which it was suggested. AM operates in cycles, each time choosing
the most promising task from the agenda and performing it.
In one run, AM discovered the concept of prime numbers. How did it do that? Having stumbled
onto the natural numbers, AM explored operations such as addition, multiplication, and their inverses.
It created the concept of divisibility and noticed that some numbers had very few divisors. AM has a
built-in heuristic that tells it to explore extreme cases. It attempted to list all numbers with zero
divisors (finding none), one divisor (finding one: 1), and two divisors. AM was instructed to call the
last concept “primes.” Before pursuing this concept, AM went on to list numbers with three divisors,
such as 49. AM tried to relate this property with other properties of 49, such as its being odd and a
perfect square. AM generated other odd numbers and other perfect squares to test its hypotheses. A
side effect of determining the equivalence of perfect squares with numbers with three divisors was to
boost the “interestingness” rating of the divisor concept. This led AM to investigate ways in which a
number could be broken down into factors. AM then noticed that there was only one way to break a
number down into prime factors (known as the Unique Factorization Theorem).
Since breaking down numbers into multiplicative components turned out to be interesting, AM
decided, by analogy, to pursue additive components as well. It made several uninteresting
conjectures, such as that every number could be expressed as a sum of 1’s. It also found more
interesting phenomena, such as that many numbers were expressible as the sum of two primes. By
listing cases, AM determined that all even numbers greater than 2 seemed to have this property. This
conjecture, known as Goldbach’s Conjecture, is widely believed to be true, but a proof of it has yet to
be found in mathematics.
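The following small sketch imitates, in Python, the kind of exploration just described: counting divisors, isolating the extreme case of exactly two divisors (the primes), listing numbers with three divisors, and testing the Goldbach-style conjecture on even numbers by listing cases. It is only an illustration of the heuristics, not AM itself.

def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]

primes = [n for n in range(2, 100) if len(divisors(n)) == 2]
three_divisors = [n for n in range(2, 100) if len(divisors(n)) == 3]   # squares of primes, e.g. 49

def goldbach_witness(n):
    # return a pair of primes summing to n, if one exists in our small range
    for p in primes:
        if p <= n and (n - p) in primes:
            return (p, n - p)
    return None

print(primes[:10])
print(three_divisors)                                        # [4, 9, 25, 49]
print(all(goldbach_witness(n) for n in range(4, 100, 2)))    # True for this range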
Clustering
A third type of discovery, called clustering, is very similar to induction, as we described it in Section
17.5. In inductive learning, a program learns to classify objects based on the labelings provided by a
teacher. In clustering, no class labelings are provided. The program must discover for itself the natural
classes that exist for the objects, in addition to a method for classifying instances. AUTOCLASS
[Cheeseman et al., 1988] is one program that accepts a number of training cases and hypothesizes a
set of classes. For any given case, the program provides a set of probabilities that predict into which
class(es) the case is likely to fall. In one application, AUTOCLASS found meaningful new classes of
stars from their infrared spectral data. This was an instance of true discovery by computer, since the
facts it discovered were previously unknown to astronomy.
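The sketch below is a toy soft-clustering routine in the same spirit: each case receives a probability-like membership in every cluster rather than a single label. It is not AUTOCLASS's actual Bayesian method, and the one-dimensional data are invented.

import math, random

def soft_cluster(points, k, iters=25):
    centers = random.sample(points, k)
    memberships = []
    for _ in range(iters):
        memberships = []
        for p in points:
            ws = [math.exp(-abs(p - c)) for c in centers]
            total = sum(ws)
            memberships.append([w / total for w in ws])    # soft class probabilities
        for j in range(k):                                  # weighted-mean centre update
            num = sum(m[j] * p for m, p in zip(memberships, points))
            den = sum(m[j] for m in memberships)
            centers[j] = num / den
    return centers, memberships

points = [1.0, 1.2, 0.9, 5.1, 4.8, 5.3]
centers, memberships = soft_cluster(points, k=2)
print(centers)          # typically one centre near 1 and one near 5 (order depends on the random start)
print(memberships[0])   # probabilities of the first point belonging to each class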
ANALOGY
Analogy is a powerful inference tool. Our language and reasoning are laden with analogies. Consider
the following sentences:
• Last month, the stock market was a roller coaster.
• Bill is like a fire engine.
• Problems in electromagnetism are just like problems in fluid flow.
Underlying each of these examples is a complicated mapping between what appear to be
dissimilar concepts. For example, to understand the first sentence above, it is necessary to do two
things: (1) pick out one key property of a roller coaster, namely that it travels up and down rapidly and
(2) realize that physical travel is itself an analogy for numerical fluctuations (in stock prices). This is
no easy trick. The space of possible analogies is very large. We do not want to entertain possibilities
such as “the stock market is like a roller coaster because it is made of metal.”
Lakoff and Johnson [1980] make the case that everyday language is filled with such analogies
and metaphors. An AI program that is unable to grasp analogy will be difficult to talk to and,
consequently, difficult to teach. Thus, analogical reasoning is an important factor in learning by advice
taking. It is also important to learning in problem-solving.
Transformational Analogy
Suppose you are asked to prove a theorem in plane geometry. You might look for a previous theorem
that is very similar and “copy” its proof, making substitutions when necessary. The idea is to
transform a solution to a previous problem into a solution for the current problem. Figure 17.19 shows
this process. An example of transformational analogy is shown in Fig. 17.20. The program has seen
proofs about points and line segments; for example, it knows a proof that the line segment RN is
exactly as long as the line segment OY, given that RO is exactly as long as NY. The program is now
asked to prove a theorem about angles, namely that the angle BD is equivalent to the angle CE, given
that angles BC and DE are equivalent. The proof about line segments is retrieved and transformed into
a proof about angles by substituting the notion of line for point, angle for line segment, AB for R, AC
for O, AD for N, and AE for Y.
Carbonell describes one method for transforming old solutions into new solutions. Whole solutions are
viewed as states in a problem space called T-space. T-operators prescribe the methods of transforming
solutions (states) into other solutions. Reasoning by analogy becomes search in T-space: starting with
an old solution, we use means-ends analysis or some other method to find a solution to the current
problem.
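A minimal sketch of the transformation step is shown below: a stored solution, represented as structured tuples, is mapped onto the new problem by substituting symbols according to a dictionary. The "proof" steps and the substitution table are only illustrative, not an actual geometry prover.

old_proof = [
    ("equal", ("segment", "R", "O"), ("segment", "N", "Y")),   # given
    ("equal", ("segment", "R", "N"), ("segment", "O", "Y")),   # conclusion
]

substitutions = {"segment": "angle", "R": "AB", "O": "AC", "N": "AD", "Y": "AE"}

def transform(term, subs):
    # recursively substitute symbols throughout the structured solution
    if isinstance(term, tuple):
        return tuple(transform(t, subs) for t in term)
    return subs.get(term, term)

new_proof = [transform(step, substitutions) for step in old_proof]
print(new_proof[1])
# ('equal', ('angle', 'AB', 'AD'), ('angle', 'AC', 'AE'))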
Derivational Analogy
Notice that transformational analogy does not look at how the old problem was solved: it only looks at
the final solution. Often the twists and turns involved in solving an old problem are relevant to solving
a new problem. The detailed history of a problem-solving episode is called its derivation. Analogical
reasoning that takes these histories into account is called derivational analogy (see Fig. 17.21).
Carbonell claims that derivational analogy is a necessary component in the transfer of skills in
complex domains. For example, suppose you have coded an efficient sorting routine in Pascal, and
then you are asked to recode the routine in LISP. A line-by-line translation is not appropriate, but you
will reuse the major structural and control decisions you made when you constructed the Pascal
program. One way to model this behavior is to have a problem-solver “replay” the previous derivation
and modify it when necessary. If the original reasons and assumptions for a step’s existence still hold
in the new problem, the step is copied over. If some assumption is no longer valid, another assumption
must be found. If one cannot be found, then we can try to find justification for some alternative stored
in the derivation of the original problem. Or perhaps we can try some step marked as leading to search
failure in the original derivation, if the reasons for that failure are no longer valid in the current
derivation.
NEURAL NET LEARNING AND GENETIC LEARNING
Perceptrons
The perceptron, an invention of Rosenblatt, was one of the earliest neural network models. A
perceptron models a neuron by taking a weighted sum of its inputs and sending the output 1 if the sum
is greater than some adjustable threshold value (otherwise it sends 0). Fig. 18.5 shows the device.
Notice that in a perceptron, unlike a Hopfield network, connections are unidirectional.
The inputs (x1, x2, ..., xn) and connection weights (w1, w2, ..., wn) in the figure are typically real
values, both positive and negative. If the presence of some feature xi tends to cause the perceptron to
fire, the weight wi will be positive; if the feature xi inhibits the perceptron, the weight wi will be
negative. The perceptron itself consists of the weights, the summation processor, and the adjustable
threshold processor. Learning is a process of modifying the values of the weights and the threshold. It
is convenient to implement the threshold as just another weight w0, as in Fig. 18.6. This weight can be
thought of as the propensity of the perceptron to fire irrespective of its inputs. The perceptron of Fig.
18.6 fires if the weighted sum is greater than zero.
A perceptron computes a binary function of its input. Several perceptrons can be combined to
compute more complex functions, as shown in Fig. 18.7. Such a group of perceptrons can be trained
on sample input-output pairs until it learns to compute the correct function. The amazing property of
perceptron learning is this: Whatever a perceptron can compute, it can learn to compute! We
demonstrate this in a moment. At the time perceptrons were invented, many people speculated that
intelligent systems could be constructed out of perceptrons (see Fig. 18.8).
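The sketch below implements the unit just described: the threshold is folded in as an extra weight w0 on a constant input of 1, the unit outputs 1 when the weighted sum exceeds zero, and the classic error-correction rule adjusts the weights from sample input-output pairs. The training data (logical AND) are illustrative.

def fire(weights, inputs):
    # w0 acts as the adjustable threshold via the constant input 1.0
    total = sum(w * x for w, x in zip(weights, [1.0] + inputs))
    return 1 if total > 0 else 0

def train(samples, rate=0.2, epochs=25):
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - fire(weights, inputs)           # 0 when correct
            weights = [w + rate * error * x
                       for w, x in zip(weights, [1.0] + inputs)]
    return weights

and_samples = [([0.0, 0.0], 0), ([0.0, 1.0], 0), ([1.0, 0.0], 0), ([1.0, 1.0], 1)]
w = train(and_samples)
print([fire(w, x) for x, _ in and_samples])   # [0, 0, 0, 1]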
Reinforcement Learning
What if we train our networks not with sample outputs but with punishment and reward instead? This
process is certainly sufficient to train animals to perform relatively interesting tasks. Barto describes a
network which learns as follows:
(1) The network is presented with a sample input from the training set,
(2) The network computes what it thinks should be the sample output,
(3) The network is supplied with a real-valued judgment by the teacher,
(4) The network adjusts its weights, and the process repeats. A positive value in step 3 indicates good
performance, while a negative value indicates bad performance. The network seeks a set of weights
that will prevent negative reinforcement in the future, much as an experimental rat seeks behaviors
that will prevent electric shocks.
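The following sketch imitates the four-step loop above with a perceptron-style unit. It is not Barto's actual network: in this simplified variant the unit adjusts its weights only when it is punished, moving away from the response it just produced, which is one way of seeking weights that prevent future negative reinforcement. The teacher function and the target behaviour (logical OR) are invented.

import random

def respond(weights, inputs):
    total = sum(w * x for w, x in zip(weights, [1.0] + inputs))
    return 1 if total > 0 else 0

def teacher(inputs, output):
    # hypothetical judge: rewards the unit for computing logical OR
    target = 1 if max(inputs) == 1.0 else 0
    return 1.0 if output == target else -1.0          # real-valued reinforcement

weights = [0.0, 0.0, 0.0]
samples = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
for _ in range(200):
    inputs = random.choice(samples)                   # (1) present a sample input
    output = respond(weights, inputs)                 # (2) network's answer
    reward = teacher(inputs, output)                  # (3) teacher's judgment
    if reward < 0:                                    # (4) adjust to avoid punishment
        direction = -1.0 if output == 1 else 1.0      # move away from the punished response
        weights = [w + 0.1 * direction * x
                   for w, x in zip(weights, [1.0] + inputs)]

print([respond(weights, s) for s in samples])         # typically [0, 1, 1, 1]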
Expert System
Expert systems solve problems that are normally solved by human “experts.” To solve expert-level
problems, expert systems need access to a substantial domain knowledge base, which must be built as
efficiently as possible. They also need to exploit one or more reasoning mechanisms to apply their
knowledge to the problems they are given. Then they need a mechanism for explaining what they have
done to the users who rely on them. One way to look at expert systems is that they represent applied
AI in a very broad sense. The problems that expert systems deal with are highly diverse. There are
some general issues that arise across these varying domains. But it also turns out that there are
powerful techniques that can be defined for specific classes of problems.
REPRESENTING AND USING DOMAIN KNOWLEDGE
Expert systems are complex AI programs. Almost all of the representation and reasoning techniques
described so far have been exploited in at least one expert system. However, the most widely used way
of representing domain knowledge in
expert systems is as a set of production rules, which are often coupled with a frame system that defines
the objects that occur in the rules. In Section 8.2, we saw one example of an expert system rule, which
was taken from the MYCIN system. Let’s look at a few additional examples drawn from some other
representative expert systems. All the rules we show are English versions of the actual rules that the
systems use. Differences among these rules illustrate some of the important differences in the ways
that expert systems operate.
R1 [McDermott, 1982; McDermott, 1984] (sometimes also called XCON) is a program that
configures DEC VAX systems. Its rules look like this:
Notice that R1’s rules, unlike MYCIN’s, contain no numeric measures of certainty. In the task
domain with which R1 deals, it is possible to state exactly the correct thing to be done in each
particular set of circumstances (although it may require a relatively complex set of antecedents to do
so). One reason for this is that there exists a good deal of human expertise in this area. Another is that
since R1 is doing a design task (in contrast to the diagnosis task performed by MYCIN), it is not
necessary to consider all possible alternatives; one good one is enough. As a result, probabilistic
information is not necessary in R1.
PROSPECTOR [Duda et al., 1979; Hart et al., 1978] is a program that provides advice on
mineral exploration. Its rules look like this:
In PROSPECTOR, each rule contains two confidence estimates. The first indicates the extent to
which the presence of the evidence described in the condition part of the rule suggests the validity of
the rule's conclusion. In the PROSPECTOR rule shown above, the number 2 indicates that the
presence of the evidence is mildly encouraging. The second confidence estimate measures the extent
to which the evidence is necessary to the validity of the conclusion, or stated another way, the extent
to which the lack of the evidence indicates that the conclusion is not valid. In the example rule shown
above, the number -4 indicates that the absence of the evidence is strongly discouraging for the
conclusion.
DESIGN ADVISOR [Steele et al., 1989] is a system that critiques chip designs. Its rules look like:
The DESIGN ADVISOR gives advice to a chip designer, who can accept or reject the advice. If
the advice is rejected, the system can exploit a justification-based truth maintenance system to revise
its model of the circuit. The first rule shown here says that an element should be criticized for poor
resetability if its sequential level count is greater than two, unless its signal is currently believed to be
resetable. Resetability is a fairly common condition, so it is mentioned explicitly in this first rule. But
there is also a much less common condition, called direct resetability. The DESIGN ADVISOR does
not even bother to consider that condition unless it gets in trouble with its advice. At that point, it can
exploit the second of the rules shown above. Specifically, if the chip designer rejects a critique about
resetability and if that critique was based on a high level count, then the system will attempt to
discover (possibly by asking the designer) whether the element is directly resetable. If it is, then the
original rule is defeated and the conclusion withdrawn.
Reasoning with the Knowledge
As these example rules have shown, expert systems exploit many of the representation and
reasoning mechanisms that we have discussed. Because these programs are usually written primarily
as rule-based systems, forward chaining, backward chaining, or some combination of the two, is
usually used. For example, MYCIN used backward chaining to discover what organisms were present;
then it used forward chaining to reason from the organisms to a treatment regime. R1, on the other
hand, used forward chaining. As the field of expert systems matures, more systems that exploit other
kinds of reasoning mechanisms are being developed. The DESIGN ADVISOR is an example of such a
system; in addition to exploiting rules, it makes extensive use of a justification-based truth
maintenance system.
EXPERT SYSTEM SHELLS
Initially, each expert system that was built was created from scratch, usually in LISP. But, after
several systems had been built this way, it became clear that these systems often had a lot in common.
In particular, since the systems were constructed as a set of declarative representations (mostly rules)
combined with an interpreter for those representations, it was possible to separate the interpreter from
the domain-specific knowledge and thus to create a system that could be used to construct new
expert systems by adding new knowledge corresponding to the new problem domain. The resulting
interpreters are called shells. One influential example of such a shell is EMYCIN (for Empty MYCIN)
[Buchanan and Shortliffe, 1984], which was derived from MYCIN.
There are now several commercially available shells that serve as the basis for many of the expert
systems currently being built. These shells provide much greater flexibility in representing knowledge
and in reasoning with it than MYCIN did. They typically support rules, frames, truth maintenance
systems, and a variety of other reasoning mechanisms.
Early expert system shells provided mechanisms for knowledge representation, reasoning, and
explanation. Later, tools for knowledge acquisition were added. But as experience with using these
systems to solve real world problems grew, it became clear that expert system shells needed to do
something else as well. They needed to make it easy to integrate expert systems with other kinds of
programs. Expert systems cannot operate in a vacuum, any more than their human counterparts can.
They need access to corporate databases, and access to them needs to be controlled just as it does for
other systems. They are often embedded within larger application programs that use primarily
conventional programming techniques. So one of the important features that a shell must provide is an
easy-to-use interface between an expert system that is written with the shell and a larger, probably
more conventional, programming environment.
EXPLANATION
In order for an expert system to be an effective tool, people must be able to interact with it easily. To
facilitate this interaction, the expert system must have the following two capabilities in addition to the
ability to perform its underlying task:
• Explain its reasoning. In many of the domains in which expert systems operate, people will not
accept results unless they have been convinced of the accuracy of the reasoning process that produced
those results. This is particularly true, for example, in medicine, where a doctor must accept
ultimate responsibility for a diagnosis, even if that diagnosis was arrived at with considerable help
from a program. Thus it is important that the reasoning process used in such programs proceed in
understandable steps and that enough meta-knowledge (knowledge about the reasoning process) be
available so the explanations of those steps can be generated.
• Acquire new knowledge and modifications of old knowledge. Since expert systems derive their
power from the richness of the knowledge bases they exploit, it is extremely important that those
knowledge bases be as complete and as accurate as possible. But often there exists no standard
codification of that knowledge; rather it exists only inside the heads of human experts. One way to get
this knowledge into a program is through interaction with the human expert. Another way is to have
the program learn expert behavior from raw data.
MYCIN attempts to solve its goal of recommending a therapy for a particular patient by first
finding the cause of the patient’s illness. It uses its production rules to reason backward from goals to
clinical observations. To solve the top-level diagnostic goal, it looks for rules whose right sides suggest
diseases. It then uses the left sides of those rules (the preconditions) to set up subgoals whose success
would enable the rules to be invoked. These subgoals are again matched against rules, and their
preconditions are used to set up additional subgoals. Whenever a precondition describes a specific
piece of clinical evidence, MYCIN uses that evidence if it already has access to it. Otherwise, it asks
the user to provide the information. In order that MYCIN’s requests for information will appear
coherent to the user, the actual goals that MYCIN sets up are often more general than they need be to
satisfy the preconditions of an individual rule. For example, if a precondition specifies that the identity
of an organism is X, MYCIN will set up the goal “infer identity.” This approach also means that if
another rule mentions the organism’s identity, no further work will be required, since the identity will
be known.
TEIRESIAS was the first program to support explanation and knowledge acquisition.
TEIRESIAS served as a front-end for the MYCIN expert system. A fragment of a TEIRESIAS-
MYCIN conversation with a user (a doctor) is shown in Fig. 20.1. The program has asked for a piece
of information that it needs in order to continue its reasoning. The doctor wants to know why the
program wants the information, and later asks how the program arrived at a conclusion that it claimed
it had reached.
An important premise underlying TEIRESIAS’s approach to explanation is that the behavior of a
program can be explained simply by referring to a trace of the program’s execution. There are ways in
which this assumption limits the kinds of explanations that can be produced, but it does minimize the
overhead involved in generating each explanation. To understand how TEIRESIAS generates
explanations of MYCIN’s behavior, we need to know how that behavior is structured.
When TEIRESIAS provides the answer to the first of these questions, the user may be satisfied or
may want to follow the reasoning process back even further. The user can do that by asking additional
“WHY” questions.
When TEIRESIAS provides the answer to the second of these questions and tells the user what it
already believes, the user may want to know the basis for those beliefs. The user can ask this with a
“HOW” question, which TEIRESIAS will interpret as “How did you know that?” This question also
can be answered by looking at the goal tree and chaining backward from the stated fact to the evidence
that allowed a rule that determined the fact to fire. Thus we see that by reasoning backward from its
top-level goal and by keeping track of the entire tree that it traverses in the process, TEIRESIAS-
MYCIN can do a fairly good job of justifying its reasoning to a human user.
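The sketch below shows, in miniature, the goal-tree bookkeeping that makes such WHY and HOW answers possible. The rules and facts are invented stand-ins, not actual MYCIN rules: backward chaining records, for each question asked of the user, the rule and goal that needed it (to answer WHY), and, for each derived fact, the rule that concluded it (to answer HOW).

rules = [
    ("r1", ["organism-is-gram-positive", "organism-grows-in-chains"], "organism-is-strep"),
    ("r2", ["organism-is-strep"], "give-penicillin"),
]
user_facts = {"organism-is-gram-positive": True, "organism-grows-in-chains": True}
why_log = {}    # question asked of the user -> (rule, goal) that needed it
how_log = {}    # derived fact -> rule that established it

def prove(goal):
    for name, premises, conclusion in rules:
        if conclusion == goal:
            if all(prove_premise(p, name, goal) for p in premises):
                how_log[goal] = name
                return True
    return False

def prove_premise(p, rule_name, goal):
    if p in how_log:
        return True
    if prove(p):                         # try to derive it from other rules first
        return True
    why_log[p] = (rule_name, goal)       # we ask the user because this rule needs it
    return user_facts.get(p, False)

print(prove("give-penicillin"))                                          # True
print("WHY organism-grows-in-chains?", why_log["organism-grows-in-chains"])
print("HOW organism-is-strep?", how_log["organism-is-strep"])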
KNOWLEDGE ACQUISITION
The knowledge engineer interviews a domain expert to elucidate expert knowledge, which is then
translated into rules. After the initial system is built, it must be iteratively refined until it approximates
expert-level performance. This process is expensive and time-consuming, so it is worthwhile to look
for more automatic ways of constructing expert knowledge bases. While no totally automatic
knowledge acquisition systems yet exist, there are many programs that interact with domain experts to
extract expert knowledge efficiently. These programs provide support for the following activities:
• Entering knowledge
• Maintaining knowledge base consistency
• Ensuring knowledge base completeness
The most useful knowledge acquisition programs are those that are restricted to a particular
problem-solving paradigm, e.g., diagnosis or design. It is important to be able to enumerate the roles
that knowledge can play in the problem-solving process. For example, if the paradigm is diagnosis,
then the program can structure its knowledge base around symptoms, hypotheses, and causes. It can
identify symptoms for which the expert has not yet provided causes. Since one symptom may have
multiple causes, the program can ask for knowledge about how to decide when one hypothesis is better
than another. If we move to another type of problem-solving, say designing artifacts, then these
acquisition strategies no longer apply, and we must look for other ways of profitably interacting with
an expert. We now examine two knowledge acquisition systems in detail.
MOLE is a knowledge acquisition system for heuristic classification problems, such as diagnosing
diseases. In particular, it is used in conjunction with the cover-and-differentiate problem-solving
method. An expert system produced by MOLE accepts input data, comes up with a set of candidate
explanations or classifications that cover (or explain) the data, then uses differentiating knowledge to
determine which one is best. The process is iterative, since explanations must themselves be justified,
until ultimate causes are ascertained.
MOLE interacts with a domain expert to produce a knowledge base that a system called MOLE-p (for
MOLE-performance) uses to solve problems. The acquisition proceeds through several steps:
1. Initial knowledge base construction. MOLE asks the expert to list common symptoms or
complaints that might require diagnosis. For each symptom, MOLE prompts for a list of possible
explanations. MOLE then iteratively seeks out higher-level explanations until it comes up with a
set of ultimate causes.
Whenever an event has multiple explanations, MOLE tries to determine the conditions under
which one explanation is correct. The expert provides covering knowledge, that is, the knowledge
that a hypothesized event might be the cause of a certain symptom. MOLE then tries to infer
anticipatory knowledge, which says that if the hypothesized event does occur, then the symptom
will definitely appear. This knowledge allows the system to rule out certain hypotheses on the
basis that specific symptoms are absent.
2. Refinement of the knowledge base. MOLE now tries to identify the weaknesses of the
knowledge base. One approach is to find holes and prompt the expert to fill them. It is difficult, in
general, to know whether a knowledge base is complete, so instead MOLE lets the expert watch
MOLE-p solving sample problems. Whenever MOLE-p makes an incorrect diagnosis, the expert
adds new knowledge. There are several ways in which MOLE-p can reach the wrong conclusion.
It may incorrectly reject a hypothesis because it does not feel that the hypothesis is needed to
explain any symptom. It may advance a hypothesis because it is needed to explain some otherwise
inexplicable hypothesis. Or it may lack differentiating knowledge for choosing between alternative
hypotheses.
MOLE has been used to build systems that diagnose problems with car engines, problems in
steel-rolling mills, and inefficiencies in coal-burning power plants. For MOLE to be applicable,
however, it must be possible to preenumerate solutions or classifications. It must also be practical to
encode the knowledge in terms of covering and differentiating.
One problem-solving method useful for design tasks is called propose-and-revise. Propose-and-
revise systems build up solutions incrementally. First, the system proposes an extension to the current
design. Then it checks whether the extension violates any global or local constraints. Constraint
violations are then fixed, and the process repeats. It turns out that domain experts are good at listing
overall design constraints and at providing local constraints on individual parameters, but not so good
at explaining how to arrive at global solutions.
Like MOLE, SALT builds a dependency network as it converses with the expert. Each node
stands for a value of a parameter that must be acquired or generated. There are three kinds of links:
contributes-to, constrains, and suggests-revision-of. Associated with the first type of link are
procedures that allow SALT to generate a value for one parameter based on the value of another. The
second type of link, constrains, rules out certain parameter values. The third link, suggests-revision-of, points to
ways in which a constraint violation can be fixed. SALT uses the following heuristics to guide the
acquisition process:
i. Every noninput node in the network needs at least one contributes-to link coming into it. If
links are missing, the expert is prompted to fill them in.
ii. No contributes-to loops are allowed in the network. Without a value for at least one parameter
in the loop, it is impossible to compute values for any parameter in that loop. If a loop exists,
SALT tries to transform one of the contributes-to links into a constrains link.
iii. Constraining links should have suggests-revision-of links associated with them. These include
constrains links that are created when dependency loops are broken.
Control knowledge is also important. It is critical that the system propose extensions and
revisions that lead toward a design solution. SALT allows the expert to rate revisions in terms of how
much trouble they tend to produce.
SALT compiles its dependency network into a set of production rules. As with MOLE, an expert
can watch the production system solve problems and can override the system’s decision. At that point,
the knowledge base can be changed or the override can be logged for future inspection.
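A small sketch of this dependency-network bookkeeping is given below. The network content is invented; the link names follow the text, and the two checks mirror the first two heuristics: every noninput node needs an incoming contributes-to link, and contributes-to links must not form a loop.

network = {
    # node: list of (link-type, source-node) for links coming into the node
    "platform-width":  [],                                        # supplied as input
    "cable-weight":    [("contributes-to", "platform-width")],
    "cable-strength":  [("contributes-to", "cable-weight"),
                        ("constrains", "max-load")],
    "max-load":        [],                                        # supplied as input
}
inputs = {"platform-width", "max-load"}

def missing_contributes_to(net):
    # heuristic (i): every noninput node needs at least one incoming contributes-to link
    return [n for n, links in net.items()
            if n not in inputs and not any(t == "contributes-to" for t, _ in links)]

def has_contributes_to_loop(net):
    # heuristic (ii): no contributes-to loops are allowed
    def reachable(start, node, seen):
        for t, src in net.get(node, []):
            if t != "contributes-to":
                continue
            if src == start or (src not in seen and reachable(start, src, seen | {src})):
                return True
        return False
    return any(reachable(n, n, set()) for n in net)

print(missing_contributes_to(network))   # [] -- every noninput node is covered
print(has_contributes_to_loop(network))  # False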
The process of interviewing a human expert to extract expertise presents a number of difficulties,
regardless of whether the interview is conducted by a human or by a machine. Experts are surprisingly
inarticulate when it comes to how they solve problems. They do not seem to have access to the low-
level details of what they do and are especially inadequate suppliers of any type of statistical
information. There is, therefore, a great deal of interest in building systems that automatically induce
their own rules by looking at sample problems and solutions.
Expert System Architectures
Expert systems have proven to be effective in a number of problem domains which normally
require the kind of intelligence possessed by a human expert. The areas of application are almost
endless: wherever human expertise is needed to solve problems, expert systems are likely candidates
for application. Application domains include law, chemistry, biology, engineering, manufacturing,
aerospace, military operations, finance, banking, meteorology, geology, geophysics, and more. The
list goes on and on.
An expert system is a set of programs that manipulate encoded knowledge to solve problems in a
specialized domain that normally requires human expertise. An expert system's knowledge is obtained
from expert sources and coded in a form suitable for the system to use in its inference or reasoning
processes. The expert knowledge must be obtained from specialists or other sources of expertise, such
as texts, journal articles, and data bases. This type of knowledge usually requires much training and
experience in some specialized field such as medicine, geology, system configuration, or engineering
design. Once a sufficient body of expert knowledge has been acquired, it must be encoded in some
form, loaded into a knowledge base, then tested, and refined continually throughout the life of the
system.
Characteristic Features of Expert Systems
Expert systems differ from conventional computer systems in several important ways:
i. Expert systems use knowledge rather than data to control the solution process. "In the knowledge
lies the power" is a theme repeatedly followed and supported throughout this book. Much of the
knowledge used is heuristic in nature rather than algorithmic.
ii. The knowledge is encoded and maintained as an entity separate from the control program. As
such, it is not compiled together with the control program itself. This permits the incremental
addition and modification (refinement) of the knowledge base without recompilation of the
control programs.
iii. Expert systems are capable of explaining how a particular conclusion was reached, and why
requested information is needed during a consultation. This is important as it gives the user a
chance to assess and understand the system's reasoning ability, thereby improving the user's
confidence in the system.
iv. Expert systems use symbolic representations for knowledge (rules, networks, or frames) and
perform their inference through symbolic computations that closely resemble manipulations of
natural language. (An exception to this is the expert system based on neural network
architectures.)
v. Expert systems often reason with metaknowledge; that is, they reason with knowledge about
themselves, and their own knowledge limits and capabilities.
Background History
Expert systems first emerged from the research laboratories of a few leading U.S. universities
during the 1960s and 1970s. They were developed as specialized problem solvers which emphasized
the use of knowledge rather than algorithms and general search methods. This approach marked a
significant departure from conventional AI system architectures at the time.
The first expert system to be completed was DENDRAL, developed at Stanford University in the
late 1960s. This system was capable of determining the structure of chemical compounds given a
specification of the compound's constituent elements and mass spectrometry data obtained from
samples of the compound. DENDRAL used heuristic knowledge obtained from experienced chemists
to help constrain the problem and thereby reduce the search space. During tests, DENDRAL
discovered a number of structures previously unknown to expert chemists.
MYCIN's performance improved significantly over a several year period as additional knowledge
was added. Tests indicate that MYCIN's performance now equals or exceeds that of experienced
physicians. The initial MYCIN knowledge base contained only about 200 rules. This number was
gradually increased to more than 600 rules by the early 1980s. The added rules significantly improved
MYCIN's performance, leading to a 65% success record which compared favorably with experienced
physicians who demonstrated only an average 60% success rate (Lenat, 1984).
Applications
Since the introduction of these early expert systems, the range and depth of applications has
broadened dramatically. Applications can now be found in almost all areas of business and
government. They include such areas as:
• Different types of medical diagnoses (internal medicine, pulmonary diseases, infectious blood
diseases, and so on)
• Diagnosis of complex electronic and electromechanical systems
• Diagnosis of diesel electric locomotion systems
• Diagnosis of software development projects
• Planning experiments in biology, chemistry, and molecular genetics
• Forecasting crop damage
• Identification of chemical compound structures and chemical compounds
• Location of faults in computer and communications systems
• Scheduling of customer orders, job shop production operations, computer resources for
operating systems, and various manufacturing tasks
• Evaluation of loan applicants for lending institutions
• Assessment of geologic structures from dip meter logs
• Analysis of structural systems for design or as a result of earthquake damage
• The optimal configuration of components to meet given specifications for a complex system
(like computers or manufacturing facilities)
• Estate planning for minimal taxation and other specified goals
• Stock and bond portfolio selection and management
• The design of very large scale integration (VLSI) systems
• Numerous military applications ranging from battlefield assessment to ocean surveillance
• Numerous applications related to space planning and exploration
• Numerous areas of law, including civil case evaluation, product liability, assault and battery,
and general assistance in locating different law precedents
• Planning curricula for students
• Teaching students specialized tasks (like troubleshooting equipment faults)
Importance of Expert Systems
The value of expert systems was well established by the early 1980s. A number of successful
applications had been completed by then, and they proved to be cost effective. An example which
illustrates this point well is the diagnostic system developed by the Campbell Soup Company.
Campbell Soup uses large sterilizers or cookers to cook soups and other canned products at eight
plants located throughout the country. Some of the larger cookers hold up to 68,000 cans of food for
short periods of cooking time. When difficult maintenance problems occur with the cookers, the fault
must be found and corrected quickly or the batch of foods being prepared will spoil. Until recently, the
company had been depending on a single expert to diagnose and cure the more difficult problems,
flying him to the site when necessary. Since this individual will retire in a few years, taking his
expertise with him, the company decided to develop an expert system to diagnose these difficult
problems.
RULE-BASED SYSTEM ARCHITECTURES
The most common form of architecture used in expert and other types of knowledge- based systems is
the production system, also called the rule-based system. This type of system uses knowledge encoded
in the form of production rules, that is, if-then rules. We may remember from Chapter 4 that rules have
an antecedent or condition part, the left-hand side, and a conclusion or action part, the right-hand side.
IF: Condition-1 and Condition-2 and Condition-3
THEN: Take Action-4
• IF: The temperature is greater than 200 degrees, and The water level is low
THEN: Open the safety valve.
A & B & C & D → E & F
Each rule represents a small chunk of knowledge relating to the given domain of expertise. A number
of related rules collectively may correspond to a chain of inferences which lead from some initially
known facts to some useful conclusions. When the known facts support the conditions in the rule's left
side, the conclusion or action part of the rule is then accepted as known (or at least known with some
degree of certainty).
Inference in production systems is accomplished by a process of chaining through the rules
recursively, either in a forward or backward direction, until a conclusion is reached or until failure
occurs. The selection of rules used in the chaining process is determined by matching current facts
against the domain knowledge or variables in rules and choosing among a candidate set of rules the
ones that meet some given criteria, such as specificity. The inference process is typically carried out in
an interactive mode with the user providing input parameters needed to complete the rule chaining
process. The main components of a typical expert system are depicted in Figure. The solid lined boxes
in the figure represent components found in most systems whereas the broken lined boxes are found in
only a few such systems.
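The following minimal sketch shows the forward-chaining version of this cycle: rules whose conditions are all present among the current facts fire and add their conclusions, and the cycle repeats until nothing new can be concluded. The rules are illustrative only and ignore certainty factors and conflict-resolution criteria such as specificity.

rules = [
    ({"temperature-above-200", "water-level-low"}, "open-safety-valve"),
    ({"open-safety-valve"}, "log-safety-event"),
]

def forward_chain(facts, rules):
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conditions <= facts and conclusion not in facts:   # all conditions known
                facts.add(conclusion)                             # fire the rule
                changed = True
    return facts

print(forward_chain({"temperature-above-200", "water-level-low"}, rules))
# {'temperature-above-200', 'water-level-low', 'open-safety-valve', 'log-safety-event'}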
NONPRODUCTION SYSTEM ARCHITECTURES
Other, less common expert system architectures (although no less important) are those based on
nonproduction rule-representation schemes. Instead of rules, these systems employ more structured
representation schemes like associative or semantic networks, frame and rule structures, decision trees,
or even specialized networks like neural networks. In this section we examine some typical system
architectures based on these methods.
Associative or Semantic Network Architectures
The associative network is a network made up of nodes connected by directed arcs. The nodes
represent objects, attributes, concepts, or other basic entities, and the arcs, which are labeled, describe
the relationship between the two nodes they connect. Special network links include the ISA and
HASPART links, which designate an object as being a certain type of object (belonging to a class of
objects) and as being a subpart of another object, respectively.
Associative network representations are especially useful in depicting hierarchical knowledge
structures, where property inheritance is common. Objects belonging to a class of other objects may
inherit many of the characteristics of the class. Inheritance can also be treated as a form of default
reasoning. This facilitates the storage of information shared by many objects, as well as the
inferencing process.
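A minimal sketch of such inheritance is shown below: a property lookup climbs ISA links until a value is found, so class-level facts act as defaults for their members. The network content is made up.

network = {
    "canary": {"isa": "bird", "colour": "yellow"},
    "bird":   {"isa": "animal", "can-fly": True, "has-part": "wings"},
    "animal": {"breathes": True},
}

def get_property(node, prop):
    while node is not None:
        frame = network.get(node, {})
        if prop in frame:
            return frame[prop]          # found locally
        node = frame.get("isa")         # otherwise inherit from the parent class
    return None

print(get_property("canary", "colour"))    # yellow  (stored on canary itself)
print(get_property("canary", "can-fly"))   # True    (inherited from bird)
print(get_property("canary", "breathes"))  # True    (inherited from animal)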
Associative network representations are not a popular form of representation for standard expert
systems. More often, these network representations are used in natural language or computer vision
systems or in conjunction with some other form of representation.
One expert system based on the use of an associative network representation is CASNET (Causal
Associational Network), which was developed at Rutgers University during the early 1970s.
CASNET is used to diagnose and recommend treatment for glaucoma, one of the leading causes of
blindness.
The network in CASNET is divided into three planes or types of knowledge as depicted in Figure 15.4.
The different knowledge types are
• Patient observations (tests, symptoms, and other signs)
• Pathophysiological states
• Disease categories
Patient observations are provided by the user during an interactive session with the system. The
system presents menu type queries, and the user selects one of several possible choices. These
observations help to establish an abnormal condition caused by a disease process. The condition is
established through the causal network model as part of the cause and effect relationship relating
symptoms and other signs to diseases.
Inference is accomplished by traversing the network, following the most plausible paths of
causes and effects. Once a sufficiently strong path has been determined through the network,
diagnostic conclusions are inferred using classification tables that interpret patterns of the causal
network. These tables are similar to rule interpretations.
Frame Architectures
Frames are structured sets of closely related knowledge, such as an object or concept name, the
object's main attributes and their corresponding values, and possibly some attached procedures (if-
needed, if-added, if-removed procedures). The attributes and procedures are stored in specified slots
and slot facets of the frame. Individual frames are usually linked together as a network much like the
nodes in an associative network. Thus, frames may have many of the features of associative networks,
namely, property inheritance and default reasoning. Several expert systems have been constructed with
frame architectures, and a number of building tools which create and manipulate frame structured
systems have been developed.
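The sketch below shows a toy frame with slots, slot facets, and an attached if-needed procedure: when a slot has no stored value, the if-needed facet is run to compute one on demand. All names and values are illustrative.

frames = {
    "cooker-17": {
        "isa": {"value": "pressure-cooker"},
        "capacity-cans": {"value": 68000},
        "temperature": {"value": 230},
        "status": {"if-needed": lambda frame: "fault" if frame["temperature"]["value"] > 250
                                              else "normal"},
    }
}

def get_slot(frame_name, slot):
    frame = frames[frame_name]
    facets = frame.get(slot, {})
    if "value" in facets:
        return facets["value"]
    if "if-needed" in facets:
        return facets["if-needed"](frame)    # run the attached procedure on demand
    return None

print(get_slot("cooker-17", "capacity-cans"))  # 68000
print(get_slot("cooker-17", "status"))         # 'normal' (computed by the if-needed facet)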
Decision Tree Architectures
Knowledge for expert systems may be stored in the form of a decision tree when the knowledge can
be structured in a top-to-bottom manner. For example, the identification of objects (equipment faults,
physical objects, diseases, and the like) can be made through a decision tree structure. Initial and
intermediate nodes in the tree correspond to object attributes, and terminal nodes correspond to the
identities of objects. Attribute values for an object determine a path to a leaf node in the tree which
contains the object's identification. Each object attribute corresponds to a nonterminal node in the tree
and each branch of the decision tree corresponds to an attribute value or set of values.
A segment of a decision tree knowledge structure taken from an expert system used to identify
objects such as liquid chemical waste products is illustrated in Figure. Each node in the tree
corresponds to an identifying attribute such as molecular weight, boiling point, burn test color, or
solubility test results. Each branch emanating from a node corresponds to a value or range of values
for the attribute such as 20-37 degrees C, yellow, or nonsoluble in sulphuric acid.
Identification is made by traversing a path through the tree (or network) until the path leads to a
unique leaf node which corresponds to the unknown object's identity.
The knowledge base, which is the decision tree for an identification system, can be constructed
with a special tree-building editor or with a learning module. In either case, a set of the most
discriminating attributes for the class of objects being identified should be selected. Only those
attributes that discriminate well among different objects need be used. Permissible values for each of
the attributes are grouped into separable sets, and each such set determines a branch from the attribute
node to the next node.
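A minimal sketch of identification by tree traversal is given below: nonterminal nodes name an attribute, branches are labelled with value sets, and leaves name the identified object. The tree content and attribute values are made up.

tree = ("boiling-point",
        {"20-37C": ("burn-test-colour",
                    {"yellow": "compound-A",
                     "green":  "compound-B"}),
         "38-80C": "compound-C"})

def identify(node, attributes):
    while isinstance(node, tuple):            # nonterminal: (attribute, branches)
        attribute, branches = node
        node = branches[attributes[attribute]]
    return node                               # leaf: the object's identity

sample = {"boiling-point": "20-37C", "burn-test-colour": "yellow"}
print(identify(tree, sample))                 # compound-A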
Knowledge Acquisition
The knowledge engineer interviews a domain expert to elucidate expert knowledge, which is then
translated into rules. After the initial system is built, it must be iteratively refined until it approximates
expert-level performance. This process is expensive and time-consuming, so it is worthwhile to look
for more automatic ways of constructing expert knowledge bases. While no totally automatic
knowledge acquisition systems yet exist, there are many programs that interact with domain experts to
extract expert knowledge efficiently. These programs provide support for the following activities:
• Entering knowledge
• Maintaining knowledge base consistency
• Ensuring knowledge base completeness
The most useful knowledge acquisition programs are those that are restricted to a particular
problem-solving paradigm, e.g., diagnosis or design. It is important to be able to enumerate the roles
that knowledge can play in the problem-solving process. For example, if the paradigm is diagnosis,
then the program can structure its knowledge base around symptoms, hypotheses, and causes. It can
identify symptoms for which the expert has not yet provided causes. Since one symptom may have
multiple causes, the program can ask for knowledge about how to decide when one hypothesis is better
than another. If we move to another type of problem-solving, say designing artifacts, then these
acquisition strategies no longer apply, and we must look for other ways of profitably interacting with
an expert. We now examine two knowledge acquisition systems in detail.
MOLE is a knowledge acquisition system for heuristic classification problems, such as
diagnosing diseases. In particular, it is used in conjunction with the cover-and-differentiate problem-
solving method. An expert system produced by MOLE accepts input data, comes up with a set of
candidate explanations or classifications that cover (or explain) the data, then uses differentiating
knowledge to determine which one is best. The process is iterative, since explanations must
themselves be justified, until ultimate causes are ascertained.
MOLE interacts with a domain expert to produce a knowledge base that a system called MOLE-p
(for MOLE-performance) uses to solve problems. The acquisition proceeds through several steps:
1. Initial knowledge base construction. MOLE asks the expert to list common symptoms or
complaints that might require diagnosis. For each symptom, MOLE prompts for a list of possible
explanations. MOLE then iteratively seeks out higher-level explanations until it comes up with a
set of ultimate causes.
2. Whenever an event has multiple explanations, MOLE tries to determine the conditions under
which one explanation is correct. The expert provides covering knowledge, that is, the knowledge
that a hypothesized event might be the cause of a certain symptom. MOLE then tries to infer
anticipatory knowledge, which says that if the hypothesized event does occur, then the symptom
will definitely appear. This knowledge allows the system to rule out certain hypotheses on the
basis that specific symptoms are absent.
3. Refinement of the knowledge base. MOLE now tries to identify the weaknesses of the knowledge
base. One approach is to find holes and prompt the expert to fill them. It is difficult, in general, to
know whether a knowledge base is complete, so instead MOLE lets the expert watch MOLE-p
solving sample problems. Whenever MOLE-p makes an incorrect diagnosis, the expert adds new
knowledge. There are several ways in which MOLE-p can reach the wrong conclusion. It may
incorrectly reject a hypothesis because it does not feel that the hypothesis is needed to explain any
symptom. It may advance a hypothesis because it is needed to explain some otherwise
inexplicable symptom. Or it may lack differentiating knowledge for choosing between alternative
hypotheses.
MOLE has been used to build systems that diagnose problems with car engines, problems in
steel-rolling mills, and inefficiencies in coal-burning power plants. For MOLE to be applicable,
however, it must be possible to preenumerate solutions or classifications. It must also be practical to
encode the knowledge in terms of covering and differentiating.
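The covering and anticipatory styles of knowledge that MOLE acquires can be pictured with the following rough Python sketch. The rule format and the example events are assumptions made for illustration; MOLE's actual internal representation is not reproduced here.

# Rough sketch of covering and anticipatory knowledge (illustrative data).

# Covering knowledge: hypothesis -> symptoms it can explain.
covers = {
    "blocked air intake": {"high fuel consumption", "dark smoke"},
    "water in fuel":      {"high fuel consumption"},
}

# Anticipatory knowledge: if the hypothesis holds, these symptoms must appear.
anticipates = {
    "blocked air intake": {"dark smoke"},
}

def candidate_explanations(observed):
    """Keep hypotheses that cover some observed symptom and whose anticipated
    symptoms are all present (treats an unobserved symptom as absent, a simplification)."""
    candidates = []
    for hyp, covered in covers.items():
        if not covered & observed:
            continue                      # explains nothing that was observed
        expected = anticipates.get(hyp, set())
        if expected - observed:
            continue                      # an anticipated symptom is missing: rule out
        candidates.append(hyp)
    return candidates

print(candidate_explanations({"high fuel consumption"}))
# -> ['water in fuel']  ("blocked air intake" is ruled out because no dark smoke was observed)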
But suppose our task is to design an artifact, for example, an elevator system. It is no longer
possible to preenumerate all solutions. Instead, we must assign values to a large number of parameters,
such as the width of the platform, the type of door, the cable weight, and the cable strength. These
parameters must be consistent with each other, and they must result in a design that satisfies external
constraints imposed by cost factors, the type of building involved and expected payloads.
One problem-solving method useful for design tasks is called propose-and-revise. Propose-and-
revise systems build up solutions incrementally. First, the system proposes an extension to the current
design. Then it checks whether the extension violates any global or local constraints. Constraint
violations are then fixed, and the process repeats. It turns out that domain experts are good at listing
overall design constraints and at providing local constraints on individual parameters, but not so good
at explaining how to arrive at global solutions. The SALT program provides mechanisms for
eliciting this knowledge from the expert.
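Before looking at SALT's internals, the propose-and-revise control loop itself can be summarised in the following minimal Python sketch. The parameter names, the single constraint, and the fix are invented for the sake of a small, self-contained example.

# Minimal sketch of the propose-and-revise loop (toy elevator-like parameters).

def propose_and_revise(parameters, constraints, fixes, max_iterations=100):
    """Extend the design one parameter at a time, repairing violations as they arise."""
    design = {}
    for _ in range(max_iterations):
        # Propose: give a value to the next unassigned parameter.
        unassigned = [p for p in parameters if p not in design]
        if not unassigned:
            return design
        name, propose = unassigned[0], parameters[unassigned[0]]
        design[name] = propose(design)
        # Check and revise: apply the registered fix for each violated constraint.
        for constraint, fix in zip(constraints, fixes):
            if not constraint(design):
                fix(design)
    raise RuntimeError("no consistent design found")

params = {
    "platform_width_m":  lambda d: 2.0,
    "max_load_kg":       lambda d: d["platform_width_m"] * 500,
    "cable_strength_kg": lambda d: 800,
}
constraints = [lambda d: "cable_strength_kg" not in d
                         or d["cable_strength_kg"] >= d["max_load_kg"]]
fixes = [lambda d: d.update(cable_strength_kg=d["max_load_kg"] * 1.5)]
print(propose_and_revise(params, constraints, fixes))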
Like MOLE, SALT builds a dependency network as it converses with the expert. Each node
stands for a value of a parameter that must be acquired or generated. There are three kinds of links:
contributes-to, constrains, and suggests-revision-of. Associated with the first type of link are
procedures that allow SALT to generate a value for one parameter based on the value of another. The
second type of link, constrains, rules out certain parameter values. The third link, suggests-revision-of,
points to ways in which a constraint violation can be fixed. SALT uses the following heuristics to
guide the acquisition process:
i. Every noninput node in the network needs at least one contributes-to link coming into it. If links
are missing, the expert is prompted to fill them in.
ii. No contributes-to loops are allowed in the network. Without a value for at least one parameter in
the loop, it is impossible to compute values for any parameter in that loop. If a loop exists, SALT
tries to transform one of the contributes-to links into a constrains link.
iii. Constraining links should have suggests-revision-of links associated with them. These include
constrains links that are created when dependency loops are broken.
Control knowledge is also important. It is critical that the system propose extensions and
revisions that lead toward a design solution. SALT allows the expert to rate revisions in terms of how
much trouble they tend to produce.
SALT compiles its dependency network into a set of production rules. As with MOLE, an expert
can watch the production system solve problems and can override the system’s decision. At that point,
the knowledge base can be changed or the override can be logged for future inspection.
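The dependency network and two of the acquisition heuristics listed above (every non-input node needs an incoming contributes-to link; contributes-to loops must be detected) can be sketched in Python as follows. The network contents are invented, and SALT's real representation is considerably richer than this.

# Speculative sketch of a SALT-like dependency network and two acquisition checks.

contributes_to = {          # source parameter -> parameters it helps compute
    "expected_payload": ["max_load_kg"],
    "max_load_kg": ["cable_strength_kg"],
}
inputs = {"expected_payload"}
all_nodes = {"expected_payload", "max_load_kg", "cable_strength_kg", "door_type"}

def missing_contributes_to(nodes, edges, input_nodes):
    """Heuristic (i): non-input nodes with no incoming contributes-to link."""
    targets = {t for ts in edges.values() for t in ts}
    return sorted(n for n in nodes if n not in input_nodes and n not in targets)

def has_contributes_to_loop(edges):
    """Heuristic (ii): detect a contributes-to cycle with a depth-first search."""
    visited, on_stack = set(), set()
    def dfs(node):
        visited.add(node); on_stack.add(node)
        for nxt in edges.get(node, []):
            if nxt in on_stack or (nxt not in visited and dfs(nxt)):
                return True
        on_stack.discard(node)
        return False
    return any(dfs(n) for n in list(edges) if n not in visited)

print(missing_contributes_to(all_nodes, contributes_to, inputs))  # -> ['door_type']
print(has_contributes_to_loop(contributes_to))                    # -> False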
The process of interviewing a human expert to extract expertise presents a number of difficulties,
regardless of whether the interview is conducted by a human or by a machine. Experts are surprisingly
inarticulate when it comes to how they solve problems. They do not seem to have access to the low-
level details of what they do and are especially inadequate suppliers of any type of statistical
information. There is, therefore, a great deal of interest in building systems that automatically induce
their own rules by looking at sample problems and solutions. With inductive techniques, an expert needs
only to provide the conceptual framework for a problem and a set of useful examples.
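As a simplified illustration of such inductive techniques (not of any specific system), the following Python sketch induces classification rules from a handful of invented examples by repeatedly choosing the attribute that most reduces entropy, in the style of decision-tree learners such as ID3, and then reading the tree off as IF-THEN rules.

# Tiny ID3-style rule induction from invented examples (illustrative only).

import math
from collections import Counter

def entropy(examples):
    counts = Counter(e["class"] for e in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def best_attribute(examples, attributes):
    def remainder(attr):
        groups = Counter(e[attr] for e in examples)
        return sum((n / len(examples)) *
                   entropy([e for e in examples if e[attr] == v])
                   for v, n in groups.items())
    return min(attributes, key=remainder)   # minimise the remaining entropy

def induce_rules(examples, attributes, conditions=()):
    classes = {e["class"] for e in examples}
    if len(classes) == 1 or not attributes:
        majority = Counter(e["class"] for e in examples).most_common(1)[0][0]
        return [(conditions, majority)]
    attr = best_attribute(examples, attributes)
    rest = [a for a in attributes if a != attr]
    rules = []
    for value in {e[attr] for e in examples}:
        subset = [e for e in examples if e[attr] == value]
        rules += induce_rules(subset, rest, conditions + ((attr, value),))
    return rules

examples = [
    {"smoke": "dark",  "noise": "knocking", "class": "air intake"},
    {"smoke": "dark",  "noise": "normal",   "class": "air intake"},
    {"smoke": "white", "noise": "normal",   "class": "coolant leak"},
    {"smoke": "white", "noise": "knocking", "class": "bearing wear"},
]
for conds, cls in induce_rules(examples, ["smoke", "noise"]):
    print("IF", " AND ".join(f"{a}={v}" for a, v in conds), "THEN", cls)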