Expert Systems
What is AI?
Artificial Intelligence (AI) is a very new field (the term was coined in 1956)
Introduction
Along with molecular biology, cited as the field "I would most like to be in". Quite a universal field: systematizes and automates intellectual tasks, potentially relevant to many human activities
Acting Humanly
Turing (1950): "Computing Machinery and Intelligence". Can machines think? Can machines behave intelligently? Turing Test:
[Figure: Turing Test — an interrogator converses with a human and an AI system without knowing which is which]
Acting Humanly(2)
Although researchers don't focus on passing the Turing Test, it is still valid. AI researchers concentrate on underlying issues. Artificial flight succeeded after turning away from imitating birds and looking at aerodynamics
[Table: four approaches to AI — reasoning vs. behavior, measured against human performance vs. an ideal (rational) standard]
All four approaches have been pursued
To pass test, computer would need following capabilities: language, knowledge, reasoning, understanding, learning
Thinking Humanly
1960s: cognitive revolution, information-processing psychology started replacing behaviorism. Requires scientific theories on internal activities of the brain. Cognitive science (top-down): predicting and testing human behavior. Cognitive neuroscience (bottom-up): identification from neurological data. Distinct approaches, although there is some overlap
Thinking Rationally
Aristotle: one of the first to codify laws of thought. From this the field of logic evolved. Provides patterns for argument that, given correct premises, yield correct conclusions. By 1965 there were programs that could (in principle) solve any solvable problem described in logical notation. Two main obstacles: difficult to state informal, uncertain knowledge in formal, logical notation; solving in principle doesn't mean solving in practice
Acting Rationally
Rational behavior: doing the right thing. Given the available information, try to achieve the best outcome. Doesn't necessarily involve thinking (e.g. reflexes). More general than thinking rationally (which is only concerned with correct inference). Also better suited for scientific development than human-based approaches: human-based approaches mimic behavior that is the result of a complicated and largely unknown evolutionary process. Therefore we focus on the approach of acting rationally
Rational Agents
An agent is an entity that perceives and acts (autonomously). Formally, an agent is a function from percept histories to actions: f : P* → A. We look for agents with the best performance for a given class of environments and tasks. Perfect rationality is unrealistic; aim for the best program for given resources
Foundations of AI
Philosophy: Can formal rules be used to draw valid conclusions? How does the mind arise from a physical brain? Where does knowledge come from? How does knowledge lead to action? Mathematics: What are the formal rules to draw valid conclusions? What can be computed? How do we reason with uncertain information?
Foundations of AI(2)
Economics How to make decisions to maximize payoff? What about others not going along? What if the payoff is far in the future? Neuroscience How do brains process information? Psychology How do humans and animals think and act?
Foundations of AI(3)
Computer engineering: How can we build efficient computers? Control theory and cybernetics: How can artifacts operate under their own control? Linguistics: How does language relate to thought?
History of AI
1952-1969: Early enthusiasm, great expectations 1966-1973: A dose of reality 1969-1979: Knowledge-based systems 1980-present: AI becomes an industry 1986-present: Return of neural networks 1987-present: AI becomes a science 1995-present: Intelligent agents
Chapter Summary
Different people think differently about AI. Are you concerned with thinking or behavior? Do you want to model humans or work from an ideal standard? We focus on rational action
Overview of Lecture
General introduction to AI Problem Solving Neural Networks Evolutionary Computing Swarm Intelligence Fuzzy Systems Social and Philosophical Implications of AI
Chapter 2
Intelligent Agents
Outline
Agents and environments Rationality PEAS (Performance measure, Environment, Actuators, Sensors) Environment types Agent types
Agents include humans, robots, and software programs. Remember: the agent function maps from percept histories to actions: f : P* → A. The agent program runs on a physical architecture to produce f
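As a minimal illustration of the distinction between the agent function f : P* → A and the agent program, the following Python sketch keeps the percept history inside the program and delegates action choice to a replaceable rule; the class and names are illustrative, not from the lecture.

# Minimal sketch of an agent program: it accumulates the percept history
# and maps it to an action via a user-supplied rule (the "agent function").
class Agent:
    def __init__(self, rule):
        self.rule = rule          # rule: list of percepts -> action
        self.percepts = []        # the percept history P*

    def step(self, percept):
        self.percepts.append(percept)
        return self.rule(self.percepts)

# Example rule: act only on the latest percept (a reflex agent).
reflex_rule = lambda history: "Suck" if history[-1][1] == "Dirty" else "Right"

agent = Agent(reflex_rule)
print(agent.step(("A", "Dirty")))   # -> Suck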
Example: the vacuum-cleaner world. Percepts: location and contents, e.g. [A, Dirty]. Actions: Left, Right, Suck, Do Nothing
Rationality
What is rational at any given time depends on: the performance measure that defines success, the agent's prior knowledge of the environment, the actions the agent can perform, and the agent's percept history to date. Rational agent: for each possible percept history, selects an action that is expected to maximize the performance measure given the history and built-in knowledge. Is our vacuum-cleaning function rational?
Rationality(2)
Depends on the situation. Assume the following performance measure: one point per clean location per time step. The environment is known to consist only of locations A and B. Clean locations stay clean; sucking cleans dirty locations. We only have the aforementioned four actions. The agent correctly perceives location and status. In this case it is rational. Is it still rational if we give a penalty of one point per move?
Rationality(3)
Rational ≠ omniscient: perceptions may not supply all relevant information. Rational ≠ clairvoyant: outcomes of actions may not be as expected. Hence, rational does not always mean successful. Being rational is also about exploration (doing actions to modify future perceptions), learning (changing prior knowledge due to perceptions), and autonomy (not relying on the designer's input alone)
PEAS
When designing a rational agent, we need to consider the performance measure, environment, actuators, sensors. Example: designing an automated taxi. Performance measure: safety, destination, profits, legality, comfort, ... Environment: streets and motorways, traffic, pedestrians, weather, day-/nighttime, ... Actuators: steering, accelerator, brake, horn, speaker/display, ... Sensors: video, accelerometer, gauges, engine sensors, keyboard, microphone, GPS, ...
PEAS
Example: Internet shopping. Performance measure: price, quality, appropriateness, efficiency. Environment: current and future web sites of vendors and shippers. Actuators: displaying to the user, following URLs, filling in forms. Sensors: HTML pages (text, graphics, scripts)
Environment Types
Fully observable vs. partially observable: sensors are able to detect all relevant aspects / noisy or inaccurate sensors. Deterministic vs. stochastic: next state completely defined by current state / uncertainty. Episodic vs. sequential: future episodes do not depend on decisions in previous ones / short-term actions can have long-term consequences
Environment Types(2)
Static vs. dynamic: no change in environment while the agent is deliberating / change in environment. Discrete vs. continuous: finite number of distinct states / smooth range of values. Single agent vs. multi-agent: agent working by itself / agent has to compete or cooperate with other agents
Agent Types
Four basic types in order of increasing generality: simple reflex agent, reflex agent with state, goal-based agent, utility-based agent. All of them can be turned into learning agents
Example
The vacuum agent from before is a simple reflex agent:
function VACUUM-AGENT([location, status])
    if status = Dirty then return Suck
    else if location = A then return Right
    else return Left
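A direct Python transcription of this pseudocode might look as follows (a sketch; the (location, status) tuple mirrors the [location, status] percept convention above):

def vacuum_agent(percept):
    """Simple reflex vacuum agent: percept is a (location, status) pair."""
    location, status = percept
    if status == "Dirty":
        return "Suck"
    elif location == "A":
        return "Right"
    else:
        return "Left"

print(vacuum_agent(("A", "Clean")))   # -> Right
print(vacuum_agent(("B", "Dirty")))   # -> Suck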
[Figure: simple reflex agent — sensors perceive the environment, condition-action rules select an action, actuators act on the environment]
Example
cleaned[A], cleaned[B] = false

function VACUUM-AGENT([location, status])
    if status = Dirty then
        cleaned[location] = true
        return Suck
    else if location = A then
        cleaned[A] = true
        if cleaned[B] then return Do Nothing
        else return Right
    else
        cleaned[B] = true
        if cleaned[A] then return Do Nothing
        else return Left
Goal-based Agent
Knowing about the current state is not always enough Correct decision depends on what our goal is Agent has to determine if action will bring it towards goal Doing this involves planning Vacuum-cleaning agent had goal hardwired into program: clean up all locations (so no planning involved)
[Figure: model-based reflex agent — sensors, internal state ("what my actions do"), condition-action rules, actuators, environment]
Goal-based Agent(2)
[Figure: goal-based agent — sensors update the state ("how the world evolves", "what the world is like now", "what it will be like if I do action A"), which is compared against goals to choose an action]
Utility-based Agent
Goals alone are not enough. E.g. there are many ways for a taxi to get to its destination, but some are quicker, safer, more reliable, cheaper, ... How happy does a state make an agent? "Happy" doesn't sound very scientific, therefore it's called (high) utility
Utility-based Agent(2)
[Figure: utility-based agent — sensors, state, "how the world evolves", "what my actions do", "what it will be like if I do action A", "how happy I will be in such a state" (utility), action choice, actuators, environment]
Learning Agent
How do programs for selecting actions come into being? Programming everything by hand is very laborious Alternative: build machines that can learn and then teach them
Learning Agent(2)
[Figure: learning agent — a performance standard and critic provide feedback; the learning element sets learning goals, modifies the performance element's knowledge, and uses a problem generator to suggest exploratory actions; sensors and actuators connect the agent to the environment]
Chapter Summary
Agents interact with environments through actuators and sensors
An agent function f describes what the agent does in all circumstances A performance measure evaluates the sequence of actions and its effects A rational agent maximizes the expected performance
Chapter 3
Outline
Problem-solving agents Problem formulation Example problems
Problem-solving Agents
The simplest agents discussed so far are reflex agents. They use a direct mapping from states to actions, which is unsuitable for very large mappings. Goal-based agents consider future actions and their outcomes. Problem-solving agents are one kind of goal-based agent: they find action sequences that lead to desirable states
Problem-solving Agents(2)
seq: an action sequence, initially empty
state: a description of the current world state
goal: a goal, initially null
problem: a problem formulation

function SIMPLE-PROBLEM-SOLVER(perception)
    state = UPDATE-STATE(state, perception)
    if seq is empty then
        goal = FORMULATE-GOAL(state)
        problem = FORMULATE-PROBLEM(state, goal)
        seq = SEARCH(problem)
    action = FIRST(seq)
    seq = REST(seq)
    return action
Problem-solving Agents(3)
The agent on the previous slide does offline problem solving: a simple formulate, search, execute design. When executing the sequence of actions, it ignores its perceptions, assuming that the solution that was found will always work
Example: Romania
On holiday in Romania; currently in Arad, flight leaves tomorrow from Bucharest. Formulate goal: be in Bucharest in time. Formulate problem: states are being in the various cities; actions are driving between cities. Find a solution: a sequence of cities, e.g. Arad, Sibiu, Fagaras, Bucharest. Execute the solution
Problem Formulation
A problem is defined by four components. Initial state: the state in which the agent starts, e.g. In(Arad). Successor function S(x): description of the possible actions and their outcomes, e.g. S(Arad) = { ⟨Go(Sibiu), In(Sibiu)⟩, ... }. Goal test: determines if a given state is a goal state, e.g. explicit: In(Bucharest); implicit: king checkmated. Path cost: function that assigns a numeric cost to each path (reflects the performance measure), e.g. route distances between cities
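The four components map directly onto code. The sketch below encodes a small fragment of the Romania example (only a few roads, with the distances shown on the lecture map); the class layout and names are illustrative choices, not from the lecture.

class RouteProblem:
    """A search problem: initial state, successor function, goal test, path cost."""
    def __init__(self, roads, start, goal):
        self.roads, self.start, self.goal = roads, start, goal

    def successors(self, city):
        # Returns (action, next_state, step_cost) triples.
        return [(f"Go({n})", n, d) for n, d in self.roads.get(city, [])]

    def goal_test(self, city):
        return city == self.goal

# A small fragment of the Romania road map.
roads = {
    "Arad":    [("Sibiu", 140), ("Timisoara", 118), ("Zerind", 75)],
    "Sibiu":   [("Fagaras", 99), ("Rimnicu Vilcea", 80)],
    "Fagaras": [("Bucharest", 211)],
}
problem = RouteProblem(roads, "Arad", "Bucharest")
print(problem.successors("Arad"))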
[Figure: road map of Romania with driving distances between cities — Arad, Zerind, Oradea, Sibiu, Timisoara, Lugoj, Mehadia, Dobreta, Craiova, Rimnicu Vilcea, Fagaras, Pitesti, Bucharest, Giurgiu, Urziceni, Hirsova, Eforie, Vaslui, Iasi, Neamt]
Problem Formulation(2)
At the moment we assume a single-state problem Deterministic, fully observable environment Agent knows exactly which state it will be in Solution is a sequence of actions from initial state to a goal state Solution quality is measured by path cost Optimal solution has lowest path cost among all solutions
Example: 8-puzzle
A 3 × 3 board with eight numbered tiles and a blank space
[Figure: 8-puzzle — a sample start state and the goal state]
Example: 8-puzzle
States: specify the location of each tile and the blank in one of the nine squares. Initial state: any state can be designated as the initial state. Successor function: generates the legal states that result from the four actions (blank moves Left, Right, Up, Down). Goal test: checks whether the configuration matches the goal state from the previous slide. Path cost: let's assume each step costs 1, so path cost = number of steps in the solution
Real-world Examples
Touring problems: traveling salesman, delivery tours. VLSI layout: positioning millions of components and connections on a chip (minimizing area, circuit delays, stray capacitance, ...). Robot navigation: no discrete set of routes, continuous space. Automatic assembly sequencing: find an order in which to assemble the parts of an object. Protein design: find a sequence of amino acids that will fold into a three-dimensional protein with the right properties. Internet searching: looking for relevant information on the Web
[Figure: partial search trees for the Romania route-finding problem — successive expansions from Arad generate Sibiu, Timisoara, and Zerind, then Sibiu's successors Arad, Fagaras, Oradea, and Rimnicu Vilcea, and so on]
20 states, one for each city. Infinite number of paths, hence an infinite number of nodes (a good search algorithm avoids this). A node in the search tree is a bookkeeping data structure; a state corresponds to a configuration of the world. It is a mere convenience in the example that nodes are named after states
Search Strategies
A strategy is defined by picking the order of node expansion. Uninformed strategies use only the information available in the problem definition: breadth-first search, uniform-cost search, depth-first search, depth-limited search, iterative deepening search
Breadth-first Search
Expand the shallowest unexpanded node. Implementation: the fringe is a FIFO queue, i.e., new successors are added at the end.
[Figure: step-by-step breadth-first expansion of a binary tree with nodes A-G]
Quality of Strategy
How do we measure the quality of a search strategy? Four aspects. Completeness: does the algorithm find a solution if there is one? Optimality: is the strategy able to find an optimal solution? Time complexity: how long does it take to find a solution? Space complexity: how much memory is needed to perform the search?
Quality of Strategy(2)
Complexity is expressed in terms of three quantities Branching factor (b): maximum number of successors of any node Depth (d): depth of the shallowest goal node Maximum length of any path (m) in search space
Uniform-cost Search(UCS)
Expand least-cost unexpanded node Implementation: fringe is a queue ordered by path cost Equivalent to BFS if step costs are all equal
Depth-first Search
Expand the deepest unexpanded node. Implementation: the fringe is a LIFO queue (or stack), i.e., new successors are added at the front.
[Figure: step-by-step depth-first expansion of a binary tree with nodes A-O]
Repeated states: the only way to avoid them is to keep more nodes in memory. Fundamental tradeoff between time and space
Graph Search
function GRAPH-SEARCH(problem, fringe)
    closed = empty set
    fringe = INSERT(INITIAL-STATE(problem), fringe)
    loop do
        if fringe is empty then return failure
        node = REMOVE-FIRST(fringe)
        if GOAL-TEST(problem, STATE(node)) then return node
        if STATE(node) is not in closed then
            add STATE(node) to closed
            fringe = INSERT-ALL(EXPAND(node, problem), fringe)
    end
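A breadth-first instantiation of GRAPH-SEARCH in Python could look like this (a sketch; the graph, goal test, and successor function below are illustrative assumptions, and the FIFO fringe is what makes it breadth-first):

from collections import deque

def breadth_first_graph_search(start, goal_test, successors):
    """GRAPH-SEARCH with a FIFO fringe: returns the list of actions to a goal, or None."""
    fringe = deque([(start, [])])          # fringe holds (state, actions-so-far)
    closed = set()                         # states already expanded
    while fringe:
        state, path = fringe.popleft()     # REMOVE-FIRST on a FIFO queue
        if goal_test(state):
            return path
        if state not in closed:
            closed.add(state)
            for action, nxt in successors(state):
                fringe.append((nxt, path + [action]))
    return None

# Tiny example graph (states A..G arranged as a binary tree).
tree = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F", "G"]}
succ = lambda s: [(f"go {c}", c) for c in tree.get(s, [])]
print(breadth_first_graph_search("A", lambda s: s == "G", succ))  # ['go C', 'go G']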
Chapter Summary
This chapter dealt with problem formulation and (simple) search strategies. Problem formulation involves abstracting away irrelevant real-world details for feasibility reasons. There is a variety of uninformed search strategies. IDS (iterative deepening search) is often the preferred search method for large search spaces and unknown goal depths. Graph search can be exponentially more efficient than tree search
Chapter 4
Outline
Using problem-specific knowledge, searching can be improved considerably. We look at the following strategies in this chapter: best-first search, greedy search, A* search, heuristics
function GEN-TREE-SEARCH(problem, fringe)
    fringe = INSERT(INITIAL-STATE(problem), fringe)
    loop do
        if fringe is empty then return failure
        node = REMOVE-FIRST(fringe)
        if GOAL-TEST(problem, STATE(node)) then return node
        fringe = INSERT(EXPAND(node, problem), fringe)
    end
Best-first Search
Main idea: node is selected for expansion based on an evaluation function f (n) Usually node with lowest score is selected (as f (n) normally measures distance to goal) Implementation: fringe is a queue sorted in ascending order of evaluation scores
Best-first Search(2)
It is rarely possible to find a perfect evaluation function f(n). There is a whole family of best-first search algorithms. The key component is a heuristic function h(n) that estimates the cost of the cheapest path from node n to a goal. E.g. for the Romanian route h(n) could be the straight-line distance from n to Bucharest. We look at two members of the family: greedy search and A* search
Greedy Search
Evaluation function f(n) = h(n). Greedy search expands the node that appears to be closest to the goal. For the Romanian route example we choose h_SLD(n) = straight-line distance from n to Bucharest
[Figure: Romania map annotated with straight-line distances to Bucharest; greedy best-first expansion from Arad via Sibiu (h = 253)]
Properties of greedy search: complete in finite spaces with checking for repeated states. Optimality: no (only optimal for problems with the matroid property). Time complexity: O(b^m), but a good heuristic can give dramatic improvement. Space complexity: O(b^m) (keeps all nodes in memory)
[Figure: greedy best-first search tree from Arad to Bucharest with h_SLD values — Arad 366, Sibiu 253, Fagaras 176, Rimnicu Vilcea 193, Oradea 380, Bucharest 0]
A* Search
Idea: avoid expanding paths that are already expensive. Evaluation function f(n) = g(n) + h(n)
g(n) = cost so far to reach n; h(n) = estimated cost from n to the goal; f(n) = estimated cost of the path through n to the goal
A* Search(2)
A* search is optimal if we use an admissible heuristic. This means that, if we are wrong, we only underestimate the true cost: h(n) ≤ h*(n), where h*(n) is the true cost. We also require h(G) = 0 for any goal G and h(n) > 0 otherwise. For our example, h_SLD(n) is an admissible heuristic
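A compact A* sketch in Python, ordering the fringe by f = g + h with a heap (the graph fragment and straight-line-distance table below are taken from the Romania example; everything else is an illustrative assumption):

import heapq

def astar(graph, h, start, goal):
    """A*: graph maps state -> list of (neighbor, step_cost); h maps state -> estimate."""
    fringe = [(h[start], 0, start, [start])]      # entries are (f, g, state, path)
    best_g = {start: 0}
    while fringe:
        f, g, state, path = heapq.heappop(fringe)
        if state == goal:
            return path, g
        for nxt, cost in graph.get(state, []):
            g2 = g + cost
            if g2 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g2
                heapq.heappush(fringe, (g2 + h[nxt], g2, nxt, path + [nxt]))
    return None, float("inf")

# Fragment of the Romania map with straight-line distances to Bucharest.
graph = {"Arad": [("Sibiu", 140), ("Timisoara", 118)],
         "Sibiu": [("Fagaras", 99), ("Rimnicu", 80)],
         "Fagaras": [("Bucharest", 211)],
         "Rimnicu": [("Pitesti", 97)],
         "Pitesti": [("Bucharest", 101)]}
h = {"Arad": 366, "Sibiu": 253, "Timisoara": 329, "Fagaras": 176,
     "Rimnicu": 193, "Pitesti": 100, "Bucharest": 0}
print(astar(graph, h, "Arad", "Bucharest"))   # Arad-Sibiu-Rimnicu-Pitesti-Bucharest, cost 418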
A* Search Example
[Figure: step-by-step A* expansion from Arad, showing f = g + h values such as Arad 366 = 0 + 366, Sibiu 393 = 140 + 253, Timisoara 447 = 118 + 329, Zerind 449 = 75 + 374]
Optimality of A*
Suppose some suboptimal goal G2 has been generated and is in the queue, and let n be an unexpanded node on a shortest path to an optimal goal G1
[Figure: continuation of the A* search tree with f-values (e.g. Bucharest 450 = 450 + 0 via Fagaras, 418 = 418 + 0 via Pitesti); a suboptimal goal G2 sits in the fringe alongside nodes on the path to the optimal goal G1]
Optimality of A*(2)
f(G2) = g(G2)   since h(G2) = 0
      > g(G1)   since G2 is suboptimal
      ≥ f(n)    since h is admissible
Hence f(G2) > f(n), so A* will never select G2 for expansion before n
Heuristic Functions
We'll now shed some light on h(n) in general. Let's take another look at the 8-puzzle
[Figure: 8-puzzle start state and goal state]
Average solution cost for a randomly generated instance is about 22 steps Branching factor is about 3
Heuristic Functions(2)
Exhaustive search would look at about 3.1 × 10^10 states. Keeping track of repeated states, this can be cut down to 181,440. For the 8-puzzle it's manageable; for the 15-puzzle the corresponding number is roughly 10^13. We need a good heuristic function
Heuristic Functions(3)
Commonly used candidates:
h1(n) = the number of misplaced tiles
h2(n) = the total Manhattan distance (i.e., the number of squares from its desired location, summed over all tiles)
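Both heuristics are straightforward to compute; a small sketch (states are represented here as a tuple of nine entries with 0 for the blank, and the goal layout 0-8 in reading order is just an illustrative convention):

GOAL = (0, 1, 2, 3, 4, 5, 6, 7, 8)   # assumed goal layout, 0 = blank

def h1(state):
    """Number of misplaced tiles (blank not counted)."""
    return sum(1 for i, tile in enumerate(state) if tile != 0 and tile != GOAL[i])

def h2(state):
    """Total Manhattan distance of each tile from its goal square."""
    dist = 0
    for i, tile in enumerate(state):
        if tile == 0:
            continue
        g = GOAL.index(tile)
        dist += abs(i // 3 - g // 3) + abs(i % 3 - g % 3)
    return dist

s = (7, 2, 4, 5, 0, 6, 8, 3, 1)       # a sample start state
print(h1(s), h2(s))                   # -> 8 18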
Heuristic Functions(4)
Both h1 and h2 are admissible. How efficient are they? Typical search costs:
d = 14: IDS = 3,473,941 nodes; A*(h1) = 539 nodes; A*(h2) = 113 nodes
d = 24: IDS ≈ 54,000,000,000 nodes; A*(h1) = 39,135 nodes; A*(h2) = 1,641 nodes
Quality of Heuristics
h2 seems to be better than h1
Is this always the case? Yes, as h2 dominates h1, i.e., h2(n) ≥ h1(n) for all n. In terms of efficiency this means that A* using h2 will never expand more nodes than A* using h1. If you are not sure about dominance: given any admissible heuristics h_a and h_b, h(n) = max(h_a(n), h_b(n)) is also admissible and dominates both h_a and h_b
Chapter Summary
Heuristic functions estimate costs of shortest paths. Good heuristics can dramatically reduce search cost. Greedy search expands the node with the lowest h(n): incomplete and not always optimal. A* search expands the node with the lowest g(n) + h(n): complete and optimal, and also very efficient. Admissible heuristics can be derived by relaxing problems
Chapter 5
Hill climbing
Outline
Simulated Annealing Local Beam Search
Motivation
The search algorithms up to now memorize the path from the initial state to the goal. In many problems the path is irrelevant; we are only interested in a solution (e.g. the 8-queens problem). This class of problems includes many important applications: integrated-circuit design, factory-floor layout, job scheduling, network optimization, vehicle routing, portfolio management
Variants of this iterative-improvement approach (e.g. pairwise exchanges for the traveling salesman problem) get within 1% of the optimum very quickly (for thousands of cities)
Example: n-queens
Put n queens on an n × n board with no two queens sharing a row, column, or diagonal. Iterative improvement: move a queen to reduce the number of conflicts
Hill-climbing Search
Simply a loop that continues moving in the direction of increasing value Terminates when it reaches a peak (where no neighboring state has a higher value) Does not look beyond immediate neighbors of current state (greedy local search) Like climbing Everest in thick fog with amnesia
Hill-climbing Search(2)
function HILL-CLIMBING(problem)
    current = INITIAL-STATE(problem)
    loop do
        neighbor = highest-valued successor of current
        if VALUE(neighbor) <= VALUE(current) then return STATE(current)
        current = neighbor
    end
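A Python rendering of the same loop for the n-queens formulation (one queen per column; here we minimize the number of attacking pairs, which is equivalent to maximizing its negation — the representation is an illustrative choice):

import random

def conflicts(state):
    """Number of attacking pairs; state[c] is the row of the queen in column c."""
    n = len(state)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if state[i] == state[j] or abs(state[i] - state[j]) == j - i)

def hill_climbing(state):
    """Move to the best neighbor until no neighbor improves (may stop at a local optimum)."""
    while True:
        neighbors = [state[:c] + (r,) + state[c + 1:]
                     for c in range(len(state)) for r in range(len(state)) if r != state[c]]
        best = min(neighbors, key=conflicts)
        if conflicts(best) >= conflicts(state):
            return state
        state = best

start = tuple(random.randrange(8) for _ in range(8))
solution = hill_climbing(start)
print(solution, conflicts(solution))   # conflict count 0 unless a local optimum was hit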
[Figure: 8-queens states with h = 5, h = 2, and h = 0 conflicts; state-space landscape showing the current state, local maxima, and the global maximum]
Simulated Annealing(2)
function SIM-ANNEALING(problem, schedule)
    current = INITIAL-STATE(problem)
    t = 1
    loop do
        temperature = schedule[t]
        if temperature = 0 then return current
        next = randomly selected successor of current
        diff = VALUE(next) - VALUE(current)
        if diff > 0 then current = next
        else current = next only with probability e^(diff / temperature)
        t = t + 1
    end
Simulated Annealing(3)
Vivid description: getting a ping-pong ball into the deepest crevice of a bumpy surface (the hill-climbing picture turned upside down). Left alone, the ball will roll into a local minimum. If we shake the surface, we can bounce the ball out of a local minimum. The trick is to shake hard enough to get it out of local minima, but not hard enough to dislodge it from the global one. We start by shaking hard and then gradually reduce the intensity of shaking
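The shaking metaphor translates into a few lines of Python; a sketch that minimizes the n-queens conflict count, with an exponentially decaying temperature schedule (the schedule and constants are arbitrary illustrative choices, not from the lecture):

import math, random

def conflicts(state):
    """Attacking queen pairs; state[c] is the row of the queen in column c."""
    return sum(1 for i in range(len(state)) for j in range(i + 1, len(state))
               if state[i] == state[j] or abs(state[i] - state[j]) == j - i)

def simulated_annealing(state, steps=20000, t0=2.0, decay=0.999):
    """Accept downhill moves with probability exp(diff / temperature); temperature decays each step."""
    temperature = t0
    for _ in range(steps):
        if conflicts(state) == 0 or temperature < 1e-6:
            break
        c, r = random.randrange(len(state)), random.randrange(len(state))
        nxt = state[:c] + (r,) + state[c + 1:]
        diff = conflicts(state) - conflicts(nxt)        # > 0 if the move is an improvement
        if diff > 0 or random.random() < math.exp(diff / temperature):
            state = nxt
        temperature *= decay
    return state

print(conflicts(simulated_annealing(tuple(random.randrange(8) for _ in range(8)))))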
Chapter Summary
We covered search algorithms that do not care about path from initial state to goal Only solutions are relevant for local search algorithms
Chapter 6
Outline
Constraint Satisfaction Problems (CSP) examples Backtracking search for CSPs Problem structure and problem decomposition
Motivation
Up to now the states in search spaces were black boxes to the search algorithms, only accessible by problem-specific routines: the successor function, heuristic function, and goal test. The search algorithm itself had no knowledge about the internals of the states. We now look at CSPs, whose states and goal tests conform to a standard, structured, and simple representation. Consequence: search algorithms can use general-purpose rather than problem-specific heuristics
Representation of a CSP
A CSP is defined by a set of variables X1, X2, ..., Xn and a set of constraints C1, C2, ..., Cm. Each variable has a domain Di of possible values. Each constraint specifies the allowable combinations of values for some subset of the variables. An assignment not violating any constraints is called consistent (or legal). A consistent, complete assignment (involving every variable) is a solution. Some CSPs also require a solution that maximizes an objective function
Example: Map-coloring
[Figure: map of Australia showing Western Australia, Northern Territory, South Australia, Queensland, New South Wales, Victoria, and Tasmania]
Variables: WA, NT, Q, NSW, V, SA, T Domains: Di = {red, green, blue} Constraints: adjacent regions must have different colors
Example: Map-coloring(2)
Formal description of the constraints: WA ≠ NT, WA ≠ SA, NT ≠ SA, NT ≠ Q, ... Or, depending on the description language allowed: (WA, NT) ∈ { (red, green), (red, blue), (green, red), ... }, (WA, SA) ∈ { (red, green), (red, blue), (green, red), ... }, (NT, SA) ∈ { (red, green), (red, blue), (green, red), ... }. We now have to find a complete assignment that does not violate any constraint
Example: Map-coloring(3)
Possible solution: { WA=red, NT=green, Q=red, NSW=green, V=red, SA=blue, T=green }
Constraint Graph
Binary CSP: each constraint relates at most two variables Constraint graph: nodes are variables, edges are constraints
[Figure: constraint graph for the Australia map-coloring problem — nodes WA, NT, SA, Q, NSW, V, T with edges between adjacent regions]
Constraint Graph(2)
General-purpose CSP algorithms use the graph structure. It is a relatively simple way to describe a CSP and can speed up search; e.g. Tasmania is an independent subproblem
Varieties of Constraints
Unary constraints involve a single variable, e.g. SA ≠ green. Binary constraints involve a pair of variables, e.g. SA ≠ WA. Higher-order constraints involve three or more variables, e.g. cryptarithmetic column constraints (example in just a moment). Preferences (soft constraints), e.g. red is better than green; often represented by costs for variable assignments, also called constrained optimization problems
Example: Cryptarithmetic
  T W O
+ T W O
--------
F O U R
Variables: F, T, U, W, R, O plus the column carry digits X1, X2, X3
Varieties of CSPs
Discrete variables. Finite domains of size d: O(d^n) complete assignments with n variables; e.g. Boolean CSPs: Boolean satisfiability (NP-complete). Infinite domains (integers, strings, etc.): e.g. job scheduling, where variables are start/end days for each job; formulated via a constraint language, e.g. StartJob1 + 5 ≤ StartJob3. Linear constraints are solvable, nonlinear constraints are undecidable (in the general case)
Varieties of CSPs(2)
Continuous variables E.g. start/end times for Hubble Telescope observations Linear constraints solvable in polynomial time by Linear Programming methods Very common in the real world, widely studied in Operations Research
Real-world CSPs
Assignment problems, e.g. who teaches what class? Timetabling problems, e.g. which train arrives and leaves when and where? Transportation scheduling, e.g. which vehicle leaves when and where and carries which goods with it? Usually very hard to solve
Backtracking Search
function BACKTRACK(csp)
    return RECURSIVE-BACKTRACK({}, csp)

function RECURSIVE-BACKTRACK(assignment, csp)
    if assignment is complete then return assignment
    var = SELECT-UNASSIGNED(VARIABLES(csp), assignment, csp)
    for each value in DOMAIN(var, assignment, csp) do
        if value is consistent with CONSTRAINTS(csp) then
            add {var = value} to assignment
            result = RECURSIVE-BACKTRACK(assignment, csp)
            if result <> failure then return result
            remove {var = value} from assignment
    end
    return failure
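A compact Python version of recursive backtracking, applied to the Australia map-coloring CSP (variable ordering and data layout are illustrative choices):

def backtrack(assignment, variables, domains, neighbors):
    """Depth-first search assigning one variable per level; backtracks on violation."""
    if len(assignment) == len(variables):
        return assignment
    var = next(v for v in variables if v not in assignment)
    for value in domains[var]:
        if all(assignment.get(n) != value for n in neighbors[var]):   # consistency check
            assignment[var] = value
            result = backtrack(assignment, variables, domains, neighbors)
            if result is not None:
                return result
            del assignment[var]
    return None

variables = ["WA", "NT", "SA", "Q", "NSW", "V", "T"]
domains = {v: ["red", "green", "blue"] for v in variables}
neighbors = {"WA": ["NT", "SA"], "NT": ["WA", "SA", "Q"], "SA": ["WA", "NT", "Q", "NSW", "V"],
             "Q": ["NT", "SA", "NSW"], "NSW": ["Q", "SA", "V"], "V": ["SA", "NSW"], "T": []}
print(backtrack({}, variables, domains, neighbors))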
Backtracking Example
[Figure: step-by-step backtracking assignment of colors to the Australia map]
Improving Efficiency
Backtracking can be improved in terms of efficiency by looking at: Which variable should be assigned next? In what order should its values be tried? Can we detect inevitable failure early? Can we take advantage of the problem structure?
Degree Heuristic
If there are ties among minimum-remaining-values (MRV) variables, use the degree heuristic: choose the variable with the most constraints on remaining variables
Forward Checking
Idea: keep track of remaining legal values for unassigned variables Terminate branch of search when any variable has no legal values
[Figure: forward checking on the map-coloring problem — after each assignment the remaining legal values of WA, NT, Q, NSW, V, SA, T shrink; a variable with no legal values left signals failure]
Constraint Propagation
Forward checking propagates information from assigned to unassigned variables, but doesn't provide early detection for all failures. E.g. NT and SA cannot both be blue:
Constraint Propagation(2)
Constraint propagation repeatedly enforces constraints locally. Forward checking propagates from WA and Q onto NT and SA. We want to continue by propagating onto the constraint between NT and SA. And we want to do this efficiently: reducing the search space is no good if it takes longer than simple search
Arc Consistency
A method of constraint propagation that is stronger than forward checking. "Arc" refers to a (directed) edge in the constraint graph, e.g. there is an arc from SA to NSW. An arc is consistent iff for every value x of SA, there is some value y of NSW that is consistent with x
Arc Consistency(2)
The simplest form of propagation makes each arc consistent. If we find a value x for which no consistent y exists, we delete x. E.g. the arc from NSW to SA
Arc Consistency(3)
If a variable loses a value, neighbors of this variable need to be rechecked E.g. arc from V to NSW
Arc Consistency(4)
Arc consistency detects failure earlier than forward checking E.g. arc from SA to NT
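Making every arc consistent and re-queuing the neighbors of any variable that loses a value is the idea behind AC-3; a minimal Python sketch for "not-equal" constraints (data layout and the small example are illustrative, chosen to reproduce the WA = red, Q = green failure mentioned above):

from collections import deque

def revise(domains, x, y):
    """Remove values of x that have no consistent value in y (constraint: x != y)."""
    removed = [vx for vx in domains[x] if all(vx == vy for vy in domains[y])]
    for vx in removed:
        domains[x].remove(vx)
    return bool(removed)

def ac3(domains, neighbors):
    """Enforce arc consistency; returns False if some domain becomes empty."""
    queue = deque((x, y) for x in domains for y in neighbors[x])
    while queue:
        x, y = queue.popleft()
        if revise(domains, x, y):
            if not domains[x]:
                return False
            for z in neighbors[x]:
                if z != y:
                    queue.append((z, x))      # recheck the neighbors of x
    return True

# With WA = red and Q = green, arc consistency leaves SA with an empty domain.
domains = {"WA": ["red"], "Q": ["green"], "NT": ["green", "blue"],
           "SA": ["red", "green", "blue"], "NSW": ["red", "green", "blue"]}
neighbors = {"WA": ["NT", "SA"], "NT": ["WA", "SA", "Q"], "SA": ["WA", "NT", "Q", "NSW"],
             "Q": ["NT", "SA", "NSW"], "NSW": ["Q", "SA"]}
print(ac3(domains, neighbors))   # -> False (failure detected early)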
Problem Structure
The structure of the constraint graph can often be exploited. We are going to look at two techniques: independent subproblems and tree-structured problems
Independent Subproblems
Independent Subproblems(2)
Each of these (smaller) subproblems can be solved independently of the others. Performance gains can be quite high: suppose each subproblem has c variables out of n total; the worst-case solution cost is (n/c) · d^c, compared with d^n for the whole problem. Assume n = 80, d = 2, c = 20 and 10 million nodes/sec processing speed. Whole problem: 2^80 ≈ 4 billion years. All subproblems: 4 · 2^20 ≈ 0.4 seconds
Tree-structured CSPs
[Figure: tree-structured constraint graph with nodes A-F]
If the constraint graph has no loops, the CSP can be solved in O(n d^2) time (instead of worst-case O(d^n))
Tasmania and the mainland are independent subproblems, identifiable as connected components of the constraint graph
Tree-structured CSPs(2)
Choose a variable as root, order variables from root to leaves (parent precedes all children)
For j from n down to 2, apply REMOVE-INCONSISTENT(Parent(Xj), Xj). For j from 1 to n, assign Xj consistently with Parent(Xj)
Cutset conditioning: instantiate (in all possible ways) a set of variables such that the remaining constraint graph is a tree
Example: 4-queens
States: 4 queens in 4 columns (4^4 = 256 states). Successor function: move a queen up or down within its column. Goal test: no attacking queens. Evaluation function: h(n) = number of attacks
Performance
Given a random initial state, we can solve n-queens in almost constant time for large n with high probability. In general this works very well for any randomly generated CSP. Exceptions are problems in a narrow range of the ratio R = (number of constraints) / (number of variables)
Chapter Summary
CSPs are a special kind of problem: states are defined by values of a fixed set of variables, and the goal test is defined by constraints on variable values. Backtracking = depth-first search with one variable assignment per node. Various techniques exist to improve performance. Alternative: local search with the min-conflicts heuristic, usually efficient in practice
[Figure: n-queens states with h = 5, 2, 0; CPU time for randomly generated CSPs peaks near a critical value of the ratio R]
Chapter 7
Games Perfect play
Outline
Motivation
Up to now search problems were hard, but nobody was working against us. How do we plan if other agents are planning against us? Games are an ideal domain for exploring the capabilities of AI in terms of adversarial search: the rules are fixed, the scope of the problem is constrained, and the interactions between players are well defined. Yet the problems are far from simple. Can be seen as the Formula 1 of AI research
Adversarial Search
Minimax decisions, α-β pruning, resource limits and approximate evaluation, games of chance, games of imperfect information
Types of Games
                        deterministic            random element
perfect information:    chess, checkers, go, othello    backgammon, monopoly
imperfect information:  battleships, stratego           bridge, poker, scrabble
Representation
We'll first consider games with two players, called MAX and MIN. A game can be formally defined as a kind of search problem. Initial state: includes the board position and identifies the player to move. Successor function: returns a list of legal moves and the resulting states. Terminal test: determines when the game is over (terminal states are states where the game has ended). Utility function: gives numeric values for terminal states (e.g. win = +1, loss = -1, draw = 0)
Representation(2)
[Figure: game tree for Tic-Tac-Toe — MAX (X) and MIN (O) alternate moves down to terminal states with utilities -1, 0, +1]
Minimax
In normal search, optimal solution is a sequence of moves leading to a goal (each terminal state is a win) In a game, however, opponent MIN has something to say about it
Minimax(2)
Idea: choose move to position with highest minimax values Best achievable payoff against best play
We'll first look at deterministic, perfect-information games. MAX must find a contingent strategy: specify MAX's move in the initial state, then MAX's moves in the states resulting from every possible response by MIN, then MAX's moves replying to MIN's responses to those moves, and so on
[Figure: a two-ply game tree — MAX chooses among moves A1, A2, A3; the minimax values of the three MIN nodes are 3, 2, and 2, so MAX's minimax decision is the move with value 3]
Minimax(3)
function MINIMAX-DECISION(state)
    return the a in ACTIONS(state) maximizing MIN-VALUE(RESULT(a, state))

function MIN-VALUE(state)
    if TERMINAL-TEST(state) then return UTILITY(state)
    v = +infinity
    for a, s in SUCCESSORS(state) do
        v = MIN(v, MAX-VALUE(s))
    end
    return v

function MAX-VALUE(state)
    if TERMINAL-TEST(state) then return UTILITY(state)
    v = -infinity
    for a, s in SUCCESSORS(state) do
        v = MAX(v, MIN-VALUE(s))
    end
    return v
α-β Pruning
Problem with minimax: the number of examined game states is exponential in the number of moves. We can't eliminate the exponent, but we can effectively cut it in half.
α-β pruning cuts off branches that cannot possibly influence the final decision
[Figure: step-by-step α-β pruning on the two-ply minimax example tree — once one leaf of a MIN node is no better than the best alternative MAX has already found, that node's remaining leaves are pruned]
Why Is It Called α-β?
[Figure: α is the value of the best choice found so far for MAX along the path to the root; if a MIN node's value becomes worse than α, MAX will avoid that node, so its remaining successors can be pruned; β plays the symmetric role for MIN]
α-β Algorithm
function ALPHA-BETA-DECISION(state)
    return the a in ACTIONS(state) maximizing
        MIN-VALUE(RESULT(a, state), -infinity, +infinity)

function MIN-VALUE(state, alpha, beta)
    if TERMINAL-TEST(state) then return UTILITY(state)
    v = +infinity
    for a, s in SUCCESSORS(state) do
        v = MIN(v, MAX-VALUE(s, alpha, beta))
        if v <= alpha then return v
        beta = MIN(beta, v)
    end
    return v

function MAX-VALUE(state, alpha, beta)
    same as MIN-VALUE, but with the roles of alpha and beta reversed
    (maximize v, return v as soon as v >= beta, update alpha = MAX(alpha, v))
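The same algorithm in Python for a game given as an explicit tree (a sketch; the leaf values below are a commonly used two-ply example, not the lecture's own figure):

def alphabeta(node, maximizing, alpha=float("-inf"), beta=float("inf")):
    """Minimax value of a game tree with alpha-beta pruning.
    A node is either a number (terminal utility) or a list of child nodes."""
    if isinstance(node, (int, float)):
        return node
    if maximizing:
        v = float("-inf")
        for child in node:
            v = max(v, alphabeta(child, False, alpha, beta))
            if v >= beta:
                return v            # beta cut-off: MIN will never allow this branch
            alpha = max(alpha, v)
    else:
        v = float("inf")
        for child in node:
            v = min(v, alphabeta(child, True, alpha, beta))
            if v <= alpha:
                return v            # alpha cut-off: MAX will never allow this branch
            beta = min(beta, v)
    return v

# Two-ply example: MAX to move, three MIN nodes with three leaves each.
tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
print(alphabeta(tree, True))        # -> 3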
Properties of α-β Pruning
Pruning does not affect the final result. Good move ordering improves the effectiveness of pruning. With perfect ordering, time complexity = O(b^(m/2)). Unfortunately, 35^50 is still infeasible
Resource Limits
Standard approach: use CUTOFF-TEST instead of TERMINAL-TEST (e.g. a depth limit), and use EVAL instead of UTILITY, i.e. an evaluation function that estimates the desirability of a position. State of the art: Deep Blue, up to 2 × 10^8 nodes/sec. Assume we have 300 seconds: we can go through 6 × 10^10 nodes ≈ 35^(14/2), which reaches depth 14. The evaluation function is the crucial element in the quality of play
Evaluation Functions
For chess, typically a linear weighted sum of features: EVAL(s) = w1·f1(s) + w2·f2(s) + ... + wn·fn(s). E.g. f1(s) = (# white queens − # black queens) with w1 = 9
Evaluation Functions(2)
Exact values don't matter, only the order matters: behavior is preserved under any monotonic transformation of EVAL
[Figure: two game trees whose leaf values differ only by a monotonic transformation (e.g. 1-4 vs. 20-400) yet yield the same move]
Othello: human champions refuse to play computers, which are too good. Go: human champions refuse to play computers, which are too bad (in Go, b > 300)
Nondeterministic Games
In nondeterministic games, chance is introduced by throwing dice, shuffling cards, etc. MAX knows his own moves, but does not know the next possible moves of MIN. We have to add chance nodes in addition to MAX and MIN nodes. The branches leading from each chance node denote the possible events (each with a probability)
Algorithm
EXPECTIMINIMAX gives perfect play
...
if state is a MAX node then
    return the highest EXPECTIMINIMAX-VALUE of SUCCESSORS(state)
if state is a MIN node then
    return the lowest EXPECTIMINIMAX-VALUE of SUCCESSORS(state)
if state is a chance node then
    return the weighted average of the EXPECTIMINIMAX-VALUEs of SUCCESSORS(state)
...
Evaluation Functions
Exact values do matter here Behavior is preserved only by positive linear transformation of EVAL Hence, EVAL should be proportional to expected payoff
[Figure: expectiminimax trees in which rescaling the leaf values non-linearly (e.g. from 1-4 to 1-400) changes the best move]
Example
MAX's hand: 6 6 9 8; MIN's hand: 4 2 10 5 (cards in four suits). MAX leading the 9 is an optimal play (as is leading any other card in this case). MAX will get two tricks on optimal play by MIN; MIN will get two tricks (with the 2 and the 10). Replacing the 4 in MIN's hand with the 4 of another suit does not make a difference. Can be shown with a suitable variant of minimax
Example(2)
Now let's hide one of MIN's cards: MAX does not know which suit MIN's 4 is in. One could argue: leading the 9 is optimal against either possible hand; as MIN has one of these hands, it's still optimal. But: MIN takes a trick with the 10 and leads the 2; MAX has to discard one of the two 6s. If the wrong card is discarded, MAX will get only one trick
Example(3)
MAX is using what we might call "averaging over clairvoyance": computing the minimax value of each action for each possible deal of the cards, then computing the expected value over all deals (using the probability of each deal). If you think this is reasonable, consider the following
Story Example
Day 1: Road A leads to a heap of gold; Road B leads to a fork: turn left and find a mound of jewels, turn right and get run over by a bus. Day 2: Road A leads to a heap of gold; Road B leads to a fork: turn left and get run over by a bus, turn right and find a mound of jewels. Day 3: Road A leads to a heap of gold; Road B leads to a fork: guess correctly and find a mound of jewels, guess incorrectly and get run over by a bus. Choosing Road B on the first two days is "as optimal as" choosing Road A. Would you choose Road B on the third day?
Proper analysis
With partial observability, the intuition that the value of an action is the average of its values in all possible states is wrong. The value of an action depends on the information state (or belief state) the agent is in. The correct strategy is to generate and search a tree of information states. This leads to rational behavior such as acting to obtain information, signaling to one's partner, and acting randomly to minimize information disclosure
Chapter Summary
Games illustrate several important points about AI Perfection is unattainable, we need to approximate Uncertainty constrains the assignment of values to states Optimal decisions depend on information state, not real state
Chapter 8
Brains Neural networks
Outline
Brains
A neuron is a brain cell whose function is to collect and process electrical signals
Feed-forward networks
Neural Networks
Single-layer networks Multi-layer networks Recurrent networks Elman networks Learning Supervised Learning Unsupervised Learning Reinforcement Learning
[Figure: biological neuron — nucleus, dendrites, axon, and synapses connecting to other neurons]
Brains(2)
The brain's information-processing capacity is thought to emerge primarily from networks of neurons. There are approx. 10^11 neurons in the human brain, connected via approx. 10^14 synapses, with 1 ms-10 ms cycle times. Some of the earliest AI work aimed to create artificial neural networks
Artificial Neuron
McCulloch and Pitts devised a simple mathematical model of a neuron. It is a gross oversimplification of real neurons; its purpose was to develop an understanding of what networks of simple units can do
Artificial Neuron(2)
[Figure: a single unit — input links carrying activations a_j with weights W_j,i (including a bias weight W_0,i on a fixed input a_0), the weighted sum in_i, the activation function g, and the output a_i on the output links]
Each unit i first computes the weighted sum of its inputs: in_i = Σ_{j=0..n} W_j,i a_j. Then it applies an activation function g to derive its output: a_i = g(in_i) = g(Σ_{j=0..n} W_j,i a_j)
Activation Function
Activation function g is designed to meet two desiderata: Unit should be active (near 1) when right inputs are given and inactive (near 0) when the wrong inputs are given Activation needs to be nonlinear, otherwise entire neural network collapses into a simple linear function Two typical activation functions are Threshold function (or step function) Sigmoid function In general, activation functions are monotonically increasing
Activation Function(2)
[Figure: (a) threshold activation function and (b) sigmoid activation function, both plotted against in_i; units with suitable weights (e.g. W_1 = W_2 = 1, bias W_0 = 0.5 for OR) implementing Boolean gates]
(a) is the threshold function: g(x) = 1 for x > 0, and 0 otherwise. (b) is the sigmoid function: g(x) = 1 / (1 + e^(-x)). Usually the bias weight W_0,i is used to move the threshold location: g(in_i) = g(x − W_0,i), where x is the weighted sum of the remaining inputs
Using neurons, we can build a network to compute any Boolean function of the inputs
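For instance, a single threshold unit with suitably chosen weights computes AND, OR, or NOT of its inputs; a small sketch (the particular weight and bias values are just one workable choice, not taken from the lecture):

def unit(weights, bias, inputs):
    """McCulloch-Pitts style threshold unit: output 1 iff the weighted sum exceeds the bias."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) > bias else 0

AND = lambda a, b: unit([1, 1], 1.5, [a, b])
OR  = lambda a, b: unit([1, 1], 0.5, [a, b])
NOT = lambda a:    unit([-1], -0.5, [a])

for a in (0, 1):
    for b in (0, 1):
        print(a, b, AND(a, b), OR(a, b))
print(NOT(0), NOT(1))   # -> 1 0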
Network Structures
Two main categories of neural network structures. Feed-forward networks: represent a function of the current input; no internal state other than the weights. (Cyclic or) recurrent networks: feed outputs back into inputs; are dynamical systems (may reach a stable state, exhibit oscillations, or even chaotic behavior); can support short-term memory. This makes them more interesting, but also harder to understand
Feed-forward Example
[Figure: feed-forward network with input units 1 and 2, hidden units 3 and 4, and output unit 5, connected by weights W_1,3, W_1,4, W_2,3, W_2,4, W_3,5, W_4,5]
Single-layer Networks
A network with all inputs connected directly to the outputs is called a single-layer neural network (or perceptron network). Each output unit is independent of the others, so we look at a single output unit. We start by examining the expressiveness of perceptrons. As already seen, simple Boolean functions are possible. The majority function (output 1 if more than half of the inputs are 1) is also possible: W_j = 1, threshold W_0 = n/2
Simple neural network with two inputs, one hidden layer of two units, and one output Feed-forward networks are usually arranged in layers (each unit receives input from the immediately preceding layer)
Expressiveness
Expressiveness(2)
A perceptron represents a linear separator in input space. A threshold perceptron returns 1 iff the weighted sum of its inputs is positive: Σ_{j=0..n} W_j x_j > 0. Or, interpreting the W_j and x_j as vectors, W · x > 0. This defines a hyperplane in the input space; the perceptron returns 1 if the input is on one side of that plane
Expressiveness(3)
[Figure: input spaces for (a) x1 AND x2, (b) x1 OR x2, and (c) x1 XOR x2 — the first two are linearly separable, the third is not]
Consider perceptron with threshold function Can represent AND, OR, NOT, majority, but e.g. not XOR:
[Figure: single-layer network (input units wired directly to output units via weights W_j,i) and the output of a two-input perceptron plotted over the input space]
Output units all operate separately, no shared weights Adjusting weights moves the location, orientation, and steepness of cliff
Multilayer Networks
Layers are usually fully connected
[Figure: two-layer feed-forward network — input units I_k, weights W_k,j, hidden units a_j, weights W_j,i, output units O_i]
Expressiveness
All continuous functions with 2 layers, all functions with 3 layers
[Figure: output h_W(x1, x2) of a multilayer network — combining threshold functions produces a ridge and then a bump over the input space]
Recurrent Networks
Recurrent neural networks have feedback connection (to store information over time) Elman network is a simple one that makes a copy of the hidden layer This copy is called context layer Context layer stores the previous state of the hidden layer
Combine two opposite-facing threshold functions to make a ridge. Combine two perpendicular ridges to make a bump. Add various bumps to fit any surface
Elman network
Elman Network(2)
The context layer feeds previous network states into the hidden layer. Input vector: x = (x_1, ..., x_n, x_{n+1}, ..., x_{2n}), where the first n components are the actual inputs and the last n are the context units. The connection from each hidden unit to its corresponding context unit has weight 1. The context units are fully interconnected with all hidden units (not necessarily with weight 1)
Learning
For simple Boolean or majority functions it is easy to find appropriate weights. Generally, by adjusting the weights, we change the function that a network represents. That is how learning occurs in neural networks. When we have no prior knowledge about the function except for data, we have to learn values for the W_j from this data
[Figure: Elman network — input units and the context layer feed the hidden units; the hidden units feed the output units, and their activations are copied back into the context layer]
Learning(2)
Three main types of learning (we are looking at the first two). Supervised learning: the network is provided with a data set of input vectors and desired outputs (the training set); adjust the weights so that the error between the real output and the desired output is minimized. Unsupervised learning: clusters the training set to discover patterns or features in the input data. Reinforcement learning: reward the network for good performance, penalize it for bad performance
Supervised Learning
Gradient descent is a widely popular approach to train (single-layer) networks. Idea: adjust the weights of the network to minimize some measure of the error on the training set. The classical measure of error is the sum of squared errors. The squared error for a single training example with input x and desired output y is E = ½ Err² = ½ (y − h_W(x))², where h_W(x) is the output of the perceptron
Gradient Descent
Depending on the gradient of the error, we increase or decrease the weight
[Figure: error as a function of a weight — gradient descent moves the weight toward the minimum]
Gradient Descent(2)
For calculating the gradient, we need some calculus We need to determine a partial derivative of E with respect to each weight:
∂E/∂W_j = ∂(½ Err²)/∂W_j = Err · ∂Err/∂W_j = Err · ∂/∂W_j ( y − g(Σ_{j=0..n} W_j x_j) ) = −Err · g′(in) · x_j
Gradient Descent(3)
In the gradient descent algorithm, Wj s are updated as follows:
In the gradient descent algorithm, the W_j are updated as follows: W_j ← W_j + α · Err · g′(in) · x_j, where α is the learning rate
Gradient Descent(4)
Complete algorithm runs training examples through the net one at a time (adjusting the weights slightly) Each cycle is called an epoch Epochs are repeated until some stopping criterion is reached E.g. weight changes become very small Only converges for linearly separable data set
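In code, one epoch of this update rule over a small training set might look as follows (a sketch with a sigmoid unit; the data, learning rate, and epoch count are arbitrary illustrative choices):

import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train(examples, n_inputs, alpha=0.5, epochs=5000):
    """Gradient descent for a single sigmoid unit.  W[0] is the bias weight (fixed input 1)."""
    w = [random.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
    for _ in range(epochs):
        for x, y in examples:                      # one pass over the data = one epoch
            a = [1.0] + list(x)                    # prepend the fixed bias input
            s = sum(wj * aj for wj, aj in zip(w, a))
            err = y - sigmoid(s)
            gprime = sigmoid(s) * (1.0 - sigmoid(s))
            w = [wj + alpha * err * gprime * aj for wj, aj in zip(w, a)]
    return w

# Learn the (linearly separable) OR function.
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w = train(data, 2)
print([round(sigmoid(w[0] + w[1] * x1 + w[2] * x2)) for (x1, x2), _ in data])   # typically [0, 1, 1, 1]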
Gradient Descent(5)
Different variants for cycling through training examples: Batch: adding up all gradient contributions and adjusting weights at end of epoch Stochastic: select examples randomly There are many other methods besides gradient descent: Widrow-Hoff Generalized Delta Error-Correction ...
Back-propagation Learning
Learning in a multi-layer network is a little different. Minor difference: we now have several outputs and an output vector h_W(x). Major difference: the error at the output layer is clear, but the error in the hidden layers is unclear. Idea: back-propagate the error from the output layer to the hidden layers
Back-propagation Learning(2)
At the output layer, the weight update is identical to gradient descent. We have multiple output units, so let Err_i be the i-th component of the error vector y − h_W(x); then
W_j,i ← W_j,i + α · Err_i · g′(in_i) · x_j, i.e. W_j,i ← W_j,i + α · Δ_i · x_j with Δ_i = Err_i · g′(in_i)
Back-propagation Learning(3)
Idea: hidden node j is responsible for some fraction of the error in the nodes to which it connects
The Δ_i values are divided according to the weights of the connections: Δ_j = g′(in_j) · Σ_i W_j,i Δ_i. Now we can use the same weight-update rule for the hidden nodes: W_k,j ← W_k,j + α · Δ_j · x_k
Back-propagation Learning(4)
The back-propagation process can be summarized as follows: compute the Δ values for the output units (using the observed error); then, starting with the output layer, repeat for each layer until the earliest layer is reached: propagate the Δ values back to the previous layer and update the weights in the previous layer
Unsupervised Learning
In supervised learning, a supervisor (or teacher) presents an input pattern and a desired response, and the network tries to learn the functional mapping between input and output. Unsupervised learning's objective is to discover patterns or features in the input data. This is done with no help or feedback from a teacher: no explicit target outputs are prescribed; however, similar inputs will result in similar outputs
Reinforcement Learning
In supervised learning an input data set and a full set of desired outputs is presented In reinforcement learning the feedback is not as elaborate Desired output is not described explicitly Learning network only gets feedback whether output was a success or not Learning with a critic (rather than learning with a teacher) Main objective is to maximize the (expected) reward or reinforcement signal
Reinforcement Learning(2)
General situation: [Figure: the learner receives sensory input and a reward signal from the environment, and acts on the environment]
Learning Rule
Neural network reinforcement learning usually requires a multi-layer architecture
An external evaluator is needed to decide whether network has scored a success or not Every node in the network receives a scalar reinforcement signal r representing quality of output
Compared to back-propagation (where output nodes receive error signal, which is propagated backward), here every node receives same signal
Learning Rule(2)
Mazzoni et al. presented following weight-update algorithm (based on Hebbian learning):
W_j,i ← W_j,i + ρ · ( (g(in_i) − p_i) · g(in_j) · r + (1 − g(in_i) − p_i) · g(in_j) · (1 − r) )
Learning Structures
So far, we have only looked at learning weights (given a fixed network structure). How do we find the best network structure? Choosing a network that is too small: it may not be powerful enough to get the task done. Choosing a network that is too big: the problem of overfitting — the network memorizes examples rather than generalizing
Learning Structures(2)
If we stick to fully connected networks, the only choices are the number of hidden layers and the number of neurons in each. Usual approach: try several and keep the best. Try to keep the network small to avoid overfitting
where ρ is a constant and p_i is the probability of neuron i firing. A correct response (large r) will strengthen connections that were active during the response; an incorrect response (small r) will weaken active synapses
Learning Structures(3)
Let us now consider networks that are not fully connected. We need some effective search method to weed out connections. One approach is the "optimal brain damage" algorithm: start with a fully connected network and remove connections from it. After first training the network, an information-theoretic approach identifies a selection of connections to be dropped; the network is retrained, and if performance has not decreased, the process is repeated. It is also possible to remove neurons that are not contributing much to the result
Learning Structures(4)
There are several algorithms for growing a larger network from a smaller one. The tiling algorithm starts with a single unit that tries its best; subsequent units are added to take care of the examples that the first unit got wrong. The algorithm adds only as many units as are needed to cover all examples
Applications: Speech Recognition
Applications: Handwriting Recognition
Applications: Fraud Detection
Banks are using AI software (including neural networks) to detect fraud Have the ability to detect fraudulent behavior by analyzing transactions and alerting staff
Applications: CNC
Neural networks are also used in computer numerically controlled (CNC) machines E.g. Siemens SINUMERIK 840D controller for drilling, turning, milling, grinding and special-purpose machines
400-300-10 unit network: 1.6% error 768-192-30-10 unit LeNet: 0.9% error
Credit card fraud losses in the UK fell for the first time in nearly a decade in 2003 (by more than 5%, to 402.4m pounds). Barclays reported that after installing such a system in 1997, fraud was reduced by 30% by 2003
Applications: Drug Design
Used for testing whether certain anti-inflammatory drugs cause adverse reactions. The rate of these reactions is about 10% (with 1% serious and 0.1% fatal). A three-layer, backpropagation-trained network was used to predict serious reactions. The predicted rate matched the observed rate to within 5%
Chapter Summary
Neural networks are an AI technique modeled on the brain Single-layer feed-forward networks can represent linearly separable functions Multi-layer feed-forward networks can represent any function (given enough units) Recurrent networks can store information over time Many different techniques to train networks Neural networks have been used for hundreds of applications
Chapter 9
Evolutionary Computing
Outline
Introduction to Evolutionary Computing Genetic algorithms Evolutionary programming
Introduction
Genetic algorithms were already mentioned when discussing local search algorithms; now we have a closer look. (Biological) evolution is an optimization process with the aim to improve the ability to survive. The characteristics of an individual are contained in his/her chromosomes. After sexual reproduction, the offspring's chromosomes consist of a combination of the parents' chromosomes. The process of natural selection allows more fit individuals to produce more offspring. One expects the offspring to have similar or even better fitness
Introduction(2)
Occasionally mutations occur. These have a random effect on the chromosomes of an individual and may improve or worsen the fitness of an individual (or the offspring). They introduce some variation into a population. Evolutionary Computing (EC) emulates the process of natural selection in a search procedure
Evolutionary Computing
An evolutionary algorithm (EA) is a stochastic search algorithm comprising: an encoding of solutions to a problem in the form of chromosomes; an initial state: the starting population (usually with randomly determined chromosomes); a successor function: generating offspring given two parents; an evaluation function (or fitness function): determining the fitness of an individual; a selection function: choosing the individuals to reproduce
Example
Solving the 8-queens problem using GAs
The n-th number in the chromosome stands for the position of the queen within the n-th column
Example(2)
Initial population: 24748552, 32752411, 24415124, 32543213
Fitness values and selection probabilities: 24 (31%), 23 (29%), 20 (26%), 11 (14%)
Example(3)
Offspring after crossover and mutation: 32748152, 24752411, 32252124, 24415417
Algorithm
function GENETIC-ALG(population, FITNESS-FN)
  repeat
    new_pop = empty set
    loop for i from 1 to SIZE(population) do
      x = RAND-SELECT(population, FITNESS-FN)
      y = RAND-SELECT(population, FITNESS-FN)
      child = REPRODUCE(x, y)
      if small random probability then child = MUTATE(child)
      add child to new_pop
    end
    population = new_pop
  until some individual fit enough or enough time has elapsed
  return best individual
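A minimal Python sketch of this loop, specialised to the 8-queens encoding of the example (rows are numbered 0-7 here instead of 1-8); population size, mutation probability and generation limit are assumed values.

import random

N = 8  # board size: chromosome[i] = row of the queen in column i

def fitness(chrom):
    """Number of non-attacking pairs of queens (28 means a solution for N=8)."""
    ok = 0
    for i in range(N):
        for j in range(i + 1, N):
            if chrom[i] != chrom[j] and abs(chrom[i] - chrom[j]) != j - i:
                ok += 1
    return ok

def select(population):
    """Random selection biased by fitness (roulette wheel)."""
    return random.choices(population, weights=[fitness(c) for c in population])[0]

def reproduce(x, y):
    """Single random crossover point; child takes a prefix of x and a suffix of y."""
    c = random.randint(1, N - 1)
    return x[:c] + y[c:]

def mutate(chrom):
    """Replace the row of one randomly chosen queen."""
    i = random.randrange(N)
    return chrom[:i] + [random.randrange(N)] + chrom[i + 1:]

def genetic_algorithm(pop_size=100, p_mut=0.05, max_gen=1000):
    population = [[random.randrange(N) for _ in range(N)] for _ in range(pop_size)]
    for _ in range(max_gen):
        best = max(population, key=fitness)
        if fitness(best) == N * (N - 1) // 2:   # all pairs non-attacking: solved
            return best
        new_pop = []
        for _ in range(pop_size):
            child = reproduce(select(population), select(population))
            if random.random() < p_mut:
                child = mutate(child)
            new_pop.append(child)
        population = new_pop
    return max(population, key=fitness)

print(genetic_algorithm())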
Fitness
Selection
As fitness function we use the number of nonattacking pairs of queens (here the probability of being chosen is proportional to fitness) Two individuals are chosen randomly (biased by these probabilities) for reproduction A random crossover point determines which fragments will be exchanged when reproducing
000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 111 000 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111
111 000 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111
Mutation
The aim of mutation is to introduce new genetic material, adding diversity to the population Usually a small probability for mutations to occur is chosen
This ensures that good solutions are not distorted too much An initially large mutation rate that decreases exponentially can also be quite successful (similarity to simulated annealing)
Assessment of GAs
Genetic algorithms are similar to stochastic local beam searches Combine an uphill tendency in searching with random exploration Exchange information among parallel search threads Crossover seems to be the crucial component of GAs
Assessment of GAs(2)
However, crossover conveys no advantage if the positions of the chromosomes are randomly permuted initially The advantage comes from combining large blocks that have evolved independently to perform a useful function E.g. putting queens in positions 7, 4, and 2 in the first three columns They don't attack each other (useful block) Could be combined with another useful block to construct a solution (e.g. 58136, which is also a useful block) This raises the level of granularity at which the search takes place
Assessment of GAs(3)
The theory of GAs explains this with the idea of a schema A schema is a substring in which some of the positions can be left unspecified E.g. 742***** describes all states in which the first three queens are at positions 7, 4, and 2 Strings that match a schema are called instances of a schema E.g. 74213378
Assessment of GAs(4)
If the average fitness of a schema's instances is above average, the number of this schema's instances will grow over time This effect is unlikely to be significant if adjacent positions are totally unrelated There will be few contiguous blocks that provide consistent benefit GAs work best when schemas correspond to meaningful components of a solution Successful use of genetic algorithms requires careful engineering of the representation
Applications
Parametric Design of Aircraft Optimizing aircraft designs Task is posed as that of optimizing a list of parameters Routing in circuit-switched telecommunications networks Optimize the routing of telephone networks in order to minimize costs to US West Hybridization with other algorithms can lead to better performance
Applications(2)
Robot trajectory generation Planning the path that a robot arm takes Not only optimizing the length of the path, but also wear and tear on arm (by acceleration and deceleration) Tuning for sonar information processing Training neural networks classifying sonar signals using GAs
Evolutionary Programming
Evolutionary Programming (EP) emphasizes behavioral models and not genetic models EP is derived from the simulation of adaptive behavior in evolution EP considers phenotypes (not genotypes) It is about finding a set of optimal behaviors; the fitness function measures behavioral error
Finite-State Machines
How do we model behavior? A popular way to do this is using finite-state machines (FSMs) An FSM describes a sequence of actions that are taken Each action depends on the current state of the machine and the input
Finite-State Machines(2)
Formal definition:
FSM = (S, s_0, I, O, δ, ω)
where
S is a finite set of states, s_0 ∈ S is the initial state, I is a finite set of input symbols, O is a finite set of output symbols, δ : S × I → S is the next-state function, ω : S × I → O is the next-output function
Finite-State Machines(3)
Example: S = {X, Y, Z}, s_0 = Z, I = {0, 1}, O = {f, t}
(Figure: state-transition diagram over the states X, Y, Z; edges are labelled input/output with the labels 0/t, 0/f, 0/f, 1/t, 1/t, 1/f)
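The machine can be written down directly as two lookup tables. The input/output edge labels follow the figure, but since the diagram is not fully recoverable, the next-state entries marked as assumed are illustrative guesses; they are chosen so that the machine reproduces the prediction example shown a few slides below.

# Dictionary-based FSM matching the formal definition above.
# Next-state targets marked "assumed" are not recoverable from the slide.
DELTA = {('Z', '0'): 'X', ('Z', '1'): 'Z',   # ('Z','1') target assumed
         ('X', '0'): 'Y', ('X', '1'): 'X',   # ('X','1') target assumed
         ('Y', '0'): 'Y', ('Y', '1'): 'Z'}   # ('Y','0') target assumed
OMEGA = {('Z', '0'): 'f', ('Z', '1'): 't',
         ('X', '0'): 't', ('X', '1'): 'f',
         ('Y', '0'): 'f', ('Y', '1'): 't'}

def run_fsm(inputs, state='Z'):
    """Feed an input sequence to the FSM and collect the output sequence."""
    outputs = []
    for symbol in inputs:
        outputs.append(OMEGA[(state, symbol)])
        state = DELTA[(state, symbol)]
    return outputs

print(run_fsm('0010011'))  # ['f', 't', 't', 'f', 't', 't', 't']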
FSMs and EP
Evolutionary programming was originally developed to evolve finite-state machines The aim of early applications was to make predictions about the future Given a sequence of previously observed symbols, evolve a program to predict the next symbol
Making Predictions
For the example FSM the following input sequence produces the following output:
Sequence: 0 0 1 0 0 1 1 ...
Output:   f t t f t t t ...
Interpreting f=0 and t=1, the FSM made just one mistake
Algorithm
function EVOLUTION-PROG(population, FITNESS-FN)
  repeat
    new_pop, tmp_pop = empty set
    loop for i from 1 to SIZE(population) do
      child = SELECT-NTH(i, population)
      child = MUTATE(child)
      add child to tmp_pop
    end
    new_pop = SELECT-FITTEST(SIZE(population), population + tmp_pop, FITNESS-FN)
    population = new_pop
  until some individual fit enough or enough time has elapsed
  return best individual
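A compact Python sketch of the same loop; fitness and mutate are assumed problem-specific callbacks (for FSMs, mutation might change a transition target or an output symbol), and the tiny numeric example at the end is purely illustrative.

import random

def evolutionary_programming(population, fitness, mutate, max_gen=1000, good_enough=None):
    """Each parent produces exactly one mutated child; the fittest SIZE(population)
    individuals of parents + children form the next generation (no crossover)."""
    for _ in range(max_gen):
        children = [mutate(ind) for ind in population]
        pool = sorted(population + children, key=fitness, reverse=True)
        population = pool[:len(population)]
        if good_enough is not None and fitness(population[0]) >= good_enough:
            break
    return population[0]

# Illustrative toy problem: evolve a number towards 42 by random mutation
best = evolutionary_programming(
    population=[random.uniform(0, 100) for _ in range(20)],
    fitness=lambda x: -abs(x - 42),
    mutate=lambda x: x + random.gauss(0, 1),
    good_enough=-0.01)
print(best)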
Mutation
In the original EP, there was no crossover, only mutation One point in the behavioral space was seen as standing for a species, not an individual However, EP can be combined with GAs
Application
Nark: a bug-finding compiler extension developed at Stanford Static analysis techniques are used to find bugs in software systems One system, Metacompilation (MC), allows users to encode rules such as "Don't use freed memory" or "Don't call blocking functions with interrupts disabled" To encode a rule, the user describes it as a state machine The source code is used as input to the state machine
Application(2)
Unfortunately, encoding those rules is quite complicated Nark allows the user to simply give examples of (a class of) bugs Nark evolves the checker for the rule itself The complexity of the classes of bugs that Nark is able to find is still somewhat limited
Chapter Summary
In practice, genetic algorithms have had a widespread impact on optimization problems At present it is not quite clear whether the appeal arises from their performance or from their aesthetically pleasing origins Similar things can be said about evolutionary programming (although it is not as widespread as GAs)
Chapter 10
Swarm Intelligence
Outline
General Introduction to Swarm Intelligence Particle swarm optimization (PSO) Ant colony optimization (ACO)
Swarm Intelligence
Simple agents interacting locally with one another and their environment No central control or data source (Simple) local interactions often lead to the emergence of (complex) global behavior Examples found in nature: Ant colonies/bee hives Bird flocking Animal herding Bacteria molding Fish schooling
Examples
(Figure slides: examples of swarm behavior in nature)
Algorithm
Formally speaking, each particle i has: a position vector x_i(t) describing its position at time t, and a current velocity v_i(t) at time t
Algorithm(2)
The position of a particle is changed by adding the velocity vector to the position vector:
x_i(t + 1) = x_i(t) + v_i(t)
Algorithm(3)
Change the velocity vector slightly (random factor r) to point into the direction of the best neighbor:
v_i(t + 1) = v_i(t) + r (x_best_i(t) − x_i(t))
As this may lead to unanimous, unchanging directions, sometimes a random "craziness" factor is added
function SWARM-OPT(population, FITNESS-FN)
  repeat
    loop for i from 1 to SIZE(population) do
      fit_i = FITNESS-FN(SELECT-NTH(i, population))
    end
    loop for i from 1 to SIZE(population) do
      look for fittest neighbor of particle i
      change velocity vector v_i
      change position x_i
    end
  until some individual fit enough or enough time has elapsed
  return best individual
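A minimal Python sketch of this basic form of PSO (global-best neighborhood, no inertia weight or personal-best term); the position update applies the freshly updated velocity, and particle count, step count and search range are assumed values.

import random

def pso(fitness, dim, n_particles=30, steps=200, lo=-5.0, hi=5.0):
    """Minimal 'follow the fittest particle' PSO using the update rules above."""
    x = [[random.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    v = [[0.0] * dim for _ in range(n_particles)]
    best_seen, best_fit = None, float('-inf')
    for _ in range(steps):
        fit = [fitness(p) for p in x]
        leader = x[fit.index(max(fit))][:]               # fittest particle this step
        if max(fit) > best_fit:
            best_fit, best_seen = max(fit), leader[:]
        for i in range(n_particles):
            r = random.random()
            for d in range(dim):
                v[i][d] += r * (leader[d] - x[i][d])     # v_i(t+1) = v_i(t) + r (x_best_i(t) - x_i(t))
                x[i][d] += v[i][d]                       # move the particle with the new velocity
    return best_seen

# Example: maximise -(a^2 + b^2), i.e. move the swarm towards the origin
print(pso(lambda p: -(p[0] ** 2 + p[1] ** 2), dim=2))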
Neighborhood
Different neighborhood types have been defined and studied Star topology Every particle can communicate with every other particle Each particle is attracted to the best global solution Was used in the first version of PSO
Neighborhood(2)
Ring topology Every particle communicates with its n immediate neighbors Diagram below shows case for n = 2 Hybrids with star topology are possible (vi is changed towards best neighbor and best overall)
Neighborhood(3)
Wheel topology Only one particle is connected to all others, all other particles are only neighbors to this focal particle Isolates particles from each other, all particles communicate through focal particle Creates a follow the leader effect
Applications
Biochemistry: Improving the fermentation medium for Echinocandin B production Military: Traveling Salesman Problem for Surveillance Mission Electrical engineering: Reactive Power and Voltage Control
Division of labor in an ant colony:
Reproduction: queen
Brood care: specialized worker
Food collection: specialized worker
Defense: soldier
Nest cleaning: specialized worker
Nest building & maintenance: specialized worker
Seems to occur magically Actually it is based on two different things: anatomical differences and stigmergy
Food Collection
Ants have the ability to find the shortest path between a food source and their nest
Food Collection(2)
Several experiments have been conducted to study this behavior Initially, paths are chosen randomly With time, more and more ants follow the shorter path
Food Collection(3)
What's the reason for that? The common ant is not very intelligent (a few hundred neurons) It's done via stigmergy When walking around, each ant leaves behind a pheromone trail When an ant has to decide which path to follow, usually it picks the one with the higher pheromone concentration Ants on the shorter path will return faster, leaving more pheromone on this path in a shorter time Also, pheromone evaporates with time, so the pheromone on the longer path will vanish faster
ACO usually performs better if mixed with other heuristics (e.g. greedy local optimization taking shortest path):
The probability that ant k, currently at node i, moves to node j:
φ_{i,k}(j) = (τ_ij · η_ij) / ( Σ_{c ∈ J_i^k} τ_ic · η_ic )
where τ_ij is the amount of pheromone on the edge between two nodes i and j, η_ij is the heuristic desirability of that edge (e.g. the inverse of its length, cf. the greedy heuristic above), and J_i^k is the set of nodes ant k may still visit from node i
Algorithm
function ACO-TSP()
  nant = NUMBER-OF-ANTS()
  nnode = NUMBER-OF-NODES()
  place ants on nodes
  repeat
    loop for k from 1 to nant do
      loop for step from 1 to nnode do
        choose next node according to probability phi
      end
    end
    update pheromone trails
  until some tour good enough or enough time has elapsed
  return best tour found so far
The amount of pheromone deposited by ant k on edge (i, j):
Δτ_ij = Q / L_k if ant k used edge (i, j) in its tour, and Δτ_ij = 0 otherwise
where L_k is the length of the tour found by ant k (and Q is a constant) The longer the tour, the worse the solution, and the smaller the amount of pheromone awarded to each link
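Putting the transition probability and the pheromone update together gives a small ACO sketch for the TSP; the evaporation rate rho, deposit constant Q, initial pheromone level and the number of ants and iterations are assumed parameters not given on the slides.

import random

def aco_tsp(dist, n_ants=20, iterations=100, rho=0.5, Q=1.0):
    """ACO for the TSP: every ant builds a tour node by node using the probability
    phi above; then pheromone evaporates (rate rho) and each ant deposits Q / L_k."""
    n = len(dist)
    tau = [[1.0] * n for _ in range(n)]                      # pheromone tau_ij
    eta = [[0.0 if i == j else 1.0 / dist[i][j] for j in range(n)]
           for i in range(n)]                                # heuristic eta_ij = 1 / distance
    best_tour, best_len = None, float('inf')

    def tour_length(tour):
        return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))

    for _ in range(iterations):
        tours = []
        for _ in range(n_ants):
            node = random.randrange(n)
            tour, todo = [node], set(range(n)) - {node}
            while todo:
                cand = list(todo)
                weights = [tau[node][j] * eta[node][j] for j in cand]  # ~ phi_{i,k}(j)
                node = random.choices(cand, weights=weights)[0]
                tour.append(node)
                todo.remove(node)
            tours.append(tour)
        for i in range(n):                                   # evaporation
            for j in range(n):
                tau[i][j] *= (1.0 - rho)
        for tour in tours:                                   # pheromone deposit
            length = tour_length(tour)
            if length < best_len:
                best_len, best_tour = length, tour
            for i in range(n):
                a, b = tour[i], tour[(i + 1) % n]
                tau[a][b] += Q / length                      # delta tau_ij = Q / L_k
                tau[b][a] += Q / length
    return best_tour, best_len

# Tiny symmetric example with four cities on a unit square
cities = [(0, 0), (0, 1), (1, 1), (1, 0)]
d = [[((xa - xb) ** 2 + (ya - yb) ** 2) ** 0.5 for (xb, yb) in cities] for (xa, ya) in cities]
print(aco_tsp(d))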
Applications
One important field in which ACO has been applied is telecommunications routing When routing calls through a network, they go through a number of intermediate switching stations In a large network there are many possible routes Some network parts may experience congestion while others have spare capacity Load balancing tries to distribute calls over the network such that almost no calls are lost and there is a short route between callers
Applications(2)
ACO has been used to optimize the BT network The figure on the right shows the British Synchronous Digital Hierarchy (SDH) network M. Ward, "There's an ant in my phone", New Scientist, 24 January 1998
Applications(3)
Centralized control systems scale badly Usually decentralized approach with several routers is used, each with (local) routing information Main idea of ACO: Enhance routing tables with pheromone information Send virtual ants through the network going from a random source to a random destination Ant going through network updates pheromone information depending on quality of connection (length, congestion)
Chapter Summary
Particle swarm optimization seems to be an efficient and robust technique Although its full potential has not been tapped yet The study of ant colonies is still a young field in computational intelligence More interesting applications are still to be explored
Chapter 11
Fuzzy Systems
Outline
Fuzzy sets and fuzzy logic Approximate reasoning/fuzzy controllers
Motivation
The development of logic has a long and rich history (many philosophers played a role) The foundations of two-valued logic come from Aristotle (Laws of Thought) 400 B.C.: Law of the Excluded Middle: every proposition must have only one of two outcomes: true or false Even back then, there were objections: the Cretan philosopher Epimenides of Knossos said: "All Cretans are liars"
Motivation(2)
Many successes have been achieved with two-valued logic However, not all problems can be mapped into the domain of two-valued variables In most real-world problems incomplete, imprecise, vague, or uncertain data has to be represented With fuzzy logic, domains are characterized by linguistic terms (rather than numbers), e.g. "It is partly cloudy" or "John is very tall" partly and very describe the magnitude of the (fuzzy) variables cloudy and tall
Motivation(3)
In the 1900s Łukasiewicz proposed an alternative in the form of a three-valued logic The possible values are true, false, and undecided Later on, he extended it to a four- and five-valued logic In 1965 Zadeh produced the foundations of an infinite-valued logic in the form of fuzzy logic It was ignored for some time and really took off only after being reimported from Japan
Set Theory
We want to construct the set of all large ants Suppose ants longer than 1.5cm are considered large Clearly, an ant with a length of 3cm will belong, one with 0.5cm will not What about an ant with length 1.48cm or 1.52cm?
Set Theory(2)
Regular sets or crisp sets have a rigid distinction Either an element belongs to the set or not Formally speaking, we have a membership function m_A(x) for set A, which maps elements x of the domain X onto 0 or 1:
m_A : X → {0, 1}
Set Theory(3)
A graphical presentation of our set large ants looks like this:
Fuzzy Sets
In contrast to crisp sets, fuzzy sets have membership degrees That means, in addition to the values 1 (belongs to) and 0 (does not belong to), an element can have any value in between (kind of belongs to) Formally speaking, the membership function μ_A(x) for a fuzzy set A maps elements x to any value in the interval [0, 1]:
μ_A : X → [0, 1]
Fuzzy Sets(2)
A fuzzy set for large ants could look like this:
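A small sketch contrasting the crisp and the fuzzy membership function for large ants; the exact break points of the fuzzy set (membership 0 below 1.0cm, 1 above 2.0cm, hence 0.5 at 1.5cm) are assumptions, since the slide only shows the general shape.

def crisp_large_ant(length_cm):
    """Crisp set: an ant is 'large' exactly when it is longer than 1.5 cm."""
    return 1 if length_cm > 1.5 else 0

def fuzzy_large_ant(length_cm, low=1.0, high=2.0):
    """Fuzzy set: membership rises linearly from 0 (at 1.0 cm) to 1 (at 2.0 cm);
    the break points are assumed for illustration."""
    if length_cm <= low:
        return 0.0
    if length_cm >= high:
        return 1.0
    return (length_cm - low) / (high - low)

for length in (0.5, 1.48, 1.52, 3.0):
    print(length, crisp_large_ant(length), round(fuzzy_large_ant(length), 2))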
Fuzzy Operators
Complement (logical NOT): μ_¬A(u) = 1 − μ_A(u) Union (logical OR): μ_A∪B(u) = max(μ_A(u), μ_B(u)) Intersection (logical AND): μ_A∩B(u) = min(μ_A(u), μ_B(u)) There are alternatives to these operators (which we will not look at here) All operators need to satisfy certain axioms (e.g. commutativity, associativity for union and intersection)
Fuzzy Operators(2)
(Figures: complement of a fuzzy set; union shown in green and intersection in red)
Rudimentary Reasoning
Using the aforementioned operators we can do some simple reasoning For example, consider the three fuzzy sets tall, good_athlete, and good_basketball_player Now assume:
μ_tall(Michael Jordan) = 0.9, μ_good_athlete(Michael Jordan) = 0.9, μ_tall(Sven) = 0.9, μ_good_athlete(Sven) = 0.2
Rudimentary Reasoning(2)
If we know that a good basketball player is tall and a good athlete, then which one is the better player? We can apply the intersection operator and get:
μ_good_basketball_player(Michael Jordan) = min(0.9, 0.9) = 0.9 and μ_good_basketball_player(Sven) = min(0.9, 0.2) = 0.2 So Michael Jordan is the better player However, this is a very simplistic situation For most real-world problems, we have to model much more complex scenarios For these cases (rule-based) fuzzy controllers are used
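The standard operators and the basketball example above can be checked with a few lines of Python:

def f_not(a):            # complement
    return 1.0 - a

def f_or(a, b):          # union
    return max(a, b)

def f_and(a, b):         # intersection
    return min(a, b)

# Degrees of membership from the example above
tall = {'Michael Jordan': 0.9, 'Sven': 0.9}
good_athlete = {'Michael Jordan': 0.9, 'Sven': 0.2}

# A good basketball player is tall AND a good athlete
for person in tall:
    print(person, f_and(tall[person], good_athlete[person]))
# Michael Jordan -> 0.9, Sven -> 0.2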
Fuzzy Controllers
Mainly used for controlling complex dynamic systems In that case, a formal description by mathematical models is very difficult or even impossible Instead of a mathematical model, the knowledge of human experts in the form of linguistic variables and rules is employed
Fuzzy Controllers(2)
In principle, fuzzy controllers work as follows The controller observes its environment, checking for unusual events In a fuzzification phase the input data is transformed into fuzzy sets Based on a (fuzzy) rule set the input data is evaluated and certain actions may be triggered The output data (which is also described in terms of fuzzy sets) needs to be defuzzified
Fuzzy Controllers(3)
Fuzzy controllers can also be seen as intelligent agents using fuzzy logic for their reasoning:
(Diagram: fuzzy controller with its rule set, acting as an agent in its environment)
Fuzzy Rules
Fuzzy rules are of the general form: if antecedent(s) then consequent(s) The antecedents of a rule form a combination of fuzzy sets (which are connected via logic operators) The consequent part is usually a single fuzzy set (multiple combined fuzzy sets can also appear)
Example
Let us look at an exemplary application to clarify the functionality of a fuzzy controller We want to monitor the performance of a Web server running on a cluster The goal is to do (automatic) load balancing in order to use resources efficiently
Example(2)
The cpu load of a machine is described using fuzzy sets:
(Figure: membership functions of the fuzzy sets low, medium, and high for the cpu load)
Example(3)
A machine with a cpu load of 60% has a medium load to a degree of 0.5 and a high load to a degree of 0.2:
Example(4)
We have different ways to react to a situation To keep things simple, we look at two of them Scale-up: moving a service to a more powerful machine Scale-out: starting a new instance of a service
Example(5)
Let's assume that cpuLoad and performanceIndex are input variables (performanceIndex expressing how powerful a machine is) and scaleUp and scaleOut are output variables Then rules could look like this:
IF (cpuLoad IS high AND (performanceIndex IS low OR performanceIndex IS medium)) THEN scaleUp IS applicable
IF (cpuLoad IS high AND performanceIndex IS high) THEN scaleOut IS applicable
Example(6)
Let's assume that we have a cpu load of 90%; then for the degrees of membership we get:
μ_low_load(90) = 0.0, μ_medium_load(90) = 0.0, μ_high_load(90) = 0.8
Furthermore assume that for a performance index of 5 we have the following degrees of membership:
μ_low_perf(5) = 0.0, μ_medium_perf(5) = 0.6, μ_high_perf(5) = 0.3
Example(7)
In classical logic: if the antecedents are true, then the implications are true In fuzzy logic there are several different approaches; we use min-max inference
For the antecedent of the first rule we get:
0.8 AND (0.0 OR 0.6) = min(0.8, max(0.0, 0.6)) = 0.6
Example(8)
The applicability of a scale-up is also described with the help of a linguistic fuzzy variable:
Example(9)
Using min-max inference the result set is cut off at the degree of the antecedent (for scale-up 0.6):
Example(10)
We use the left-most point of the maximal value to defuzzify the result In this case we say that a scale-up is applicable to a degree of 0.6 Assuming a similar fuzzy set describing the applicability of a scale-out, the second rule yields an applicability of 0.3 Since 0.6 > 0.3, we decide to scale up the service in this case
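A minimal sketch of the rule evaluation with min-max inference for this example; the membership degrees are taken directly from the slides (fuzzification of the raw inputs and the final defuzzification step are not shown).

# Membership degrees from the example (cpu load 90 %, performance index 5);
# in a full controller these would come from fuzzifying the raw measurements.
cpu_load = {'low': 0.0, 'medium': 0.0, 'high': 0.8}
perf_index = {'low': 0.0, 'medium': 0.6, 'high': 0.3}

AND, OR = min, max   # standard fuzzy intersection / union

# Rule 1: IF cpuLoad IS high AND (performanceIndex IS low OR medium) THEN scaleUp
scale_up = AND(cpu_load['high'], OR(perf_index['low'], perf_index['medium']))

# Rule 2: IF cpuLoad IS high AND performanceIndex IS high THEN scaleOut
scale_out = AND(cpu_load['high'], perf_index['high'])

print(scale_up, scale_out)   # 0.6 0.3 -> scale up the service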
Applications
The first application of fuzzy control comes from the work of Mamdani and Assilian (1975) Design of a fuzzy controller for a steam engine The objective was to maintain a constant speed by controlling the pressure on the pistons This was done by adjusting the heat supplied to a boiler
Applications(2)
Since then, a vast number of fuzzy controllers have been developed: Washing machines Video cameras Air conditioners Robot control Underground trains Hydro-electrical power plants
Chapter Summary
Fuzzy controllers have been very successful in commercial products Although critics argue that these applications are successful because they are quite simple (e.g. have small rule base) There have been attempts to merge fuzzy set theory and probability theory, however, there remain many open questions
Chapter 12
Outline
Can machines act intelligently? Can machines really think? Ethics and risks of developing AI
Terminology
Weak AI: the assertion that machines could possibly act intelligently (or act as if they were intelligent) Strong AI: the assertion that machines that do so are actually thinking Opinion of most AI researchers: take the weak AI hypothesis for granted, don't care about the strong AI hypothesis (as long as the programs work)
Weak AI
Some philosophers have tried to prove that AI is impossible Whether it is possible or impossible depends on how it is defined Engineering point of view: finding the best agent program on a given architecture Philosophical point of view: comparing two architectures: human and machine Traditionally posed as the question: can machines think? Unfortunately, there is no unambiguous definition of thinking
Strong AI
Main criticism: even if a machine passes the Turing test, is it actually thinking or just simulating the thinking process? Chinese room problem: a human (who doesn't understand Chinese) is put into a room with sheets of paper and detailed instructions Sheets of paper with Chinese writing are slipped under the door The human looks up in the instructions what to do, paints some characters on a paper, and slips it back From the outside this may seem like an intelligent agent understanding Chinese is at work
Automation
This is not a new problem; it happens every time a new technology is deployed Some people lose their jobs New jobs are created elsewhere Main problem: usually the new jobs demand higher qualifications
Leisure Time
Arthur C. Clarke once wrote that people might face a future of utter boredom There is no risk of that yet; due to integrated computerized systems that run 24/7, people tend to work longer hours Winner-Takes-All Society: Traditional industrial economy: working 10% more results in roughly 10% more profit Fast-paced information age economy: an edge of 10% over a competitor might mean 100% more profit
Uniqueness
AI research might suggest that human capabilities are not that unique after all Mankind has survived similar setbacks before: Copernicus moving the Earth out of the center of the universe Darwin putting Homo sapiens on the level of other species
Privacy Rights
Widespread wiretapping becomes possible Computer systems using language translation, speech recognition, and keyword search already sift through telephone, email and fax traffic There is an ongoing controversial debate about this: Scott McNealy (CEO Sun): "You have zero privacy anyway. Get over it." Louis Brandeis (Judge, 1890): "Privacy is the most comprehensive of all rights . . . the right to one's personality."
Accountability
What is the legal liability of an AI system? Who takes responsibility if something goes wrong? This is magnified when money changes hands: who is liable for any debts made by an intelligent agent? It may also play a role in life and death situations: when a physician uses a medical expert system, who is at fault if the diagnosis is wrong?
Lecture Summary
This lecture can only be seen as a brief introduction to the subject AI has made quite some progress in its short history The final word belongs to Alan Turing: "We can see only a short distance ahead, but we can see that much remains to be done."