Expert Systems
What is AI?
Artificial Intelligence (AI) is a very new field (the term was coined in 1956)
Introduction
Along with molecular biology, cited as the field "I would most like to be in". Quite a universal field: systematizes and automates intellectual tasks, potentially relevant to many human activities
Acting Humanly
Turing (1950): "Computing Machinery and Intelligence". Can machines think? Can machines behave intelligently? Turing Test:
[Figure: Turing Test — an interrogator converses with a human and an AI system without knowing which is which]
Acting Humanly(2)
Although researchers don't focus on passing the Turing Test, it is still valid. AI researchers concentrate on underlying issues. Artificial flight succeeded after turning away from imitating birds and looking at aerodynamics
[Table: four approaches to AI — reasoning vs. behavior, measured against human performance vs. an ideal (rational) standard]
All four approaches have been pursued
To pass test, computer would need following capabilities: language, knowledge, reasoning, understanding, learning
Thinking Humanly
1960s: cognitive revolution, information-processing psychology started replacing behaviorism. Requires scientific theories on internal activities of the brain. Cognitive science (top-down): predicting and testing human behavior. Cognitive neuroscience (bottom-up): identification from neurological data. Distinct approaches, although there is some overlap
Thinking Rationally
Aristotle: one of the first to codify laws of thought. From this the field of logic evolved. Provides patterns for argument that, given correct premises, yield correct conclusions. By 1965 there were programs that could (in principle) solve any solvable problem described in logical notation. Two main obstacles: difficult to state informal, uncertain knowledge in formal, logical notation; solving in principle doesn't mean solving in practice
Acting Rationally
Rational behavior: doing the right thing. Given the available information, try to achieve the best outcome. Doesn't necessarily involve thinking (e.g. reflexes). More general than thinking rationally (which is only concerned with correct inference). Also better suited for scientific development than human-based approaches: human-based approaches mimic behavior that is the result of a complicated and largely unknown evolutionary process. Therefore we focus on the approach of acting rationally
Rational Agents
An agent is an entity that perceives and acts (autonomously). Formally, an agent is a function from percept histories to actions: f : P* → A. We look for agents with the best performance for a given class of environments and tasks. Perfect rationality is unrealistic; aim for the best program for given resources
Foundations of AI
Philosophy: Can formal rules be used to draw valid conclusions? How does the mind arise from a physical brain? Where does knowledge come from? How does knowledge lead to action? Mathematics: What are the formal rules to draw valid conclusions? What can be computed? How do we reason with uncertain information?
Foundations of AI(2)
Economics How to make decisions to maximize payoff? What about others not going along? What if the payoff is far in the future? Neuroscience How do brains process information? Psychology How do humans and animals think and act?
Foundations of AI(3)
Computer engineering: How can we build efficient computers? Control theory and cybernetics: How can artifacts operate under their own control? Linguistics: How does language relate to thought?
History of AI
1952-1969: Early enthusiasm, great expectations 1966-1973: A dose of reality 1969-1979: Knowledge-based systems 1980-present: AI becomes an industry 1986-present: Return of neural networks 1987-present: AI becomes a science 1995-present: Intelligent agents
Chapter Summary
Different people think differently about AI. Are you concerned with thinking or behavior? Do you want to model humans or work from an ideal standard? We focus on rational action
Overview of Lecture
General introduction to AI Problem Solving Neural Networks Evolutionary Computing Swarm Intelligence Fuzzy Systems Social and Philosophical Implications of AI
Chapter 2
Intelligent Agents
Outline
Agents and environments Rationality PEAS (Performance measure, Environment, Actuators, Sensors) Environment types Agent types
Agents include humans, robots, and software programs. Remember: the agent function maps from percept histories to actions: f : P* → A. The agent program runs on a physical architecture to produce f
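As a minimal illustration of the distinction between the agent function f : P* → A and the agent program, the following Python sketch keeps the percept history inside the program and delegates action choice to a replaceable rule; the class and names are illustrative, not from the lecture.

# Minimal sketch of an agent program: it accumulates the percept history
# and maps it to an action via a user-supplied rule (the "agent function").
class Agent:
    def __init__(self, rule):
        self.rule = rule          # rule: list of percepts -> action
        self.percepts = []        # the percept history P*

    def step(self, percept):
        self.percepts.append(percept)
        return self.rule(self.percepts)

# Example rule: act only on the latest percept (a reflex agent).
reflex_rule = lambda history: "Suck" if history[-1][1] == "Dirty" else "Right"

agent = Agent(reflex_rule)
print(agent.step(("A", "Dirty")))   # -> Suck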
Example: the vacuum-cleaner world. Percepts: location and contents, e.g. [A, Dirty]. Actions: Left, Right, Suck, Do Nothing
Rationality
What is rational at any given time depends on: the performance measure that defines success, the agent's prior knowledge of the environment, the actions the agent can perform, and the agent's percept history to date. Rational agent: for each possible percept history, selects an action that is expected to maximize the performance measure given the history and built-in knowledge. Is our vacuum-cleaning function rational?
Rationality(2)
Depends on the situation. Assume the following performance measure: one point per clean location per time step. The environment is known to consist only of locations A and B. Clean locations stay clean; sucking cleans dirty locations. We only have the aforementioned four actions. The agent correctly perceives location and status. In this case it is rational. Is it still rational if we give a penalty of one point per move?
Rationality(3)
Rational ≠ omniscient: perceptions may not supply all relevant information. Rational ≠ clairvoyant: outcomes of actions may not be as expected. Hence, rational does not always mean successful. Being rational is also about exploration (doing actions to modify future perceptions), learning (changing prior knowledge due to perceptions), and autonomy (not relying on the designer's input alone)
PEAS
When designing a rational agent, we need to consider the performance measure, environment, actuators, sensors. Example: designing an automated taxi. Performance measure: safety, destination, profits, legality, comfort, ... Environment: streets and motorways, traffic, pedestrians, weather, day-/nighttime, ... Actuators: steering, accelerator, brake, horn, speaker/display, ... Sensors: video, accelerometer, gauges, engine sensors, keyboard, microphone, GPS, ...
PEAS
Example: Internet shopping. Performance measure: price, quality, appropriateness, efficiency. Environment: current and future web sites of vendors and shippers. Actuators: displaying to the user, following URLs, filling in forms. Sensors: HTML pages (text, graphics, scripts)
Environment Types
Fully observable vs. partially observable: sensors are able to detect all relevant aspects / noisy or inaccurate sensors. Deterministic vs. stochastic: next state completely defined by current state / uncertainty. Episodic vs. sequential: future episodes do not depend on decisions in previous ones / short-term actions can have long-term consequences
Environment Types(2)
Static vs. dynamic: no change in environment while the agent is deliberating / change in environment. Discrete vs. continuous: finite number of distinct states / smooth range of values. Single agent vs. multi-agent: agent working by itself / agent has to compete or cooperate with other agents
Agent Types
Four basic types in order of increasing generality: simple reflex agent, reflex agent with state, goal-based agent, utility-based agent. All of them can be turned into learning agents
Example
The vacuum agent from before is a simple reflex agent:
function VACUUM-AGENT([location, status])
    if status = Dirty then return Suck
    else if location = A then return Right
    else return Left
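A direct Python transcription of this pseudocode might look as follows (a sketch; the (location, status) tuple mirrors the [location, status] percept convention above):

def vacuum_agent(percept):
    """Simple reflex vacuum agent: percept is a (location, status) pair."""
    location, status = percept
    if status == "Dirty":
        return "Suck"
    elif location == "A":
        return "Right"
    else:
        return "Left"

print(vacuum_agent(("A", "Clean")))   # -> Right
print(vacuum_agent(("B", "Dirty")))   # -> Suck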
[Figure: simple reflex agent — sensors perceive the environment, condition-action rules select an action, actuators act on the environment]
Example
cleaned[A], cleaned[B] = false

function VACUUM-AGENT([location, status])
    if status = Dirty then
        cleaned[location] = true
        return Suck
    else if location = A then
        cleaned[A] = true
        if cleaned[B] then return Do Nothing
        else return Right
    else
        cleaned[B] = true
        if cleaned[A] then return Do Nothing
        else return Left
Goal-based Agent
Knowing about the current state is not always enough Correct decision depends on what our goal is Agent has to determine if action will bring it towards goal Doing this involves planning Vacuum-cleaning agent had goal hardwired into program: clean up all locations (so no planning involved)
[Figure: model-based reflex agent — sensors, internal state ("what my actions do"), condition-action rules, actuators, environment]
Goal-based Agent(2)
[Figure: goal-based agent — sensors update the state ("how the world evolves", "what the world is like now", "what it will be like if I do action A"), which is compared against goals to choose an action]
Utility-based Agent
Goals alone are not enough. E.g. there are many ways for a taxi to get to its destination, but some are quicker, safer, more reliable, cheaper, ... How happy does a state make an agent? "Happy" doesn't sound very scientific, therefore it's called (high) utility
Utility-based Agent(2)
[Figure: utility-based agent — sensors, state, "how the world evolves", "what my actions do", "what it will be like if I do action A", "how happy I will be in such a state" (utility), action choice, actuators, environment]
Learning Agent
How do programs for selecting actions come into being? Programming everything by hand is very laborious Alternative: build machines that can learn and then teach them
Learning Agent(2)
[Figure: learning agent — a performance standard and critic provide feedback; the learning element sets learning goals, modifies the performance element's knowledge, and uses a problem generator to suggest exploratory actions; sensors and actuators connect the agent to the environment]
Chapter Summary
Agents interact with environments through actuators and sensors
An agent function f describes what the agent does in all circumstances A performance measure evaluates the sequence of actions and its effects A rational agent maximizes the expected performance
Chapter 3
Outline
Problem-solving agents Problem formulation Example problems
Problem-solving Agents
The simplest agents discussed so far are reflex agents. They use a direct mapping from states to actions, which is unsuitable for very large mappings. Goal-based agents consider future actions and their outcomes. Problem-solving agents are one kind of goal-based agent: they find action sequences that lead to desirable states
Problem-solving Agents(2)
seq: an action sequence, initially empty
state: a description of the current world state
goal: a goal, initially null
problem: a problem formulation

function SIMPLE-PROBLEM-SOLVER(perception)
    state = UPDATE-STATE(state, perception)
    if seq is empty then
        goal = FORMULATE-GOAL(state)
        problem = FORMULATE-PROBLEM(state, goal)
        seq = SEARCH(problem)
    action = FIRST(seq)
    seq = REST(seq)
    return action
Problem-solving Agents(3)
The agent on the previous slide does offline problem solving: a simple formulate, search, execute design. When executing the sequence of actions, it ignores its perceptions, assuming that the solution that was found will always work
Example: Romania
On holiday in Romania; currently in Arad, flight leaves tomorrow from Bucharest. Formulate goal: be in Bucharest in time. Formulate problem: states are being in the various cities; actions are driving between cities. Find a solution: a sequence of cities, e.g. Arad, Sibiu, Fagaras, Bucharest. Execute the solution
Problem Formulation
A problem is defined by four components. Initial state: the state in which the agent starts, e.g. In(Arad). Successor function S(x): description of the possible actions and their outcomes, e.g. S(Arad) = { ⟨Go(Sibiu), In(Sibiu)⟩, ... }. Goal test: determines if a given state is a goal state, e.g. explicit: In(Bucharest); implicit: king checkmated. Path cost: function that assigns a numeric cost to each path (reflects the performance measure), e.g. route distances between cities
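The four components map directly onto code. The sketch below encodes a small fragment of the Romania example (only a few roads, with the distances shown on the lecture map); the class layout and names are illustrative choices, not from the lecture.

class RouteProblem:
    """A search problem: initial state, successor function, goal test, path cost."""
    def __init__(self, roads, start, goal):
        self.roads, self.start, self.goal = roads, start, goal

    def successors(self, city):
        # Returns (action, next_state, step_cost) triples.
        return [(f"Go({n})", n, d) for n, d in self.roads.get(city, [])]

    def goal_test(self, city):
        return city == self.goal

# A small fragment of the Romania road map.
roads = {
    "Arad":    [("Sibiu", 140), ("Timisoara", 118), ("Zerind", 75)],
    "Sibiu":   [("Fagaras", 99), ("Rimnicu Vilcea", 80)],
    "Fagaras": [("Bucharest", 211)],
}
problem = RouteProblem(roads, "Arad", "Bucharest")
print(problem.successors("Arad"))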
[Figure: road map of Romania with driving distances between cities — Arad, Zerind, Oradea, Sibiu, Timisoara, Lugoj, Mehadia, Dobreta, Craiova, Rimnicu Vilcea, Fagaras, Pitesti, Bucharest, Giurgiu, Urziceni, Hirsova, Eforie, Vaslui, Iasi, Neamt]
Problem Formulation(2)
At the moment we assume a single-state problem Deterministic, fully observable environment Agent knows exactly which state it will be in Solution is a sequence of actions from initial state to a goal state Solution quality is measured by path cost Optimal solution has lowest path cost among all solutions
Example: 8-puzzle
A 3 × 3 board with eight numbered tiles and a blank space
[Figure: 8-puzzle — a sample start state and the goal state]
Example: 8-puzzle
States: specify the location of each tile and the blank in one of the nine squares. Initial state: any state can be designated as the initial state. Successor function: generates the legal states that result from the four actions (blank moves Left, Right, Up, Down). Goal test: checks whether the configuration matches the goal state from the previous slide. Path cost: let's assume each step costs 1, so path cost = number of steps in the solution
Real-world Examples
Touring problems: traveling salesman, delivery tours. VLSI layout: positioning millions of components and connections on a chip (minimizing area, circuit delays, stray capacitance, ...). Robot navigation: no discrete set of routes, continuous space. Automatic assembly sequencing: find an order in which to assemble the parts of an object. Protein design: find a sequence of amino acids that will fold into a three-dimensional protein with the right properties. Internet searching: looking for relevant information on the Web
[Figure: partial search trees for the Romania route-finding problem — successive expansions from Arad generate Sibiu, Timisoara, and Zerind, then Sibiu's successors Arad, Fagaras, Oradea, and Rimnicu Vilcea, and so on]
20 states, one for each city. Infinite number of paths, hence an infinite number of nodes (a good search algorithm avoids this). A node in the search tree is a bookkeeping data structure; a state corresponds to a configuration of the world. It is a mere convenience in the example that nodes are named after states
Search Strategies
A strategy is defined by picking the order of node expansion. Uninformed strategies use only the information available in the problem definition: breadth-first search, uniform-cost search, depth-first search, depth-limited search, iterative deepening search
Breadth-first Search
Expand the shallowest unexpanded node. Implementation: the fringe is a FIFO queue, i.e., new successors are added at the end.
[Figure: step-by-step breadth-first expansion of a binary tree with nodes A-G]
Quality of Strategy
How do we measure the quality of a search strategy? Four aspects. Completeness: does the algorithm find a solution if there is one? Optimality: is the strategy able to find an optimal solution? Time complexity: how long does it take to find a solution? Space complexity: how much memory is needed to perform the search?
Quality of Strategy(2)
Complexity is expressed in terms of three quantities Branching factor (b): maximum number of successors of any node Depth (d): depth of the shallowest goal node Maximum length of any path (m) in search space
Uniform-cost Search(UCS)
Expand least-cost unexpanded node Implementation: fringe is a queue ordered by path cost Equivalent to BFS if step costs are all equal
Depth-first Search
Expand the deepest unexpanded node. Implementation: the fringe is a LIFO queue (or stack), i.e., new successors are added at the front.
[Figure: step-by-step depth-first expansion of a binary tree with nodes A-O]
Repeated states: the only way to avoid them is to keep more nodes in memory. Fundamental tradeoff between time and space
Graph Search
function GRAPH-SEARCH(problem, fringe)
    closed = empty set
    fringe = INSERT(INITIAL-STATE(problem), fringe)
    loop do
        if fringe is empty then return failure
        node = REMOVE-FIRST(fringe)
        if GOAL-TEST(problem, STATE(node)) then return node
        if STATE(node) is not in closed then
            add STATE(node) to closed
            fringe = INSERT-ALL(EXPAND(node, problem), fringe)
    end
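A breadth-first instantiation of GRAPH-SEARCH in Python could look like this (a sketch; the graph, goal test, and successor function below are illustrative assumptions, and the FIFO fringe is what makes it breadth-first):

from collections import deque

def breadth_first_graph_search(start, goal_test, successors):
    """GRAPH-SEARCH with a FIFO fringe: returns the list of actions to a goal, or None."""
    fringe = deque([(start, [])])          # fringe holds (state, actions-so-far)
    closed = set()                         # states already expanded
    while fringe:
        state, path = fringe.popleft()     # REMOVE-FIRST on a FIFO queue
        if goal_test(state):
            return path
        if state not in closed:
            closed.add(state)
            for action, nxt in successors(state):
                fringe.append((nxt, path + [action]))
    return None

# Tiny example graph (states A..G arranged as a binary tree).
tree = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F", "G"]}
succ = lambda s: [(f"go {c}", c) for c in tree.get(s, [])]
print(breadth_first_graph_search("A", lambda s: s == "G", succ))  # ['go C', 'go G']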
Chapter Summary
This chapter dealt with problem formulation and (simple) search strategies. Problem formulation involves abstracting away irrelevant real-world details for feasibility reasons. There is a variety of uninformed search strategies. IDS (iterative deepening search) is often the preferred search method for large search spaces and unknown goal depths. Graph search can be exponentially more efficient than tree search
Chapter 4
Outline
Using problem-specific knowledge, searching can be improved considerably. We look at the following strategies in this chapter: best-first search, greedy search, A* search, heuristics
function GEN-TREE-SEARCH(problem, fringe)
    fringe = INSERT(INITIAL-STATE(problem), fringe)
    loop do
        if fringe is empty then return failure
        node = REMOVE-FIRST(fringe)
        if GOAL-TEST(problem, STATE(node)) then return node
        fringe = INSERT(EXPAND(node, problem), fringe)
    end
Best-first Search
Main idea: node is selected for expansion based on an evaluation function f (n) Usually node with lowest score is selected (as f (n) normally measures distance to goal) Implementation: fringe is a queue sorted in ascending order of evaluation scores
Best-first Search(2)
It is rarely possible to find a perfect evaluation function f(n). There is a whole family of best-first search algorithms. The key component is a heuristic function h(n) that estimates the cost of the cheapest path from node n to a goal. E.g. for the Romanian route h(n) could be the straight-line distance from n to Bucharest. We look at two members of the family: greedy search and A* search
Greedy Search
Evaluation function f(n) = h(n). Greedy search expands the node that appears to be closest to the goal. For the Romanian route example we choose h_SLD(n) = straight-line distance from n to Bucharest
[Figure: Romania map annotated with straight-line distances to Bucharest; greedy best-first expansion from Arad via Sibiu (h = 253)]
Properties of greedy search: complete in finite spaces with checking for repeated states. Optimality: no (only optimal for problems with the matroid property). Time complexity: O(b^m), but a good heuristic can give dramatic improvement. Space complexity: O(b^m) (keeps all nodes in memory)
[Figure: greedy best-first search tree from Arad to Bucharest with h_SLD values — Arad 366, Sibiu 253, Fagaras 176, Rimnicu Vilcea 193, Oradea 380, Bucharest 0]
A* Search
Idea: avoid expanding paths that are already expensive. Evaluation function f(n) = g(n) + h(n)
g(n) = cost so far to reach n; h(n) = estimated cost from n to the goal; f(n) = estimated cost of the path through n to the goal
A* Search(2)
A* search is optimal if we use an admissible heuristic. This means that, if we are wrong, we only underestimate the true cost: h(n) ≤ h*(n), where h*(n) is the true cost. We also require h(G) = 0 for any goal G and h(n) > 0 otherwise. For our example, h_SLD(n) is an admissible heuristic
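A compact A* sketch in Python, ordering the fringe by f = g + h with a heap (the graph fragment and straight-line-distance table below are taken from the Romania example; everything else is an illustrative assumption):

import heapq

def astar(graph, h, start, goal):
    """A*: graph maps state -> list of (neighbor, step_cost); h maps state -> estimate."""
    fringe = [(h[start], 0, start, [start])]      # entries are (f, g, state, path)
    best_g = {start: 0}
    while fringe:
        f, g, state, path = heapq.heappop(fringe)
        if state == goal:
            return path, g
        for nxt, cost in graph.get(state, []):
            g2 = g + cost
            if g2 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g2
                heapq.heappush(fringe, (g2 + h[nxt], g2, nxt, path + [nxt]))
    return None, float("inf")

# Fragment of the Romania map with straight-line distances to Bucharest.
graph = {"Arad": [("Sibiu", 140), ("Timisoara", 118)],
         "Sibiu": [("Fagaras", 99), ("Rimnicu", 80)],
         "Fagaras": [("Bucharest", 211)],
         "Rimnicu": [("Pitesti", 97)],
         "Pitesti": [("Bucharest", 101)]}
h = {"Arad": 366, "Sibiu": 253, "Timisoara": 329, "Fagaras": 176,
     "Rimnicu": 193, "Pitesti": 100, "Bucharest": 0}
print(astar(graph, h, "Arad", "Bucharest"))   # Arad-Sibiu-Rimnicu-Pitesti-Bucharest, cost 418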
A* Search Example
[Figure: step-by-step A* expansion from Arad, showing f = g + h values such as Arad 366 = 0 + 366, Sibiu 393 = 140 + 253, Timisoara 447 = 118 + 329, Zerind 449 = 75 + 374]
Optimality of A*
Suppose some suboptimal goal G2 has been generated and is in the queue, and let n be an unexpanded node on a shortest path to an optimal goal G1
[Figure: continuation of the A* search tree with f-values (e.g. Bucharest 450 = 450 + 0 via Fagaras, 418 = 418 + 0 via Pitesti); a suboptimal goal G2 sits in the fringe alongside nodes on the path to the optimal goal G1]
Optimality of A*(2)
f(G2) = g(G2)   since h(G2) = 0
      > g(G1)   since G2 is suboptimal
      ≥ f(n)    since h is admissible
Hence f(G2) > f(n), so A* will never select G2 for expansion before n
Heuristic Functions
We'll now shed some light on h(n) in general. Let's take another look at the 8-puzzle
[Figure: 8-puzzle start state and goal state]
Average solution cost for a randomly generated instance is about 22 steps Branching factor is about 3
Heuristic Functions(2)
Exhaustive search would look at about 3.1 × 10^10 states. Keeping track of repeated states, this can be cut down to 181,440. For the 8-puzzle it's manageable; for the 15-puzzle the corresponding number is roughly 10^13. We need a good heuristic function
Heuristic Functions(3)
Commonly used candidates:
h1(n) = the number of misplaced tiles
h2(n) = the total Manhattan distance (i.e., the number of squares from its desired location, summed over all tiles)
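Both heuristics are straightforward to compute; a small sketch (states are represented here as a tuple of nine entries with 0 for the blank, and the goal layout 0-8 in reading order is just an illustrative convention):

GOAL = (0, 1, 2, 3, 4, 5, 6, 7, 8)   # assumed goal layout, 0 = blank

def h1(state):
    """Number of misplaced tiles (blank not counted)."""
    return sum(1 for i, tile in enumerate(state) if tile != 0 and tile != GOAL[i])

def h2(state):
    """Total Manhattan distance of each tile from its goal square."""
    dist = 0
    for i, tile in enumerate(state):
        if tile == 0:
            continue
        g = GOAL.index(tile)
        dist += abs(i // 3 - g // 3) + abs(i % 3 - g % 3)
    return dist

s = (7, 2, 4, 5, 0, 6, 8, 3, 1)       # a sample start state
print(h1(s), h2(s))                   # -> 8 18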
Heuristic Functions(4)
Both h1 and h2 are admissible. How efficient are they? Typical search costs:
d = 14: IDS = 3,473,941 nodes; A*(h1) = 539 nodes; A*(h2) = 113 nodes
d = 24: IDS ≈ 54,000,000,000 nodes; A*(h1) = 39,135 nodes; A*(h2) = 1,641 nodes
Quality of Heuristics
h2 seems to be better than h1
Is this always the case? Yes, as h2 dominates h1, i.e., h2(n) ≥ h1(n) for all n. In terms of efficiency this means that A* using h2 will never expand more nodes than A* using h1. If you are not sure about dominance: given any admissible heuristics h_a and h_b, h(n) = max(h_a(n), h_b(n)) is also admissible and dominates both h_a and h_b
Chapter Summary
Heuristic functions estimate costs of shortest paths. Good heuristics can dramatically reduce search cost. Greedy search expands the node with the lowest h(n): incomplete and not always optimal. A* search expands the node with the lowest g(n) + h(n): complete and optimal, and also very efficient. Admissible heuristics can be derived by relaxing problems
Chapter 5
Hill climbing
Outline
Simulated Annealing Local Beam Search
Motivation
The search algorithms up to now memorize the path from the initial state to the goal. In many problems the path is irrelevant; we are only interested in a solution (e.g. the 8-queens problem). This class of problems includes many important applications: integrated-circuit design, factory-floor layout, job scheduling, network optimization, vehicle routing, portfolio management
Variants of this iterative-improvement approach (e.g. pairwise exchanges for the traveling salesman problem) get within 1% of the optimum very quickly (for thousands of cities)
Example: n-queens
Put n queens on an n × n board with no two queens sharing a row, column, or diagonal. Iterative improvement: move a queen to reduce the number of conflicts
Hill-climbing Search
Simply a loop that continues moving in the direction of increasing value Terminates when it reaches a peak (where no neighboring state has a higher value) Does not look beyond immediate neighbors of current state (greedy local search) Like climbing Everest in thick fog with amnesia
Hill-climbing Search(2)
function HILL-CLIMBING(problem)
    current = INITIAL-STATE(problem)
    loop do
        neighbor = highest-valued successor of current
        if VALUE(neighbor) <= VALUE(current) then return STATE(current)
        current = neighbor
    end
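A Python rendering of the same loop for the n-queens formulation (one queen per column; here we minimize the number of attacking pairs, which is equivalent to maximizing its negation — the representation is an illustrative choice):

import random

def conflicts(state):
    """Number of attacking pairs; state[c] is the row of the queen in column c."""
    n = len(state)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if state[i] == state[j] or abs(state[i] - state[j]) == j - i)

def hill_climbing(state):
    """Move to the best neighbor until no neighbor improves (may stop at a local optimum)."""
    while True:
        neighbors = [state[:c] + (r,) + state[c + 1:]
                     for c in range(len(state)) for r in range(len(state)) if r != state[c]]
        best = min(neighbors, key=conflicts)
        if conflicts(best) >= conflicts(state):
            return state
        state = best

start = tuple(random.randrange(8) for _ in range(8))
solution = hill_climbing(start)
print(solution, conflicts(solution))   # conflict count 0 unless a local optimum was hit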
[Figure: 8-queens states with h = 5, h = 2, and h = 0 conflicts; state-space landscape showing the current state, local maxima, and the global maximum]
Simulated Annealing(2)
function SIM-ANNEALING(problem, schedule)
    current = INITIAL-STATE(problem)
    t = 1
    loop do
        temperature = schedule[t]
        if temperature = 0 then return current
        next = randomly selected successor of current
        diff = VALUE(next) - VALUE(current)
        if diff > 0 then current = next
        else current = next only with probability e^(diff / temperature)
        t = t + 1
    end
Simulated Annealing(3)
Vivid description: getting a ping-pong ball into the deepest crevice of a bumpy surface (the hill-climbing picture turned upside down). Left alone, the ball will roll into a local minimum. If we shake the surface, we can bounce the ball out of a local minimum. The trick is to shake hard enough to get it out of local minima, but not hard enough to dislodge it from the global one. We start by shaking hard and then gradually reduce the intensity of shaking
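The shaking metaphor translates into a few lines of Python; a sketch that minimizes the n-queens conflict count, with an exponentially decaying temperature schedule (the schedule and constants are arbitrary illustrative choices, not from the lecture):

import math, random

def conflicts(state):
    """Attacking queen pairs; state[c] is the row of the queen in column c."""
    return sum(1 for i in range(len(state)) for j in range(i + 1, len(state))
               if state[i] == state[j] or abs(state[i] - state[j]) == j - i)

def simulated_annealing(state, steps=20000, t0=2.0, decay=0.999):
    """Accept downhill moves with probability exp(diff / temperature); temperature decays each step."""
    temperature = t0
    for _ in range(steps):
        if conflicts(state) == 0 or temperature < 1e-6:
            break
        c, r = random.randrange(len(state)), random.randrange(len(state))
        nxt = state[:c] + (r,) + state[c + 1:]
        diff = conflicts(state) - conflicts(nxt)        # > 0 if the move is an improvement
        if diff > 0 or random.random() < math.exp(diff / temperature):
            state = nxt
        temperature *= decay
    return state

print(conflicts(simulated_annealing(tuple(random.randrange(8) for _ in range(8)))))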
Chapter Summary
We covered search algorithms that do not care about path from initial state to goal Only solutions are relevant for local search algorithms
Chapter 6
Outline
Constraint Satisfaction Problems (CSP) examples Backtracking search for CSPs Problem structure and problem decomposition
Motivation
Up to now the states in search spaces were black boxes to the search algorithms, only accessible by problem-specific routines: the successor function, heuristic function, and goal test. The search algorithm itself had no knowledge about the internals of the states. We now look at CSPs, whose states and goal tests conform to a standard, structured, and simple representation. Consequence: search algorithms can use general-purpose rather than problem-specific heuristics
Representation of a CSP
A CSP is defined by a set of variables X1, X2, ..., Xn and a set of constraints C1, C2, ..., Cm. Each variable has a domain Di of possible values. Each constraint specifies the allowable combinations of values for some subset of the variables. An assignment not violating any constraints is called consistent (or legal). A consistent, complete assignment (involving every variable) is a solution. Some CSPs also require a solution that maximizes an objective function
Example: Map-coloring
[Figure: map of Australia showing Western Australia, Northern Territory, South Australia, Queensland, New South Wales, Victoria, and Tasmania]
Variables: WA, NT, Q, NSW, V, SA, T Domains: Di = {red, green, blue} Constraints: adjacent regions must have different colors
Example: Map-coloring(2)
Formal description of the constraints: WA ≠ NT, WA ≠ SA, NT ≠ SA, NT ≠ Q, ... Or, depending on the description language allowed: (WA, NT) ∈ { (red, green), (red, blue), (green, red), ... }, (WA, SA) ∈ { (red, green), (red, blue), (green, red), ... }, (NT, SA) ∈ { (red, green), (red, blue), (green, red), ... }. We now have to find a complete assignment that does not violate any constraint
Example: Map-coloring(3)
Possible solution: { WA=red, NT=green, Q=red, NSW=green, V=red, SA=blue, T=green }
Constraint Graph
Binary CSP: each constraint relates at most two variables Constraint graph: nodes are variables, edges are constraints
[Figure: constraint graph for the Australia map-coloring problem — nodes WA, NT, SA, Q, NSW, V, T with edges between adjacent regions]
Constraint Graph(2)
General-purpose CSP algorithms use the graph structure. It is a relatively simple way to describe a CSP and can speed up search; e.g. Tasmania is an independent subproblem
Varieties of Constraints
Unary constraints involve a single variable, e.g. SA ≠ green. Binary constraints involve a pair of variables, e.g. SA ≠ WA. Higher-order constraints involve three or more variables, e.g. cryptarithmetic column constraints (example in just a moment). Preferences (soft constraints), e.g. red is better than green; often represented by costs for variable assignments, also called constrained optimization problems
Example: Cryptarithmetic
  T W O
+ T W O
--------
F O U R
Variables: F, T, U, W, R, O plus the column carry digits X1, X2, X3
Varieties of CSPs
Discrete variables. Finite domains of size d: O(d^n) complete assignments with n variables; e.g. Boolean CSPs: Boolean satisfiability (NP-complete). Infinite domains (integers, strings, etc.): e.g. job scheduling, where variables are start/end days for each job; formulated via a constraint language, e.g. StartJob1 + 5 ≤ StartJob3. Linear constraints are solvable, nonlinear constraints are undecidable (in the general case)
Varieties of CSPs(2)
Continuous variables E.g. start/end times for Hubble Telescope observations Linear constraints solvable in polynomial time by Linear Programming methods Very common in the real world, widely studied in Operations Research
Real-world CSPs
Assignment problems, e.g. who teaches what class? Timetabling problems, e.g. which train arrives and leaves when and where? Transportation scheduling, e.g. which vehicle leaves when and where and carries which goods with it? Usually very hard to solve
Backtracking Search
function BACKTRACK(csp)
    return RECURSIVE-BACKTRACK({}, csp)

function RECURSIVE-BACKTRACK(assignment, csp)
    if assignment is complete then return assignment
    var = SELECT-UNASSIGNED(VARIABLES(csp), assignment, csp)
    for each value in DOMAIN(var, assignment, csp) do
        if value is consistent with CONSTRAINTS(csp) then
            add {var = value} to assignment
            result = RECURSIVE-BACKTRACK(assignment, csp)
            if result <> failure then return result
            remove {var = value} from assignment
    end
    return failure
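A compact Python version of recursive backtracking, applied to the Australia map-coloring CSP (variable ordering and data layout are illustrative choices):

def backtrack(assignment, variables, domains, neighbors):
    """Depth-first search assigning one variable per level; backtracks on violation."""
    if len(assignment) == len(variables):
        return assignment
    var = next(v for v in variables if v not in assignment)
    for value in domains[var]:
        if all(assignment.get(n) != value for n in neighbors[var]):   # consistency check
            assignment[var] = value
            result = backtrack(assignment, variables, domains, neighbors)
            if result is not None:
                return result
            del assignment[var]
    return None

variables = ["WA", "NT", "SA", "Q", "NSW", "V", "T"]
domains = {v: ["red", "green", "blue"] for v in variables}
neighbors = {"WA": ["NT", "SA"], "NT": ["WA", "SA", "Q"], "SA": ["WA", "NT", "Q", "NSW", "V"],
             "Q": ["NT", "SA", "NSW"], "NSW": ["Q", "SA", "V"], "V": ["SA", "NSW"], "T": []}
print(backtrack({}, variables, domains, neighbors))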
Backtracking Example
[Figure: step-by-step backtracking assignment of colors to the Australia map]
Improving Efficiency
Backtracking can be improved in terms of efficiency by looking at: Which variable should be assigned next? In what order should its values be tried? Can we detect inevitable failure early? Can we take advantage of the problem structure?
Degree Heuristic
If there are ties among minimum-remaining-values (MRV) variables, use the degree heuristic: choose the variable with the most constraints on remaining variables
Forward Checking
Idea: keep track of remaining legal values for unassigned variables Terminate branch of search when any variable has no legal values
[Figure: forward checking on the map-coloring problem — after each assignment the remaining legal values of WA, NT, Q, NSW, V, SA, T shrink; a variable with no legal values left signals failure]
Constraint Propagation
Forward checking propagates information from assigned to unassigned variables, but doesn't provide early detection for all failures. E.g. NT and SA cannot both be blue:
Constraint Propagation(2)
Constraint propagation repeatedly enforces constraints locally. Forward checking propagates from WA and Q onto NT and SA. We want to continue by propagating onto the constraint between NT and SA. And we want to do this efficiently: reducing the search space is no good if it takes longer than simple search
Arc Consistency
A method of constraint propagation that is stronger than forward checking. "Arc" refers to a (directed) edge in the constraint graph, e.g. there is an arc from SA to NSW. An arc is consistent iff for every value x of SA, there is some value y of NSW that is consistent with x
Arc Consistency(2)
The simplest form of propagation makes each arc consistent. If we find a value x for which no consistent y exists, we delete x. E.g. the arc from NSW to SA
Arc Consistency(3)
If a variable loses a value, neighbors of this variable need to be rechecked E.g. arc from V to NSW
Arc Consistency(4)
Arc consistency detects failure earlier than forward checking E.g. arc from SA to NT
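Making every arc consistent and re-queuing the neighbors of any variable that loses a value is the idea behind AC-3; a minimal Python sketch for "not-equal" constraints (data layout and the small example are illustrative, chosen to reproduce the WA = red, Q = green failure mentioned above):

from collections import deque

def revise(domains, x, y):
    """Remove values of x that have no consistent value in y (constraint: x != y)."""
    removed = [vx for vx in domains[x] if all(vx == vy for vy in domains[y])]
    for vx in removed:
        domains[x].remove(vx)
    return bool(removed)

def ac3(domains, neighbors):
    """Enforce arc consistency; returns False if some domain becomes empty."""
    queue = deque((x, y) for x in domains for y in neighbors[x])
    while queue:
        x, y = queue.popleft()
        if revise(domains, x, y):
            if not domains[x]:
                return False
            for z in neighbors[x]:
                if z != y:
                    queue.append((z, x))      # recheck the neighbors of x
    return True

# With WA = red and Q = green, arc consistency leaves SA with an empty domain.
domains = {"WA": ["red"], "Q": ["green"], "NT": ["green", "blue"],
           "SA": ["red", "green", "blue"], "NSW": ["red", "green", "blue"]}
neighbors = {"WA": ["NT", "SA"], "NT": ["WA", "SA", "Q"], "SA": ["WA", "NT", "Q", "NSW"],
             "Q": ["NT", "SA", "NSW"], "NSW": ["Q", "SA"]}
print(ac3(domains, neighbors))   # -> False (failure detected early)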
Problem Structure
The structure of the constraint graph can often be exploited. We are going to look at two techniques: independent subproblems and tree-structured problems
Independent Subproblems
Independent Subproblems(2)
Each of these (smaller) subproblems can be solved independently of the others. Performance gains can be quite high: suppose each subproblem has c variables out of n total; the worst-case solution cost is (n/c) · d^c, compared with d^n for the whole problem. Assume n = 80, d = 2, c = 20 and 10 million nodes/sec processing speed. Whole problem: 2^80 ≈ 4 billion years. All subproblems: 4 · 2^20 ≈ 0.4 seconds
Tree-structured CSPs
[Figure: tree-structured constraint graph with nodes A-F]
If the constraint graph has no loops, the CSP can be solved in O(n d^2) time (instead of worst-case O(d^n))
Tasmania and the mainland are independent subproblems, identifiable as connected components of the constraint graph
Tree-structured CSPs(2)
Choose a variable as root, order variables from root to leaves (parent precedes all children)
For j from n down to 2, apply REMOVE-INCONSISTENT(Parent(Xj), Xj). For j from 1 to n, assign Xj consistently with Parent(Xj)
Cutset conditioning: instantiate (in all possible ways) a set of variables such that the remaining constraint graph is a tree
Example: 4-queens
States: 4 queens in 4 columns (4^4 = 256 states). Successor function: move a queen up or down within its column. Goal test: no attacking queens. Evaluation function: h(n) = number of attacks
Performance
Given a random initial state, we can solve n-queens in almost constant time for large n with high probability. In general this works very well for any randomly generated CSP. Exceptions are problems in a narrow range of the ratio R = (number of constraints) / (number of variables)
Chapter Summary
CSPs are a special kind of problem: states are defined by values of a fixed set of variables, and the goal test is defined by constraints on variable values. Backtracking = depth-first search with one variable assignment per node. Various techniques exist to improve performance. Alternative: local search with the min-conflicts heuristic, usually efficient in practice
[Figure: n-queens states with h = 5, 2, 0; CPU time for randomly generated CSPs peaks near a critical value of the ratio R]
Chapter 7
Games Perfect play
Outline
Motivation
Up to now search problems were hard, but nobody was working against us. How do we plan if other agents are planning against us? Games are an ideal domain for exploring the capabilities of AI in terms of adversarial search: the rules are fixed, the scope of the problem is constrained, and the interactions between players are well defined. Yet the problems are far from simple. Can be seen as the Formula 1 of AI research
Adversarial Search
Minimax decisions, α-β pruning, resource limits and approximate evaluation, games of chance, games of imperfect information
Types of Games
                        deterministic            random element
perfect information:    chess, checkers, go, othello    backgammon, monopoly
imperfect information:  battleships, stratego           bridge, poker, scrabble
Representation
We'll first consider games with two players, called MAX and MIN. A game can be formally defined as a kind of search problem. Initial state: includes the board position and identifies the player to move. Successor function: returns a list of legal moves and the resulting states. Terminal test: determines when the game is over (terminal states are states where the game has ended). Utility function: gives numeric values for terminal states (e.g. win = +1, loss = -1, draw = 0)
Representation(2)
[Figure: game tree for Tic-Tac-Toe — MAX (X) and MIN (O) alternate moves down to terminal states with utilities -1, 0, +1]
Minimax
In normal search, optimal solution is a sequence of moves leading to a goal (each terminal state is a win) In a game, however, opponent MIN has something to say about it
Minimax(2)
Idea: choose move to position with highest minimax values Best achievable payoff against best play
We'll first look at deterministic, perfect-information games. MAX must find a contingent strategy: specify MAX's move in the initial state, then MAX's moves in the states resulting from every possible response by MIN, then MAX's moves replying to MIN's responses to those moves, and so on
[Figure: a two-ply game tree — MAX chooses among moves A1, A2, A3; the minimax values of the three MIN nodes are 3, 2, and 2, so MAX's minimax decision is the move with value 3]
Minimax(3)
function MINIMAX-DECISION(state)
    return the a in ACTIONS(state) maximizing MIN-VALUE(RESULT(a, state))

function MIN-VALUE(state)
    if TERMINAL-TEST(state) then return UTILITY(state)
    v = +infinity
    for a, s in SUCCESSORS(state) do
        v = MIN(v, MAX-VALUE(s))
    end
    return v

function MAX-VALUE(state)
    if TERMINAL-TEST(state) then return UTILITY(state)
    v = -infinity
    for a, s in SUCCESSORS(state) do
        v = MAX(v, MIN-VALUE(s))
    end
    return v
α-β Pruning
Problem with minimax: the number of examined game states is exponential in the number of moves. We can't eliminate the exponent, but we can effectively cut it in half.
α-β pruning cuts off branches that cannot possibly influence the final decision
[Figure: step-by-step α-β pruning on the two-ply minimax example tree — once one leaf of a MIN node is no better than the best alternative MAX has already found, that node's remaining leaves are pruned]
Why Is It Called α-β?
[Figure: α is the value of the best choice found so far for MAX along the path to the root; if a MIN node's value becomes worse than α, MAX will avoid that node, so its remaining successors can be pruned; β plays the symmetric role for MIN]
α-β Algorithm
function ALPHA-BETA-DECISION(state)
    return the a in ACTIONS(state) maximizing
        MIN-VALUE(RESULT(a, state), -infinity, +infinity)

function MIN-VALUE(state, alpha, beta)
    if TERMINAL-TEST(state) then return UTILITY(state)
    v = +infinity
    for a, s in SUCCESSORS(state) do
        v = MIN(v, MAX-VALUE(s, alpha, beta))
        if v <= alpha then return v
        beta = MIN(beta, v)
    end
    return v

function MAX-VALUE(state, alpha, beta)
    same as MIN-VALUE, but with the roles of alpha and beta reversed
    (maximize v, return v as soon as v >= beta, update alpha = MAX(alpha, v))
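The same algorithm in Python for a game given as an explicit tree (a sketch; the leaf values below are a commonly used two-ply example, not the lecture's own figure):

def alphabeta(node, maximizing, alpha=float("-inf"), beta=float("inf")):
    """Minimax value of a game tree with alpha-beta pruning.
    A node is either a number (terminal utility) or a list of child nodes."""
    if isinstance(node, (int, float)):
        return node
    if maximizing:
        v = float("-inf")
        for child in node:
            v = max(v, alphabeta(child, False, alpha, beta))
            if v >= beta:
                return v            # beta cut-off: MIN will never allow this branch
            alpha = max(alpha, v)
    else:
        v = float("inf")
        for child in node:
            v = min(v, alphabeta(child, True, alpha, beta))
            if v <= alpha:
                return v            # alpha cut-off: MAX will never allow this branch
            beta = min(beta, v)
    return v

# Two-ply example: MAX to move, three MIN nodes with three leaves each.
tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
print(alphabeta(tree, True))        # -> 3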
Properties of α-β Pruning
Pruning does not affect the final result. Good move ordering improves the effectiveness of pruning. With perfect ordering, time complexity = O(b^(m/2)). Unfortunately, 35^50 is still infeasible
Resource Limits
Standard approach: use CUTOFF-TEST instead of TERMINAL-TEST (e.g. a depth limit), and use EVAL instead of UTILITY, i.e. an evaluation function that estimates the desirability of a position. State of the art: Deep Blue, up to 2 × 10^8 nodes/sec. Assume we have 300 seconds: we can go through 6 × 10^10 nodes ≈ 35^(14/2), which reaches depth 14. The evaluation function is the crucial element in the quality of play
Evaluation Functions
For chess, typically a linear weighted sum of features: EVAL(s) = w1·f1(s) + w2·f2(s) + ... + wn·fn(s). E.g. f1(s) = (# white queens − # black queens) with w1 = 9
Evaluation Functions(2)
Exact values don't matter, only the order matters: behavior is preserved under any monotonic transformation of EVAL
[Figure: two game trees whose leaf values differ only by a monotonic transformation (e.g. 1-4 vs. 20-400) yet yield the same move]
Othello: human champions refuse to play computers, which are too good. Go: human champions refuse to play computers, which are too bad (in Go, b > 300)
Nondeterministic Games
In nondeterministic games, chance is introduced by throwing dice, shuffling cards, etc. MAX knows his own moves, but does not know the next possible moves of MIN. We have to add chance nodes in addition to MAX and MIN nodes. The branches leading from each chance node denote the possible events (each with a probability)
Algorithm
EXPECTIMINIMAX gives perfect play
...
if state is a MAX node then
    return the highest EXPECTIMINIMAX-VALUE of SUCCESSORS(state)
if state is a MIN node then
    return the lowest EXPECTIMINIMAX-VALUE of SUCCESSORS(state)
if state is a chance node then
    return the weighted average of the EXPECTIMINIMAX-VALUEs of SUCCESSORS(state)
...
Evaluation Functions
Exact values do matter here Behavior is preserved only by positive linear transformation of EVAL Hence, EVAL should be proportional to expected payoff
[Figure: expectiminimax trees in which rescaling the leaf values non-linearly (e.g. from 1-4 to 1-400) changes the best move]
Example
MAX's hand: 6 6 9 8; MIN's hand: 4 2 10 5 (cards in four suits). MAX leading the 9 is an optimal play (as is leading any other card in this case). MAX will get two tricks on optimal play by MIN; MIN will get two tricks (with the 2 and the 10). Replacing the 4 in MIN's hand with the 4 of another suit does not make a difference. Can be shown with a suitable variant of minimax
Example(2)
Now let's hide one of MIN's cards: MAX does not know which suit MIN's 4 is in. One could argue: leading the 9 is optimal against either possible hand; as MIN has one of these hands, it's still optimal. But: MIN takes a trick with the 10 and leads the 2; MAX has to discard one of the two 6s. If the wrong card is discarded, MAX will get only one trick
Example(3)
MAX is using what we might call "averaging over clairvoyance": computing the minimax value of each action for each possible deal of the cards, then computing the expected value over all deals (using the probability of each deal). If you think this is reasonable, consider the following
Story Example
Day 1: Road A leads to a heap of gold; Road B leads to a fork: turn left and find a mound of jewels, turn right and get run over by a bus. Day 2: Road A leads to a heap of gold; Road B leads to a fork: turn left and get run over by a bus, turn right and find a mound of jewels. Day 3: Road A leads to a heap of gold; Road B leads to a fork: guess correctly and find a mound of jewels, guess incorrectly and get run over by a bus. Choosing Road B on the first two days is "as optimal as" choosing Road A. Would you choose Road B on the third day?
Proper analysis
With partial observability, the intuition that the value of an action is the average of its values in all possible states is wrong. The value of an action depends on the information state (or belief state) the agent is in. The correct strategy is to generate and search a tree of information states. This leads to rational behavior such as acting to obtain information, signaling to one's partner, and acting randomly to minimize information disclosure
Chapter Summary
Games illustrate several important points about AI Perfection is unattainable, we need to approximate Uncertainty constrains the assignment of values to states Optimal decisions depend on information state, not real state
Chapter 8
Brains Neural networks
Outline
Brains
A neuron is a brain cell whose function is to collect and process electrical signals
Feed-forward networks
Neural Networks
Single-layer networks Multi-layer networks Recurrent networks Elman networks Learning Supervised Learning Unsupervised Learning Reinforcement Learning
[Figure: biological neuron — nucleus, dendrites, axon, and synapses connecting to other neurons]
Brains(2)
The brain's information-processing capacity is thought to emerge primarily from networks of neurons. There are approx. 10^11 neurons in the human brain, connected via approx. 10^14 synapses, with 1 ms-10 ms cycle times. Some of the earliest AI work aimed to create artificial neural networks
Artificial Neuron
McCulloch and Pitts devised a simple mathematical model of a neuron. It is a gross oversimplification of real neurons; its purpose was to develop an understanding of what networks of simple units can do
Artificial Neuron(2)
[Figure: a single unit — input links carrying activations a_j with weights W_j,i (including a bias weight W_0,i on a fixed input a_0), the weighted sum in_i, the activation function g, and the output a_i on the output links]
Each unit i first computes the weighted sum of its inputs: in_i = Σ_{j=0..n} W_j,i a_j. Then it applies an activation function g to derive its output: a_i = g(in_i) = g(Σ_{j=0..n} W_j,i a_j)
Activation Function
Activation function g is designed to meet two desiderata: Unit should be active (near 1) when right inputs are given and inactive (near 0) when the wrong inputs are given Activation needs to be nonlinear, otherwise entire neural network collapses into a simple linear function Two typical activation functions are Threshold function (or step function) Sigmoid function In general, activation functions are monotonically increasing
Activation Function(2)
[Figure: (a) threshold activation function and (b) sigmoid activation function, both plotted against in_i; units with suitable weights (e.g. W_1 = W_2 = 1, bias W_0 = 0.5 for OR) implementing Boolean gates]
(a) is the threshold function: g(x) = 1 for x > 0, and 0 otherwise. (b) is the sigmoid function: g(x) = 1 / (1 + e^(-x)). Usually the bias weight W_0,i is used to move the threshold location: g(in_i) = g(x − W_0,i), where x is the weighted sum of the remaining inputs
Using neurons, we can build a network to compute any Boolean function of the inputs
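For instance, a single threshold unit with suitably chosen weights computes AND, OR, or NOT of its inputs; a small sketch (the particular weight and bias values are just one workable choice, not taken from the lecture):

def unit(weights, bias, inputs):
    """McCulloch-Pitts style threshold unit: output 1 iff the weighted sum exceeds the bias."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) > bias else 0

AND = lambda a, b: unit([1, 1], 1.5, [a, b])
OR  = lambda a, b: unit([1, 1], 0.5, [a, b])
NOT = lambda a:    unit([-1], -0.5, [a])

for a in (0, 1):
    for b in (0, 1):
        print(a, b, AND(a, b), OR(a, b))
print(NOT(0), NOT(1))   # -> 1 0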
Network Structures
Two main categories of neural network structures. Feed-forward networks: represent a function of the current input; no internal state other than the weights. (Cyclic or) recurrent networks: feed outputs back into inputs; are dynamical systems (may reach a stable state, exhibit oscillations, or even chaotic behavior); can support short-term memory. This makes them more interesting, but also harder to understand
Feed-forward Example
[Figure: feed-forward network with input units 1 and 2, hidden units 3 and 4, and output unit 5, connected by weights W_1,3, W_1,4, W_2,3, W_2,4, W_3,5, W_4,5]
Single-layer Networks
A network with all inputs connected directly to the outputs is called a single-layer neural network (or perceptron network). Each output unit is independent of the others, so we look at a single output unit. We start by examining the expressiveness of perceptrons. As already seen, simple Boolean functions are possible. The majority function (output 1 if more than half of the inputs are 1) is also possible: W_j = 1, threshold W_0 = n/2
Simple neural network with two inputs, one hidden layer of two units, and one output Feed-forward networks are usually arranged in layers (each unit receives input from the immediately preceding layer)
Expressiveness
Expressiveness(2)
A perceptron represents a linear separator in input space. A threshold perceptron returns 1 iff the weighted sum of its inputs is positive: Σ_{j=0..n} W_j x_j > 0. Or, interpreting the W_j and x_j as vectors, W · x > 0. This defines a hyperplane in the input space; the perceptron returns 1 if the input is on one side of that plane
Expressiveness(3)
[Figure: input spaces for (a) x1 AND x2, (b) x1 OR x2, and (c) x1 XOR x2 — the first two are linearly separable, the third is not]
Consider perceptron with threshold function Can represent AND, OR, NOT, majority, but e.g. not XOR:
[Figure: single-layer network (input units wired directly to output units via weights W_j,i) and the output of a two-input perceptron plotted over the input space]
Output units all operate separately, no shared weights Adjusting weights moves the location, orientation, and steepness of cliff
Multilayer Networks
Layers are usually fully connected
[Figure: two-layer feed-forward network — input units I_k, weights W_k,j, hidden units a_j, weights W_j,i, output units O_i]
Expressiveness
All continuous functions with 2 layers, all functions with 3 layers
[Figure: output h_W(x1, x2) of a multilayer network — combining threshold functions produces a ridge and then a bump over the input space]
Recurrent Networks
Recurrent neural networks have feedback connection (to store information over time) Elman network is a simple one that makes a copy of the hidden layer This copy is called context layer Context layer stores the previous state of the hidden layer
Combine two opposite-facing threshold functions to make a ridge. Combine two perpendicular ridges to make a bump. Add various bumps to fit any surface
Elman network
Elman Network(2)
The context layer feeds previous network states into the hidden layer. Input vector: x = (x_1, ..., x_n, x_{n+1}, ..., x_{2n}), where the first n components are the actual inputs and the last n are the context units. The connection from each hidden unit to its corresponding context unit has weight 1. The context units are fully interconnected with all hidden units (not necessarily with weight 1)
Learning
For simple Boolean or majority functions it is easy to find appropriate weights. Generally, by adjusting the weights, we change the function that a network represents. That is how learning occurs in neural networks. When we have no prior knowledge about the function except for data, we have to learn values for the W_j from this data
[Figure: Elman network — input units and the context layer feed the hidden units; the hidden units feed the output units, and their activations are copied back into the context layer]
Learning(2)
Three main types of learning (we are looking at the first two). Supervised learning: the network is provided with a data set of input vectors and desired outputs (the training set); adjust the weights so that the error between the real output and the desired output is minimized. Unsupervised learning: clusters the training set to discover patterns or features in the input data. Reinforcement learning: reward the network for good performance, penalize it for bad performance
Supervised Learning
Gradient descent is a widely popular approach to train (single-layer) networks. Idea: adjust the weights of the network to minimize some measure of the error on the training set. The classical measure of error is the sum of squared errors. The squared error for a single training example with input x and desired output y is E = ½ Err² = ½ (y − h_W(x))², where h_W(x) is the output of the perceptron
Gradient Descent
Depending on the gradient of the error, we increase or decrease the weight
[Figure: error as a function of a weight — gradient descent moves the weight toward the minimum]
Gradient Descent(2)
For calculating the gradient, we need some calculus We need to determine a partial derivative of E with respect to each weight:
∂E/∂W_j = ∂(½ Err²)/∂W_j = Err · ∂Err/∂W_j = Err · ∂/∂W_j ( y − g(Σ_{j=0..n} W_j x_j) ) = −Err · g′(in) · x_j
Gradient Descent(3)
In the gradient descent algorithm, Wj s are updated as follows:
In the gradient descent algorithm, the W_j are updated as follows: W_j ← W_j + α · Err · g′(in) · x_j, where α is the learning rate
Gradient Descent(4)
Complete algorithm runs training examples through the net one at a time (adjusting the weights slightly) Each cycle is called an epoch Epochs are repeated until some stopping criterion is reached E.g. weight changes become very small Only converges for linearly separable data set
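In code, one epoch of this update rule over a small training set might look as follows (a sketch with a sigmoid unit; the data, learning rate, and epoch count are arbitrary illustrative choices):

import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train(examples, n_inputs, alpha=0.5, epochs=5000):
    """Gradient descent for a single sigmoid unit.  W[0] is the bias weight (fixed input 1)."""
    w = [random.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
    for _ in range(epochs):
        for x, y in examples:                      # one pass over the data = one epoch
            a = [1.0] + list(x)                    # prepend the fixed bias input
            s = sum(wj * aj for wj, aj in zip(w, a))
            err = y - sigmoid(s)
            gprime = sigmoid(s) * (1.0 - sigmoid(s))
            w = [wj + alpha * err * gprime * aj for wj, aj in zip(w, a)]
    return w

# Learn the (linearly separable) OR function.
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w = train(data, 2)
print([round(sigmoid(w[0] + w[1] * x1 + w[2] * x2)) for (x1, x2), _ in data])   # typically [0, 1, 1, 1]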
Gradient Descent(5)
Different variants for cycling through training examples: Batch: adding up all gradient contributions and adjusting weights at end of epoch Stochastic: select examples randomly There are many other methods besides gradient descent: Widrow-Hoff Generalized Delta Error-Correction ...
Back-propagation Learning
Learning in a multi-layer network is a little different. Minor difference: we now have several outputs and an output vector h_W(x). Major difference: the error at the output layer is clear, but the error in the hidden layers is unclear. Idea: back-propagate the error from the output layer to the hidden layers
Back-propagation Learning(2)
At the output layer, the weight update is identical to gradient descent. We have multiple output units, so let Err_i be the i-th component of the error vector y − h_W(x); then
W_j,i ← W_j,i + α · Err_i · g′(in_i) · x_j, i.e. W_j,i ← W_j,i + α · Δ_i · x_j with Δ_i = Err_i · g′(in_i)
Back-propagation Learning(3)
Idea: hidden node j is responsible for some fraction of the error in the nodes to which it connects
The Δ_i values are divided according to the weights of the connections: Δ_j = g′(in_j) · Σ_i W_j,i Δ_i. Now we can use the same weight-update rule for the hidden nodes: W_k,j ← W_k,j + α · Δ_j · x_k
Back-propagation Learning(4)
The back-propagation process can be summarized as follows: compute the Δ values for the output units (using the observed error); then, starting with the output layer, repeat for each layer until the earliest layer is reached: propagate the Δ values back to the previous layer and update the weights in the previous layer
Unsupervised Learning
In supervised learning, a supervisor (or teacher) presents an input pattern and a desired response, and the network tries to learn the functional mapping between input and output. Unsupervised learning's objective is to discover patterns or features in the input data. This is done with no help or feedback from a teacher: no explicit target outputs are prescribed; however, similar inputs will result in similar outputs
Reinforcement Learning
In supervised learning an input data set and a full set of desired outputs is presented In reinforcement learning the feedback is not as elaborate Desired output is not described explicitly Learning network only gets feedback whether output was a success or not Learning with a critic (rather than learning with a teacher) Main objective is to maximize the (expected) reward or reinforcement signal
Reinforcement Learning(2)
General situation: [Figure: the learner receives sensory input and a reward signal from the environment, and acts on the environment]
Learning Rule
Neural network reinforcement learning usually requires a multi-layer architecture
An external evaluator is needed to decide whether network has scored a success or not Every node in the network receives a scalar reinforcement signal r representing quality of output
Compared to back-propagation (where output nodes receive error signal, which is propagated backward), here every node receives same signal
Learning Rule(2)
Mazzoni et al. presented following weight-update algorithm (based on Hebbian learning):
W_j,i ← W_j,i + ρ · ( (g(in_i) − p_i) · g(in_j) · r + (1 − g(in_i) − p_i) · g(in_j) · (1 − r) )
Learning Structures
So far, we have only looked at learning weights (given a fixed network structure). How do we find the best network structure? Choosing a network that is too small: it may not be powerful enough to get the task done. Choosing a network that is too big: the problem of overfitting — the network memorizes examples rather than generalizing
Learning Structures(2)
If we stick to fully connected networks, the only choices are the number of hidden layers and the number of neurons in each. Usual approach: try several and keep the best. Try to keep the network small to avoid overfitting
where ρ is a constant and p_i is the probability of neuron i firing. A correct response (large r) will strengthen connections that were active during the response; an incorrect response (small r) will weaken active synapses
Learning Structures(3)
Let us now consider networks that are not fully connected. We need some effective search method to weed out connections. One approach is the "optimal brain damage" algorithm: start with a fully connected network and remove connections from it. After first training the network, an information-theoretic approach identifies a selection of connections to be dropped; the network is retrained, and if performance has not decreased, the process is repeated. It is also possible to remove neurons that are not contributing much to the result
Learning Structures(4)
There are several algorithms for growing a larger network from a smaller one. The tiling algorithm starts with a single unit that tries its best; subsequent units are added to take care of the examples that the first unit got wrong. The algorithm adds only as many units as are needed to cover all examples
Applications: Speech Recognition
Applications: Handwriting Recognition
Applications: Fraud Detection
Banks are using AI software (including neural networks) to detect fraud Have the ability to detect fraudulent behavior by analyzing transactions and alerting staff
Applications: CNC
Neural networks are also used in computer numerically controlled (CNC) machines E.g. Siemens SINUMERIK 840D controller for drilling, turning, milling, grinding and special-purpose machines
400-300-10 unit network: 1.6% error 768-192-30-10 unit LeNet: 0.9% error
Credit card fraud losses in the UK fell for the first time in nearly a decade in 2003 (by more than 5%, to 402.4m pounds). Barclays reported that after installing such a system in 1997, fraud was reduced by 30% by 2003
Applications: Drug Design
Used for testing whether certain anti-inflammatory drugs cause adverse reactions. The rate of these reactions is about 10% (with 1% serious and 0.1% fatal). A three-layer, backpropagation-trained network was used to predict serious reactions. The predicted rate matched the observed rate to within 5%
Chapter Summary
Neural networks are an AI technique modeled on the brain Single-layer feed-forward networks can represent linearly separable functions Multi-layer feed-forward networks can represent any function (given enough units) Recurrent networks can store information over time Many different techniques to train networks Neural networks have been used for hundreds of applications
Chapter 9
Evolutionary Computing
Outline
Introduction to Evolutionary Computing Genetic algorithms Evolutionary programming
Introduction
Genetic algorithms were already mentioned when discussing local search algorithms; now we have a closer look. (Biological) evolution is an optimization process with the aim to improve the ability to survive. The characteristics of an individual are contained in his/her chromosomes. After sexual reproduction, the offspring's chromosomes consist of a combination of the parents' chromosomes. The process of natural selection allows more fit individuals to produce more offspring. One expects the offspring to have similar or even better fitness
Introduction(2)
Occasionally mutations occur. These have a random effect on the chromosomes of an individual and may improve or worsen the fitness of an individual (or the offspring). They introduce some variation into a population. Evolutionary Computing (EC) emulates the process of natural selection in a search procedure
Evolutionary Computing
An evolutionary algorithm (EA) is a stochastic search algorithm comprising: an encoding of solutions to a problem in the form of chromosomes; an initial state: the starting population (usually with randomly determined chromosomes); a successor function: generating offspring given two parents; an evaluation function (or fitness function): determining the fitness of an individual; a selection function: choosing the individuals to reproduce
Example
Solving the 8-queens problem using GAs
The n-th number in the chromosome stands for the position of the queen within the n-th column
Example(2)
Initial population: 24748552, 32752411, 24415124, 32543213
Fitness values and selection probabilities: 24 (31%), 23 (29%), 20 (26%), 11 (14%)
Example(3)
Offspring after crossover and mutation: 32748152, 24752411, 32252124, 24415417
Algorithm
function GENETIC-ALG(population, FITNESS-FN)
  repeat
    new_pop = empty set
    loop for i from 1 to SIZE(population) do
      x = RAND-SELECT(population, FITNESS-FN)
      y = RAND-SELECT(population, FITNESS-FN)
      child = REPRODUCE(x, y)
      if small random probability then child = MUTATE(child)
      add child to new_pop
    end
    population = new_pop
  until some individual fit enough or enough time has elapsed
  return best individual
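A minimal Python sketch of this loop, specialised to the 8-queens encoding of the example (rows are numbered 0-7 here instead of 1-8); population size, mutation probability and generation limit are assumed values.

import random

N = 8  # board size: chromosome[i] = row of the queen in column i

def fitness(chrom):
    """Number of non-attacking pairs of queens (28 means a solution for N=8)."""
    ok = 0
    for i in range(N):
        for j in range(i + 1, N):
            if chrom[i] != chrom[j] and abs(chrom[i] - chrom[j]) != j - i:
                ok += 1
    return ok

def select(population):
    """Random selection biased by fitness (roulette wheel)."""
    return random.choices(population, weights=[fitness(c) for c in population])[0]

def reproduce(x, y):
    """Single random crossover point; child takes a prefix of x and a suffix of y."""
    c = random.randint(1, N - 1)
    return x[:c] + y[c:]

def mutate(chrom):
    """Replace the row of one randomly chosen queen."""
    i = random.randrange(N)
    return chrom[:i] + [random.randrange(N)] + chrom[i + 1:]

def genetic_algorithm(pop_size=100, p_mut=0.05, max_gen=1000):
    population = [[random.randrange(N) for _ in range(N)] for _ in range(pop_size)]
    for _ in range(max_gen):
        best = max(population, key=fitness)
        if fitness(best) == N * (N - 1) // 2:   # all pairs non-attacking: solved
            return best
        new_pop = []
        for _ in range(pop_size):
            child = reproduce(select(population), select(population))
            if random.random() < p_mut:
                child = mutate(child)
            new_pop.append(child)
        population = new_pop
    return max(population, key=fitness)

print(genetic_algorithm())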
Fitness
Selection
As fitness function we use the number of nonattacking pairs of queens (here the probability of being chosen is proportional to fitness) Two individuals are chosen randomly (biased by these probabilities) for reproduction A random crossover point determines which fragments will be exchanged when reproducing
000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 111 000 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111
111 000 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111
Mutation
The aim of mutation is to introduce new genetic material, adding diversity to the population Usually a small probability for mutations to occur is chosen
This ensures that good solutions are not distorted too much An initially large mutation rate that decreases exponentially can also be quite successful (similarity to simulated annealing)
Assessment of GAs
Genetic algorithms are similar to stochastic local beam searches Combine an uphill tendency in searching with random exploration Exchange information among parallel search threads Crossover seems to be the crucial component of GAs
Assessment of GAs(2)
However, crossover conveys no advantage if the positions of the chromosomes are randomly permuted initially The advantage comes from combining large blocks that have evolved independently to perform a useful function E.g. putting queens in positions 7, 4, and 2 in the first three columns They don't attack each other (useful block) Could be combined with another useful block to construct a solution (e.g. 58136, which is also a useful block) This raises the level of granularity at which the search takes place
Assessment of GAs(3)
The theory of GAs explains this with the idea of a schema A schema is a substring in which some of the positions can be left unspecified E.g. 742***** describes all states in which the first three queens are at positions 7, 4, and 2 Strings that match a schema are called instances of a schema E.g. 74213378
Assessment of GAs(4)
If the average fitness of a schema's instances is above average, the number of this schema's instances will grow over time This effect is unlikely to be significant if adjacent positions are totally unrelated There will be few contiguous blocks that provide consistent benefit GAs work best when schemas correspond to meaningful components of a solution Successful use of genetic algorithms requires careful engineering of the representation
Applications
Parametric Design of Aircraft Optimizing aircraft designs Task is posed as that of optimizing a list of parameters Routing in circuit-switched telecommunications networks Optimize the routing of telephone networks in order to minimize costs to US West Hybridization with other algorithms can lead to better performance
Applications(2)
Robot trajectory generation Planning the path that a robot arm takes Not only optimizing the length of the path, but also wear and tear on arm (by acceleration and deceleration) Tuning for sonar information processing Training neural networks classifying sonar signals using GAs
Evolutionary Programming
Evolutionary Programming (EP) emphasizes behavioral models and not genetic models EP is derived from the simulation of adaptive behavior in evolution EP considers phenotypes (not genotypes) It is about finding a set of optimal behaviors; the fitness function measures behavioral error
Finite-State Machines
How do we model behavior? A popular way to do this is using finite-state machines (FSMs) An FSM describes a sequence of actions that are taken Each action depends on the current state of the machine and the input
Finite-State Machines(2)
Formal definition:
FSM = (S, s_0, I, O, δ, ω)
where
S is a finite set of states, s_0 ∈ S is the initial state, I is a finite set of input symbols, O is a finite set of output symbols, δ : S × I → S is the next-state function, ω : S × I → O is the next-output function
Finite-State Machines(3)
Example: S = {X, Y, Z}, s_0 = Z, I = {0, 1}, O = {f, t}
(Figure: state-transition diagram over the states X, Y, Z; edges are labelled input/output with the labels 0/t, 0/f, 0/f, 1/t, 1/t, 1/f)
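The machine can be written down directly as two lookup tables. The input/output edge labels follow the figure, but since the diagram is not fully recoverable, the next-state entries marked as assumed are illustrative guesses; they are chosen so that the machine reproduces the prediction example shown a few slides below.

# Dictionary-based FSM matching the formal definition above.
# Next-state targets marked "assumed" are not recoverable from the slide.
DELTA = {('Z', '0'): 'X', ('Z', '1'): 'Z',   # ('Z','1') target assumed
         ('X', '0'): 'Y', ('X', '1'): 'X',   # ('X','1') target assumed
         ('Y', '0'): 'Y', ('Y', '1'): 'Z'}   # ('Y','0') target assumed
OMEGA = {('Z', '0'): 'f', ('Z', '1'): 't',
         ('X', '0'): 't', ('X', '1'): 'f',
         ('Y', '0'): 'f', ('Y', '1'): 't'}

def run_fsm(inputs, state='Z'):
    """Feed an input sequence to the FSM and collect the output sequence."""
    outputs = []
    for symbol in inputs:
        outputs.append(OMEGA[(state, symbol)])
        state = DELTA[(state, symbol)]
    return outputs

print(run_fsm('0010011'))  # ['f', 't', 't', 'f', 't', 't', 't']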
FSMs and EP
Evolutionary programming was originally developed to evolve finite-state machines The aim of early applications was to make predictions about the future Given a sequence of previously observed symbols, evolve a program to predict the next symbol
Making Predictions
For the example FSM the following input sequence produces the following output:
Sequence: 0 0 1 0 0 1 1 ...
Output:   f t t f t t t ...
Interpreting f=0 and t=1, the FSM made just one mistake
Algorithm
function EVOLUTION-PROG(population, FITNESS-FN)
  repeat
    new_pop, tmp_pop = empty set
    loop for i from 1 to SIZE(population) do
      child = SELECT-NTH(i, population)
      child = MUTATE(child)
      add child to tmp_pop
    end
    new_pop = SELECT-FITTEST(SIZE(population), population + tmp_pop, FITNESS-FN)
    population = new_pop
  until some individual fit enough or enough time has elapsed
  return best individual
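A compact Python sketch of the same loop; fitness and mutate are assumed problem-specific callbacks (for FSMs, mutation might change a transition target or an output symbol), and the tiny numeric example at the end is purely illustrative.

import random

def evolutionary_programming(population, fitness, mutate, max_gen=1000, good_enough=None):
    """Each parent produces exactly one mutated child; the fittest SIZE(population)
    individuals of parents + children form the next generation (no crossover)."""
    for _ in range(max_gen):
        children = [mutate(ind) for ind in population]
        pool = sorted(population + children, key=fitness, reverse=True)
        population = pool[:len(population)]
        if good_enough is not None and fitness(population[0]) >= good_enough:
            break
    return population[0]

# Illustrative toy problem: evolve a number towards 42 by random mutation
best = evolutionary_programming(
    population=[random.uniform(0, 100) for _ in range(20)],
    fitness=lambda x: -abs(x - 42),
    mutate=lambda x: x + random.gauss(0, 1),
    good_enough=-0.01)
print(best)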
Mutation
In the original EP, there was no crossover, only mutation One point in the behavioral space was seen as standing for a species, not an individual However, EP can be combined with GAs
Application
Nark: a bug-finding compiler extension developed at Stanford Static analysis techniques are used to find bugs in software systems One system, Metacompilation (MC), allows users to encode rules such as "Don't use freed memory" or "Don't call blocking functions with interrupts disabled" To encode a rule, the user describes it as a state machine The source code is used as input to the state machine
Application(2)
Unfortunately, encoding those rules is quite complicated Nark allows the user to simply give examples of (a class of) bugs Nark evolves the checker for the rule itself The complexity of the classes of bugs that Nark is able to find is still somewhat limited
Chapter Summary
In practice, genetic algorithms have had a widespread impact on optimization problems At present it is not quite clear whether the appeal arises from their performance or from their aesthetically pleasing origins Similar things can be said about evolutionary programming (although it is not as widespread as GAs)
Chapter 10
Swarm Intelligence
Outline
General Introduction to Swarm Intelligence Particle swarm optimization (PSO) Ant colony optimization (ACO)
Swarm Intelligence
Simple agents interacting locally with one another and their environment No central control or data source (Simple) local interactions often lead to the emergence of (complex) global behavior Examples found in nature: Ant colonies/bee hives Bird flocking Animal herding Bacteria molding Fish schooling
Examples
(Figure slides: examples of swarm behavior in nature)
Algorithm
Formally speaking, each particle i has: a position vector x_i(t) describing its position at time t, and a current velocity v_i(t) at time t
Algorithm(2)
The position of a particle is changed by adding the velocity vector to the position vector:
x_i(t + 1) = x_i(t) + v_i(t)
Algorithm(3)
Change the velocity vector slightly (random factor r) to point into the direction of the best neighbor:
v_i(t + 1) = v_i(t) + r (x_best_i(t) − x_i(t))
As this may lead to unanimous, unchanging directions, sometimes a random "craziness" factor is added
function SWARM-OPT(population, FITNESS-FN)
  repeat
    loop for i from 1 to SIZE(population) do
      fit_i = FITNESS-FN(SELECT-NTH(i, population))
    end
    loop for i from 1 to SIZE(population) do
      look for fittest neighbor of particle i
      change velocity vector v_i
      change position x_i
    end
  until some individual fit enough or enough time has elapsed
  return best individual
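A minimal Python sketch of this basic form of PSO (global-best neighborhood, no inertia weight or personal-best term); the position update applies the freshly updated velocity, and particle count, step count and search range are assumed values.

import random

def pso(fitness, dim, n_particles=30, steps=200, lo=-5.0, hi=5.0):
    """Minimal 'follow the fittest particle' PSO using the update rules above."""
    x = [[random.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    v = [[0.0] * dim for _ in range(n_particles)]
    best_seen, best_fit = None, float('-inf')
    for _ in range(steps):
        fit = [fitness(p) for p in x]
        leader = x[fit.index(max(fit))][:]               # fittest particle this step
        if max(fit) > best_fit:
            best_fit, best_seen = max(fit), leader[:]
        for i in range(n_particles):
            r = random.random()
            for d in range(dim):
                v[i][d] += r * (leader[d] - x[i][d])     # v_i(t+1) = v_i(t) + r (x_best_i(t) - x_i(t))
                x[i][d] += v[i][d]                       # move the particle with the new velocity
    return best_seen

# Example: maximise -(a^2 + b^2), i.e. move the swarm towards the origin
print(pso(lambda p: -(p[0] ** 2 + p[1] ** 2), dim=2))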
Neighborhood
Different neighborhood types have been defined and studied Star topology Every particle can communicate with every other particle Each particle is attracted to the best global solution Was used in the first version of PSO
Neighborhood(2)
Ring topology Every particle communicates with its n immediate neighbors Diagram below shows case for n = 2 Hybrids with star topology are possible (vi is changed towards best neighbor and best overall)
Neighborhood(3)
Wheel topology Only one particle is connected to all others, all other particles are only neighbors to this focal particle Isolates particles from each other, all particles communicate through focal particle Creates a follow the leader effect
Applications
Biochemistry: Improving the fermentation medium for Echinocandin B production Military: Traveling Salesman Problem for Surveillance Mission Electrical engineering: Reactive Power and Voltage Control
Division of labor in an ant colony:
Reproduction: queen
Brood care: specialized worker
Food collection: specialized worker
Defense: soldier
Nest cleaning: specialized worker
Nest building & maintenance: specialized worker
Seems to occur magically Actually it is based on two different things: anatomical differences and stigmergy
Food Collection
Ants have the ability to find the shortest path between a food source and their nest
Food Collection(2)
Several experiments have been conducted to study this behavior Initially, paths are chosen randomly With time, more and more ants follow the shorter path
Food Collection(3)
What's the reason for that? The common ant is not very intelligent (a few hundred neurons) It's done via stigmergy When walking around, each ant leaves behind a pheromone trail When an ant has to decide which path to follow, usually it picks the one with the higher pheromone concentration Ants on the shorter path will return faster, leaving more pheromone on this path in a shorter time Also, pheromone evaporates with time, so the pheromone on the longer path will vanish faster
ACO usually performs better if mixed with other heuristics (e.g. greedy local optimization taking shortest path):
The probability that ant k, currently at node i, moves to node j:
φ_{i,k}(j) = (τ_ij · η_ij) / ( Σ_{c ∈ J_i^k} τ_ic · η_ic )
where τ_ij is the amount of pheromone on the edge between two nodes i and j, η_ij is the heuristic desirability of that edge (e.g. the inverse of its length, cf. the greedy heuristic above), and J_i^k is the set of nodes ant k may still visit from node i
Algorithm
function ACO-TSP()
  nant = NUMBER-OF-ANTS()
  nnode = NUMBER-OF-NODES()
  place ants on nodes
  repeat
    loop for k from 1 to nant do
      loop for step from 1 to nnode do
        choose next node according to probability phi
      end
    end
    update pheromone trails
  until some tour good enough or enough time has elapsed
  return best tour found so far
The amount of pheromone deposited by ant k on edge (i, j):
Δτ_ij = Q / L_k if ant k used edge (i, j) in its tour, and Δτ_ij = 0 otherwise
where L_k is the length of the tour found by ant k (and Q is a constant) The longer the tour, the worse the solution, and the smaller the amount of pheromone awarded to each link
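Putting the transition probability and the pheromone update together gives a small ACO sketch for the TSP; the evaporation rate rho, deposit constant Q, initial pheromone level and the number of ants and iterations are assumed parameters not given on the slides.

import random

def aco_tsp(dist, n_ants=20, iterations=100, rho=0.5, Q=1.0):
    """ACO for the TSP: every ant builds a tour node by node using the probability
    phi above; then pheromone evaporates (rate rho) and each ant deposits Q / L_k."""
    n = len(dist)
    tau = [[1.0] * n for _ in range(n)]                      # pheromone tau_ij
    eta = [[0.0 if i == j else 1.0 / dist[i][j] for j in range(n)]
           for i in range(n)]                                # heuristic eta_ij = 1 / distance
    best_tour, best_len = None, float('inf')

    def tour_length(tour):
        return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))

    for _ in range(iterations):
        tours = []
        for _ in range(n_ants):
            node = random.randrange(n)
            tour, todo = [node], set(range(n)) - {node}
            while todo:
                cand = list(todo)
                weights = [tau[node][j] * eta[node][j] for j in cand]  # ~ phi_{i,k}(j)
                node = random.choices(cand, weights=weights)[0]
                tour.append(node)
                todo.remove(node)
            tours.append(tour)
        for i in range(n):                                   # evaporation
            for j in range(n):
                tau[i][j] *= (1.0 - rho)
        for tour in tours:                                   # pheromone deposit
            length = tour_length(tour)
            if length < best_len:
                best_len, best_tour = length, tour
            for i in range(n):
                a, b = tour[i], tour[(i + 1) % n]
                tau[a][b] += Q / length                      # delta tau_ij = Q / L_k
                tau[b][a] += Q / length
    return best_tour, best_len

# Tiny symmetric example with four cities on a unit square
cities = [(0, 0), (0, 1), (1, 1), (1, 0)]
d = [[((xa - xb) ** 2 + (ya - yb) ** 2) ** 0.5 for (xb, yb) in cities] for (xa, ya) in cities]
print(aco_tsp(d))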
Applications
One important field in which ACO has been applied is telecommunications routing When routing calls through a network, they go through a number of intermediate switching stations In a large network there are many possible routes Some network parts may experience congestion while others have spare capacity Load balancing tries to distribute calls over the network such that almost no calls are lost and there is a short route between callers
Applications(2)
ACO has been used to optimize the BT network The figure on the right shows the British Synchronous Digital Hierarchy (SDH) network M. Ward, "There's an ant in my phone", New Scientist, 24 January 1998
Applications(3)
Centralized control systems scale badly Usually decentralized approach with several routers is used, each with (local) routing information Main idea of ACO: Enhance routing tables with pheromone information Send virtual ants through the network going from a random source to a random destination Ant going through network updates pheromone information depending on quality of connection (length, congestion)
Chapter Summary
Particle swarm optimization seems to be an efficient and robust technique Although its full potential has not been tapped yet The study of ant colonies is still a young field in computational intelligence More interesting applications are still to be explored
Chapter 11
Fuzzy Systems
Outline
Fuzzy sets and fuzzy logic Approximate reasoning/fuzzy controllers
Motivation
The development of logic has a long and rich history (many philosophers played a role) The foundations of two-valued logic come from Aristotle (Laws of Thought) 400 B.C.: Law of the Excluded Middle: every proposition must have only one of two outcomes: true or false Even back then, there were objections: the Cretan philosopher Epimenides of Knossos said: "All Cretans are liars"
Motivation(2)
Many successes have been achieved with two-valued logic However, not all problems can be mapped into the domain of two-valued variables In most real-world problems incomplete, imprecise, vague, or uncertain data has to be represented With fuzzy logic, domains are characterized by linguistic terms (rather than numbers), e.g. "It is partly cloudy" or "John is very tall" partly and very describe the magnitude of the (fuzzy) variables cloudy and tall
Motivation(3)
In the 1900s Łukasiewicz proposed an alternative in the form of a three-valued logic The possible values are true, false, and undecided Later on, he extended it to a four- and five-valued logic In 1965 Zadeh produced the foundations of an infinite-valued logic in the form of fuzzy logic It was ignored for some time and really took off only after being reimported from Japan
Set Theory
We want to construct the set of all large ants Suppose ants longer than 1.5cm are considered large Clearly, an ant with a length of 3cm will belong, one with 0.5cm will not What about an ant with length 1.48cm or 1.52cm?
Set Theory(2)
Regular sets or crisp sets have a rigid distinction Either an element belongs to the set or not Formally speaking, we have a membership function m_A(x) for set A, which maps elements x of the domain X onto 0 or 1:
m_A : X → {0, 1}
Set Theory(3)
A graphical presentation of our set large ants looks like this:
Fuzzy Sets
In contrast to crisp sets, fuzzy sets have membership degrees That means, in addition to the values 1 (belongs to) and 0 (does not belong to), an element can have any value in between (kind of belongs to) Formally speaking, the membership function μ_A(x) for a fuzzy set A maps elements x to any value in the interval [0, 1]:
μ_A : X → [0, 1]
Fuzzy Sets(2)
A fuzzy set for large ants could look like this:
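A small sketch contrasting the crisp and the fuzzy membership function for large ants; the exact break points of the fuzzy set (membership 0 below 1.0cm, 1 above 2.0cm, hence 0.5 at 1.5cm) are assumptions, since the slide only shows the general shape.

def crisp_large_ant(length_cm):
    """Crisp set: an ant is 'large' exactly when it is longer than 1.5 cm."""
    return 1 if length_cm > 1.5 else 0

def fuzzy_large_ant(length_cm, low=1.0, high=2.0):
    """Fuzzy set: membership rises linearly from 0 (at 1.0 cm) to 1 (at 2.0 cm);
    the break points are assumed for illustration."""
    if length_cm <= low:
        return 0.0
    if length_cm >= high:
        return 1.0
    return (length_cm - low) / (high - low)

for length in (0.5, 1.48, 1.52, 3.0):
    print(length, crisp_large_ant(length), round(fuzzy_large_ant(length), 2))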
Fuzzy Operators
Complement (logical NOT): μ_¬A(u) = 1 − μ_A(u) Union (logical OR): μ_A∪B(u) = max(μ_A(u), μ_B(u)) Intersection (logical AND): μ_A∩B(u) = min(μ_A(u), μ_B(u)) There are alternatives to these operators (which we will not look at here) All operators need to satisfy certain axioms (e.g. commutativity, associativity for union and intersection)
Fuzzy Operators(2)
(Figures: complement of a fuzzy set; union shown in green and intersection in red)
Rudimentary Reasoning
Using the aforementioned operators we can do some simple reasoning For example, consider the three fuzzy sets tall, good_athlete, and good_basketball_player Now assume:
μ_tall(Michael Jordan) = 0.9, μ_good_athlete(Michael Jordan) = 0.9, μ_tall(Sven) = 0.9, μ_good_athlete(Sven) = 0.2
Rudimentary Reasoning(2)
If we know that a good basketball player is tall and a good athlete, then which one is the better player? We can apply the intersection operator and get:
μ_good_basketball_player(Michael Jordan) = min(0.9, 0.9) = 0.9 and μ_good_basketball_player(Sven) = min(0.9, 0.2) = 0.2 So Michael Jordan is the better player However, this is a very simplistic situation For most real-world problems, we have to model much more complex scenarios For these cases (rule-based) fuzzy controllers are used
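The standard operators and the basketball example above can be checked with a few lines of Python:

def f_not(a):            # complement
    return 1.0 - a

def f_or(a, b):          # union
    return max(a, b)

def f_and(a, b):         # intersection
    return min(a, b)

# Degrees of membership from the example above
tall = {'Michael Jordan': 0.9, 'Sven': 0.9}
good_athlete = {'Michael Jordan': 0.9, 'Sven': 0.2}

# A good basketball player is tall AND a good athlete
for person in tall:
    print(person, f_and(tall[person], good_athlete[person]))
# Michael Jordan -> 0.9, Sven -> 0.2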
Fuzzy Controllers
Mainly used for controlling complex dynamic systems In that case, a formal description by mathematical models is very difficult or even impossible Instead of a mathematical model, the knowledge of human experts in the form of linguistic variables and rules is employed
Fuzzy Controllers(2)
In principle, fuzzy controllers work as follows The controller observes its environment, checking for unusual events In a fuzzification phase the input data is transformed into fuzzy sets Based on a (fuzzy) rule set the input data is evaluated and certain actions may be triggered The output data (which is also described in terms of fuzzy sets) needs to be defuzzified
Fuzzy Controllers(3)
Fuzzy controllers can also be seen as intelligent agents using fuzzy logic for their reasoning:
(Diagram: fuzzy controller with its rule set, acting as an agent in its environment)
Fuzzy Rules
Fuzzy rules are of the general form: if antecedent(s) then consequent(s) The antecedents of a rule form a combination of fuzzy sets (which are connected via logic operators) The consequent part is usually a single fuzzy set (multiple combined fuzzy sets can also appear)
Example
Let us look at an exemplary application to clarify the functionality of a fuzzy controller We want to monitor the performance of a Web server running on a cluster The goal is to do (automatic) load balancing in order to use resources efficiently
Example(2)
The cpu load of a machine is described using fuzzy sets:
(Figure: membership functions of the fuzzy sets low, medium, and high for the cpu load)
Example(3)
A machine with a cpu load of 60% has a medium load to a degree of 0.5 and a high load to a degree of 0.2:
Example(4)
We have different ways to react to a situation To keep things simple, we look at two of them Scale-up: moving a service to a more powerful machine Scale-out: starting a new instance of a service
Example(5)
Let's assume that cpuLoad and performanceIndex are input variables (performanceIndex expressing how powerful a machine is) and scaleUp and scaleOut are output variables Then rules could look like this:
IF (cpuLoad IS high AND (performanceIndex IS low OR performanceIndex IS medium)) THEN scaleUp IS applicable
IF (cpuLoad IS high AND performanceIndex IS high) THEN scaleOut IS applicable
Example(6)
Let's assume that we have a cpu load of 90%; then for the degrees of membership we get:
μ_low_load(90) = 0.0, μ_medium_load(90) = 0.0, μ_high_load(90) = 0.8
Furthermore assume that for a performance index of 5 we have the following degrees of membership:
μ_low_perf(5) = 0.0, μ_medium_perf(5) = 0.6, μ_high_perf(5) = 0.3
Example(7)
In classical logic: if the antecedents are true, then the implications are true In fuzzy logic there are several different approaches; we use min-max inference
For the antecedent of the first rule we get:
0.8 AND (0.0 OR 0.6) = min(0.8, max(0.0, 0.6)) = 0.6
Example(8)
The applicability of a scale-up is also described with the help of a linguistic fuzzy variable:
Example(9)
Using min-max inference the result set is cut off at the degree of the antecedent (for scale-up 0.6):
Example(10)
We use the left-most point of the maximal value to defuzzify the result In this case we say that a scale-up is applicable to a degree of 0.6 Assuming a similar fuzzy set describing the applicability of a scale-out, the second rule yields an applicability of 0.3 Since 0.6 > 0.3, we decide to scale up the service in this case
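A minimal sketch of the rule evaluation with min-max inference for this example; the membership degrees are taken directly from the slides (fuzzification of the raw inputs and the final defuzzification step are not shown).

# Membership degrees from the example (cpu load 90 %, performance index 5);
# in a full controller these would come from fuzzifying the raw measurements.
cpu_load = {'low': 0.0, 'medium': 0.0, 'high': 0.8}
perf_index = {'low': 0.0, 'medium': 0.6, 'high': 0.3}

AND, OR = min, max   # standard fuzzy intersection / union

# Rule 1: IF cpuLoad IS high AND (performanceIndex IS low OR medium) THEN scaleUp
scale_up = AND(cpu_load['high'], OR(perf_index['low'], perf_index['medium']))

# Rule 2: IF cpuLoad IS high AND performanceIndex IS high THEN scaleOut
scale_out = AND(cpu_load['high'], perf_index['high'])

print(scale_up, scale_out)   # 0.6 0.3 -> scale up the service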
Applications
The first application of fuzzy control comes from the work of Mamdani and Assilian (1975) Design of a fuzzy controller for a steam engine The objective was to maintain a constant speed by controlling the pressure on the pistons This was done by adjusting the heat supplied to a boiler
Applications(2)
Since then, a vast number of fuzzy controllers have been developed: Washing machines Video cameras Air conditioners Robot control Underground trains Hydro-electrical power plants
Chapter Summary
Fuzzy controllers have been very successful in commercial products Although critics argue that these applications are successful because they are quite simple (e.g. have small rule base) There have been attempts to merge fuzzy set theory and probability theory, however, there remain many open questions
Chapter 12
Outline
Can machines act intelligently? Can machines really think? Ethics and risks of developing AI
Terminology
Weak AI: the assertion that machines could possibly act intelligently (or act as if they were intelligent) Strong AI: the assertion that machines that do so are actually thinking Opinion of most AI researchers: take the weak AI hypothesis for granted, don't care about the strong AI hypothesis (as long as the programs work)
Weak AI
Some philosophers have tried to prove that AI is impossible Whether it is possible or impossible depends on how it is defined Engineering point of view: finding the best agent program on a given architecture Philosophical point of view: comparing two architectures: human and machine Traditionally posed as the question: can machines think? Unfortunately, there is no unambiguous definition of thinking
Strong AI
Main criticism: even if a machine passes the Turing test, is it actually thinking or just simulating the thinking process? Chinese room problem: a human (who doesn't understand Chinese) is put into a room with sheets of paper and detailed instructions Sheets of paper with Chinese writing are slipped under the door The human looks up in the instructions what to do, paints some characters on a paper, and slips it back From the outside this may seem like an intelligent agent understanding Chinese is at work
Automation
This is not a new problem; it happens every time a new technology is deployed Some people lose their jobs New jobs are created elsewhere Main problem: usually the new jobs demand higher qualifications
Leisure Time
Arthur C. Clarke once wrote that people might face a future of utter boredom There is no risk of that yet; due to integrated computerized systems that run 24/7, people tend to work longer hours Winner-Takes-All Society: Traditional industrial economy: working 10% more results in roughly 10% more profit Fast-paced information age economy: an edge of 10% over a competitor might mean 100% more profit
Uniqueness
AI research might suggest that human capabilities are not that unique after all Mankind has survived similar setbacks before: Copernicus moving the Earth out of the center of the universe Darwin putting Homo sapiens on the level of other species
Privacy Rights
Widespread wiretapping becomes possible Computer systems using language translation, speech recognition, and keyword search already sift through telephone, email and fax traffic There is an ongoing controversial debate about this: Scott McNealy (CEO Sun): "You have zero privacy anyway. Get over it." Louis Brandeis (Judge, 1890): "Privacy is the most comprehensive of all rights . . . the right to one's personality."
Accountability
What is the legal liability of an AI system? Who takes responsibility if something goes wrong? This is magnified when money changes hands: who is liable for any debts made by an intelligent agent? It may also play a role in life and death situations: when a physician uses a medical expert system, who is at fault if the diagnosis is wrong?
Lecture Summary
This lecture can only be seen as a brief introduction to the subject AI has made quite some progress in its short history The final word belongs to Alan Turing: "We can see only a short distance ahead, but we can see that much remains to be done."