Cooperating Intelligent Systems: Learning From Observations Chapter 18, AIMA
Deductive: Derive new rules/facts from already known rules/facts, e.g. (A ⇒ B, B ⇒ C) ⇒ (A ⇒ C).
Inductive: Learn new rules/facts from a data set D.
Unsupervised: No access to a teacher. Instead, the machine must search for "order" and "structure" in the environment.
Inductive learning - example A
[Figure: three example 3×3 input patterns with entries in {−1, 0, +1}, labelled x with f(x) = +1, f(x) = −1, and f(x) = 0 respectively. Etc...]
• f(x) is the target function
• An example is a pair [x, f(x)]
• Learning task: find a hypothesis h such that h(x) ≈ f(x), given a training set of examples D = {[x_i, f(x_i)]}, i = 1, 2, …, N
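As a concrete illustration of this learning task, here is a minimal sketch (my own example, not from the slides): a tiny training set D and a small hypothesis class H, from which we pick the h with the lowest training error. The data and candidate hypotheses are assumptions made for illustration.

# Minimal sketch of inductive learning: pick h from a small hypothesis
# class H so that h(x) approximates f(x) on the training set D.
# Training data and candidate hypotheses are illustrative assumptions.
D = [(-2, 4), (-1, 1), (0, 0), (1, 1), (2, 4)]   # examples [x_i, f(x_i)]

H = {
    "identity": lambda x: x,
    "square":   lambda x: x ** 2,
    "constant": lambda x: 1.0,
}

def training_error(h, data):
    # Mean squared disagreement between h(x) and f(x) over the data.
    return sum((h(x) - fx) ** 2 for x, fx in data) / len(data)

best_name = min(H, key=lambda name: training_error(H[name], D))
print(best_name, training_error(H[best_name], D))   # -> square 0.0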
[Figure: the hypothesis space H and the best hypothesis h_opt(x) ∈ H; the error between h_opt(x) and the target f(x) is the expected generalization error ⟨E_gen⟩. Nested hypothesis spaces H_1 ⊂ H_2 ⊂ H_3 are also shown.]
X^D → C^K (input space → output/category space)
x = (x_1, x_2, …, x_D) ∈ X^D
c = (c_1, c_2, …, c_K) ∈ C^K, e.g. a 0/1 indicator vector such as (0, 1, …, 0)
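To make the notation concrete, a small sketch (the dimensions and values are made up): an input vector x with D components and its category encoded as a K-dimensional 0/1 indicator vector c.

# Illustrative example of one input/output pair: x lives in a
# D-dimensional input space, c is a K-dimensional 0/1 indicator vector.
D_dim, K = 4, 3                          # assumed dimensions
x = [0.2, -1.3, 0.7, 4.0]                # x = (x_1, ..., x_D)
category = 1                             # this example belongs to class 2 (0-based index 1)
c = [1 if k == category else 0 for k in range(K)]   # -> [0, 1, 0]
print(x, c)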
Example: Robot color vision
f(x) = g(x) + ε
x: observed input
f(x): observed output
g(x): true underlying function
ε: i.i.d. noise process with zero mean
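A small sketch of this observation model, with an assumed true function g and zero-mean i.i.d. Gaussian noise (both choices are mine, for illustration only):

import random

def g(x):
    # Assumed true underlying function (illustrative choice).
    return 0.5 * x + 1.0

def f_observed(x, sigma=0.1):
    # Observed output f(x) = g(x) + eps, where eps is i.i.d. with zero mean.
    return g(x) + random.gauss(0.0, sigma)

samples = [(x, f_observed(x)) for x in range(10)]
print(samples[:3])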
Example: Predict price for cotton futures
Input: past history of closing prices and trading volume
Output: predicted closing price
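One common way to turn this into supervised examples [x, f(x)] is a sliding window over the history; the window length, features and numbers below are assumptions for illustration, not from the slides.

# Sketch: build supervised examples from a price/volume history.
# Each input x holds the last `window` closing prices and volumes;
# the target f(x) is the next closing price. The data is made up.
closes  = [71.2, 71.5, 70.9, 71.8, 72.3, 72.0, 72.9]
volumes = [11000, 9000, 13000, 10000, 12000, 11000, 14000]

def make_examples(closes, volumes, window=3):
    examples = []
    for t in range(window, len(closes)):
        x = closes[t - window:t] + volumes[t - window:t]   # past history
        fx = closes[t]                                      # next closing price
        examples.append((x, fx))
    return examples

print(make_examples(closes, volumes)[0])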
Decision trees
[Figure: a binary decision tree; each internal node tests a single variable (e.g. x2 > β ?, x2 > γ ?) and branches on yes/no.]
• Splits usually on a single variable
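Such a tree is simply a nest of single-variable tests. A minimal sketch of how the pictured tree classifies a point; the root test x1 > α, the thresholds and the leaf labels are assumptions added for illustration.

# Sketch of a small decision tree as nested single-variable tests.
# Thresholds (alpha, beta, gamma) and leaf labels are illustrative.
ALPHA, BETA, GAMMA = 0.0, 1.0, -1.0

def classify(x1, x2):
    if x1 > ALPHA:                                       # root test
        return "class A" if x2 > BETA else "class B"     # x2 > beta ?
    else:
        return "class A" if x2 > GAMMA else "class B"    # x2 > gamma ?

print(classify(0.5, 2.0))   # -> class A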
The wait@restaurant decision tree
This is our true function.
Can we learn this tree from examples?
Inductive learning of decision tree
T = True, F = False. The 12 training examples contain 6 True and 6 False.
Entropy = −(6/12) log(6/12) − (6/12) log(6/12) = 0.30
(base-10 logarithms are used, so an even 6/6 split gives log 2 ≈ 0.30)
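The same computation as a small helper (my sketch, not code from the slides); note the base-10 logarithm, which is what makes the even 6/6 split come out as 0.30.

import math

def entropy(counts):
    # Entropy of a label distribution, base-10 logs; 0*log 0 is taken as 0.
    n = sum(counts)
    return -sum((c / n) * math.log10(c / n) for c in counts if c > 0)

print(round(entropy([6, 6]), 2))   # 6 True, 6 False -> 0.3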
Decision tree learning example
Alternate?
Yes: 3 T, 3 F    No: 3 T, 3 F
Entropy = (6/12)[−(3/6) log(3/6) − (3/6) log(3/6)] + (6/12)[−(3/6) log(3/6) − (3/6) log(3/6)] = 0.30
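The weighted average entropy after the split can be checked numerically; a quick self-contained sketch for the Alternate split above (both branches contain 3 T, 3 F):

import math

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log10(c / n) for c in counts if c > 0)

branches = [(3, 3), (3, 3)]   # Yes: 3 T, 3 F  /  No: 3 T, 3 F
n_total = 12
after_split = sum(sum(b) / n_total * entropy(b) for b in branches)
print(round(after_split, 2))  # -> 0.3, so no entropy decrease for Alternate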
Bar?
Yes: 3 T, 3 F    No: 3 T, 3 F
Entropy = (6/12)[−(3/6) log(3/6) − (3/6) log(3/6)] + (6/12)[−(3/6) log(3/6) − (3/6) log(3/6)] = 0.30
Fri/Sat?
Yes: 2 T, 3 F    No: 4 T, 3 F
Entropy = (5/12)[−(2/5) log(2/5) − (3/5) log(3/5)] + (7/12)[−(4/7) log(4/7) − (3/7) log(3/7)] = 0.29
Hungry?
Yes: 5 T, 2 F    No: 1 T, 4 F
Entropy = (7/12)[−(5/7) log(5/7) − (2/7) log(2/7)] + (5/12)[−(1/5) log(1/5) − (4/5) log(4/5)] = 0.24
Raining?
Yes: 2 T, 2 F    No: 4 T, 4 F
Entropy = (4/12)[−(2/4) log(2/4) − (2/4) log(2/4)] + (8/12)[−(4/8) log(4/8) − (4/8) log(4/8)] = 0.30
Reservation?
Yes: 3 T, 2 F    No: 3 T, 4 F
Entropy = (5/12)[−(3/5) log(3/5) − (2/5) log(2/5)] + (7/12)[−(3/7) log(3/7) − (4/7) log(4/7)] = 0.29
Patrons?
None: 2 F    Some: 4 T    Full: 2 T, 4 F
Entropy = (2/12)[−(0/2) log(0/2) − (2/2) log(2/2)] + (4/12)[−(4/4) log(4/4) − (0/4) log(0/4)] + (6/12)[−(2/6) log(2/6) − (4/6) log(4/6)] = 0.14
(with the convention 0·log 0 = 0)
Price?
$: 3 T, 3 F    $$: 2 T    $$$: 1 T, 3 F
Entropy = (6/12)[−(3/6) log(3/6) − (3/6) log(3/6)] + (2/12)[−(2/2) log(2/2) − (0/2) log(0/2)] + (4/12)[−(1/4) log(1/4) − (3/4) log(3/4)] = 0.23
Type?
French: 1 T, 1 F    Italian: 1 T, 1 F    Thai: 2 T, 2 F    Burger: 2 T, 2 F
Entropy = (2/12)[−(1/2) log(1/2) − (1/2) log(1/2)] + (2/12)[−(1/2) log(1/2) − (1/2) log(1/2)] + (4/12)[−(2/4) log(2/4) − (2/4) log(2/4)] + (4/12)[−(2/4) log(2/4) − (2/4) log(2/4)] = 0.30
Entropy decrease = 0.30 – 0.30 = 0
Decision tree learning example
Est. waiting time?
0-10: 4 T, 2 F    10-30: 1 T, 1 F    30-60: 1 T, 1 F    > 60: 2 F
Entropy = (6/12)[−(4/6) log(4/6) − (2/6) log(2/6)] + (2/12)[−(1/2) log(1/2) − (1/2) log(1/2)] + (2/12)[−(1/2) log(1/2) − (1/2) log(1/2)] + (2/12)[−(0/2) log(0/2) − (2/2) log(2/2)] = 0.24
Entropy decrease = 0.30 – 0.24 = 0.06
Decision tree learning example
Largest entropy decrease (0.16) is achieved by splitting on Patrons.
[Figure: root node Patrons? with branches None (2 F), Some (4 T), and Full (2 T, 4 F); the Full branch is split further on some attribute X?.]
Continue like this, making new splits, always purifying nodes.
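The whole attribute-selection step can be written compactly. A sketch (not from the slides) that recomputes the entropy decrease for every attribute, using the (True, False) branch counts listed above and base-10 entropy; it confirms that Patrons gives the largest decrease, about 0.16.

import math

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log10(c / n) for c in counts if c > 0)

def split_entropy(branches, n_total=12):
    # Weighted average entropy after splitting into the given branches.
    return sum(sum(b) / n_total * entropy(b) for b in branches)

# (True, False) counts per branch, taken from the slides above.
splits = {
    "Alternate":         [(3, 3), (3, 3)],
    "Bar":               [(3, 3), (3, 3)],
    "Fri/Sat":           [(2, 3), (4, 3)],
    "Hungry":            [(5, 2), (1, 4)],
    "Raining":           [(2, 2), (4, 4)],
    "Reservation":       [(3, 2), (3, 4)],
    "Patrons":           [(0, 2), (4, 0), (2, 4)],
    "Price":             [(3, 3), (2, 0), (1, 3)],
    "Type":              [(1, 1), (1, 1), (2, 2), (2, 2)],
    "Est. waiting time": [(4, 2), (1, 1), (1, 1), (0, 2)],
}

root_entropy = entropy([6, 6])   # 0.30
for name, branches in splits.items():
    decrease = root_entropy - split_entropy(branches)
    print(f"{name}: {decrease:.2f}")
# Patrons shows the largest decrease (0.16) and is chosen as the root split.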
Decision tree learning example
[Figure: induced tree (from examples)]
[Figure: true tree]
[Figure: induced tree (from examples)]
E_gen ≈ E_val
[Figure: the data set is split into a training part D_train and a held-out validation part D_val; several such splits are shown. The validation set is used to estimate the generalization error.]
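A minimal sketch of the idea E_gen ≈ E_val (the data, model and split below are made-up assumptions): hold out part of the data as D_val, fit on D_train only, and report the validation error as the estimate of the generalization error.

import random

# Made-up data: noisy observations of a linear target function.
data = [(x, 2.0 * x + random.gauss(0.0, 0.1)) for x in range(1, 21)]
random.shuffle(data)

D_train, D_val = data[:15], data[15:]    # hold out part of the data as D_val

# Fit a one-parameter hypothesis h(x) = w*x on D_train (least squares).
w = sum(x * y for x, y in D_train) / sum(x * x for x, _ in D_train)

def mean_squared_error(dataset):
    return sum((w * x - y) ** 2 for x, y in dataset) / len(dataset)

print("E_train =", round(mean_squared_error(D_train), 4))
print("E_val   =", round(mean_squared_error(D_val), 4))   # estimate of E_gen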
[Figure: instance space X with the regions labelled by the target f and a hypothesis h; the shaded region is where f and h disagree. Image adapted from F. Hoffmann @ KTH]
Probability for bad hypothesis
Suppose we have a bad hypothesis h with error(h) > ε.
What is the probability that it is consistent with N samples?
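The usual argument (a sketch of the standard reasoning, not text from the slides): if error(h) > ε, then a single randomly drawn example agrees with h with probability at most 1 − ε, so for N independently drawn examples

P(h is consistent with all N examples) ≤ (1 − ε)^N ≤ e^(−εN).

Applying a union bound over all hypotheses in H then bounds the probability that any bad hypothesis survives the training set by |H|·e^(−εN).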