
Cooperating Intelligent Systems

Learning from observations


Chapter 18, AIMA
Two types of learning in AI

• Deductive: Deduce rules/facts from already known rules/facts. (We have already dealt with this.)

((A ⇒ B) ∧ (B ⇒ C)) ⇒ (A ⇒ C)

• Inductive: Learn new rules/facts from a data set D.

D = {x(n), y(n)}, n = 1…N ⇒ (A ⇒ C)

We will be dealing with the latter, inductive learning, now.

Two types of inductive learning

• Supervised: The machine has access to a teacher who corrects it.

• Unsupervised: No access to a teacher. Instead, the machine must search for “order” and “structure” in the environment.
Inductive learning – example A

Three training examples: 9-dimensional input vectors with their target values.

x = (−1, −1, +1, 0, +1, 0, +1, 0, 0)ᵀ, f(x) = +1
x = (−1, −1, 0, 0, +1, +1, +1, 0, 0)ᵀ, f(x) = −1
x = (−1, −1, 0, 0, +1, +1, 0, +1, 0)ᵀ, f(x) = 0
Etc...
• f(x) is the target function
• An example is a pair [x, f(x)]
• Learning task: find a hypothesis h such that h(x) ≈ f(x) given a training set of examples D = {[xi, f(xi)]}, i = 1, 2, …, N

Inspired by a slide from V. Pavlovic
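
As a minimal sketch (Python; the code is illustrative, not from the slides), the training set D from example A can be written down directly, together with the most naive hypothesis: one that simply memorizes D and abstains on anything unseen.

```python
# Sketch: the training set D = {[x_i, f(x_i)]} from example A.
D = [
    ((-1, -1, +1, 0, +1, 0, +1, 0, 0), +1),
    ((-1, -1, 0, 0, +1, +1, +1, 0, 0), -1),
    ((-1, -1, 0, 0, +1, +1, 0, +1, 0), 0),
]

def h(x):
    """Trivial memory-based hypothesis: return the stored label if x was
    seen during training, otherwise abstain (None)."""
    for xi, fxi in D:
        if xi == x:
            return fxi
    return None  # no generalization at all -- the point of the example

assert h((-1, -1, +1, 0, +1, 0, +1, 0, 0)) == +1
```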


Inductive learning – example B

[Figure: several candidate fits to the same data points: a consistent linear fit, an inconsistent linear fit, consistent 6th- and 7th-order polynomial fits, and a consistent sinusoidal fit.]

• Construct h so that it agrees with f.


• The hypothesis h is consistent if it agrees with f on all observations.
• Ockham’s razor: Select the simplest consistent hypothesis.
• How do we achieve good generalization?
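
A small numerical sketch of consistency (Python/NumPy; the data and target function are made up): with 8 observations, a 7th-order polynomial is always consistent, while a line usually is not, yet Ockham’s razor warns that the consistent fit need not generalize better.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 8)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(8)  # assumed f + noise

def is_consistent(h, x, y, tol=1e-6):
    """h is consistent if it agrees with the observations everywhere."""
    return np.all(np.abs(h(x) - y) < tol)

line = np.poly1d(np.polyfit(x, y, 1))   # simple, usually inconsistent
poly7 = np.poly1d(np.polyfit(x, y, 7))  # 8 points -> degree 7 interpolates

print(is_consistent(line, x, y))   # typically False
print(is_consistent(poly7, x, y))  # True: consistent, but likely overfit
```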


Inductive learning – example C
Sometimes a consistent hypothesis is worse than an inconsistent one.

Example from V. Pavlovic @ Rutgers


The idealized inductive learning problem

Find an appropriate hypothesis space H and find h(x) ∈ H with minimum “distance” to f(x) (the “error”).

[Figure: our hypothesis space H, containing hopt(x) ∈ H; the target f(x) lies outside H, separated from hopt by the error.]

The learning problem is realizable if f(x) ∈ H.


The real inductive learning problem

Find an appropriate hypothesis space H and minimize the expected distance ⟨Egen⟩ to f(x) (the “generalization error”).

[Figure: a set of possible targets {f(x)} and the corresponding set of selected hypotheses {hopt(x)} in H.]

Data is never noise-free and never available in infinite amounts, so we get variation in data and model. The generalization error is a function of both the training data and the hypothesis selection method.
Hypothesis spaces (examples)

H1 = {a + bx} (linear)
H2 = {a + bx + cx²} (quadratic)
H3 = {a + bx + cx² + dx³} (cubic)

H1 ⊂ H2 ⊂ H3

[Figure: nested sets H1 ⊂ H2 ⊂ H3, with the target f(x) = 0.5 + x + x² + 6x³ lying in H3 but outside H1 and H2.]
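
The nesting can be checked numerically; a sketch (Python/NumPy) that samples the target above and fits one hypothesis from each space:

```python
import numpy as np

# Sample the target f(x) = 0.5 + x + x^2 + 6x^3 (no noise) and fit
# hypotheses from H1 (linear), H2 (quadratic), H3 (cubic).
f = lambda x: 0.5 + x + x**2 + 6 * x**3
x = np.linspace(-1, 1, 20)
y = f(x)

for degree, name in [(1, "H1"), (2, "H2"), (3, "H3")]:
    h = np.poly1d(np.polyfit(x, y, degree))
    max_err = np.max(np.abs(h(x) - y))
    print(f"{name}: max training error = {max_err:.2e}")
# Only H3 drives the error to (numerically) zero: the problem is
# realizable in H3 but not in H1 or H2.
```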
Learning problems

• The hypothesis takes as input a set of attributes x and returns a ”decision” h(x): the predicted (estimated) output value for the input x.

• Discrete-valued function ⇒ classification

• Continuous-valued function ⇒ regression
Classification

Sort the input into one out of several classes:

X^D → C^K
(input space → output/category space)

x = (x1, x2, …, xD)ᵀ ∈ X^D,  c = (c1, c2, …, cK)ᵀ = (0, 1, …, 0)ᵀ ∈ C^K

The output is coded as a vector with a single 1 in the position of the assigned class (“one-out-of-K” coding).
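
A short sketch (Python/NumPy; the helper name is invented) of this one-out-of-K output coding:

```python
import numpy as np

def one_hot(k, K):
    """Encode class index k (0-based) as a K-dimensional category vector."""
    c = np.zeros(K)
    c[k] = 1.0
    return c

# Three classes, e.g. red / blue / yellow pieces:
print(one_hot(1, 3))                    # [0. 1. 0.]
print(int(np.argmax(one_hot(1, 3))))    # decode back to the class index: 1
```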
Example: Robot color vision

Classify the Lego pieces into red, blue, and yellow.


Classify white balls, black sideboard, and green carpet.
Input = pixel in image, output = category
Regression

The “fixed regressor model”:

f(x) = g(x) + ε

x: observed input
f(x): observed output
g(x): true underlying function
ε: i.i.d. noise process with zero mean
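
Sampling from this model is straightforward; a sketch (Python/NumPy) where g and the noise level σ are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

g = lambda x: np.sin(x)            # assumed true underlying function
sigma = 0.2                        # assumed noise standard deviation

x = rng.uniform(0, 2 * np.pi, 50)            # observed inputs
eps = sigma * rng.standard_normal(x.shape)   # i.i.d. zero-mean noise
f_obs = g(x) + eps                 # observed outputs: f(x) = g(x) + eps
```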
Example: Predict price for cotton futures

Input: Past history of closing prices, and trading volume.

Output: Predicted closing price.
Decision trees

• “Divide and conquer”: Split data into smaller and smaller subsets.

• Splits are usually on a single variable.

x1 > α?
├── yes: x2 > β?
└── no:  x2 > γ?
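
A minimal sketch (Python; the node layout and thresholds are invented) of such a tree and of prediction by recursive splitting:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    """Internal node: test x[var] > threshold; leaf: value is set."""
    var: int = 0
    threshold: float = 0.0
    yes: Optional["Node"] = None
    no: Optional["Node"] = None
    value: Optional[str] = None   # set only on leaves

def predict(node, x):
    if node.value is not None:          # reached a leaf
        return node.value
    branch = node.yes if x[node.var] > node.threshold else node.no
    return predict(branch, x)

# The tree from the slide: root tests x1 > alpha, children test x2.
alpha, beta, gamma = 0.5, 0.3, 0.7      # made-up thresholds
tree = Node(var=0, threshold=alpha,
            yes=Node(var=1, threshold=beta,
                     yes=Node(value="yes"), no=Node(value="no")),
            no=Node(var=1, threshold=gamma,
                    yes=Node(value="yes"), no=Node(value="no")))
print(predict(tree, [0.9, 0.1]))  # -> "no"
```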
The wait@restaurant decision tree
This is our true function.
Can we learn this tree from examples?
Inductive learning of decision tree

• Simplest: Construct a decision tree with one leaf for every example = memory-based learning. Not very good generalization.
• Advanced: Split on each variable so that the purity of each split increases (i.e., nodes move towards either only yes or only no).
• Purity is measured, e.g., with entropy.

Entropy = −P(yes) ln[P(yes)] − P(no) ln[P(no)]

General form: Entropy = −Σᵢ P(vᵢ) ln[P(vᵢ)]

(Any logarithm base works and gives the same ranking of splits; the worked examples below evaluate with base-10 logarithms, which is why a 50/50 split gives 0.30 rather than ln 2 ≈ 0.69.)

The entropy is maximal when all possibilities are equally likely.

The goal of the decision tree is to decrease the entropy in each node.

Entropy is zero in a pure ”yes” node (or pure ”no” node).

Entropy is a measure of ”order” in a system.

The second law of thermodynamics: elements in a closed system tend to seek their most probable distribution; in a closed system, entropy always increases.
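
The formula transcribes directly to code; a sketch (Python) using base-10 logarithms to reproduce the slide’s numbers:

```python
import math

def entropy(counts, base=10):
    """Entropy of a node from its class counts: -sum p_i * log(p_i).
    Terms with p = 0 contribute 0 by convention."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total, base)
                for c in counts if c > 0)

print(round(entropy([6, 6]), 2))  # 50/50 split -> 0.30 (maximal for 2 classes)
print(entropy([6, 0]))            # pure node -> 0.0
```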
Decision tree learning algorithm

• Create pure nodes whenever possible.

• If pure nodes are not possible, choose the split that leads to the largest decrease in entropy.
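
Building on the entropy function above, a sketch (Python; the data layout is an assumption) of choosing the split with the largest entropy decrease:

```python
def information_gain(parent_counts, child_counts, base=10):
    """Entropy decrease from splitting a node into the given children.
    child_counts: list of per-branch class-count lists."""
    n = sum(parent_counts)
    children = sum(sum(c) / n * entropy(c, base) for c in child_counts)
    return entropy(parent_counts, base) - children

def best_split(splits, parent_counts):
    """splits: dict mapping attribute -> list of per-value class counts."""
    return max(splits, key=lambda a: information_gain(parent_counts, splits[a]))
```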
Decision tree learning example
10 attributes:
1. Alternate: Is there a suitable alternative restaurant
nearby? {yes,no}
2. Bar: Is there a bar to wait in? {yes,no}
3. Fri/Sat: Is it Friday or Saturday? {yes,no}
4. Hungry: Are you hungry? {yes,no}
5. Patrons: How many are seated in the restaurant? {none,
some, full}
6. Price: Price level {$,$$,$$$}
7. Raining: Is it raining? {yes,no}
8. Reservation: Did you make a reservation? {yes,no}
9. Type: Type of food {French,Italian,Thai,Burger}
10. Wait: {0-10 min, 10-30 min, 30-60 min, >60 min}
Decision tree learning example

The full training set: 6 True, 6 False (T = True, F = False).

Entropy = −(6/12) log₁₀(6/12) − (6/12) log₁₀(6/12) = 0.30
Decision tree learning example

Split on Alternate?
Yes branch: 3 T, 3 F. No branch: 3 T, 3 F.

Entropy = (6/12)[−(3/6) log₁₀(3/6) − (3/6) log₁₀(3/6)] + (6/12)[−(3/6) log₁₀(3/6) − (3/6) log₁₀(3/6)] = 0.30

Entropy decrease = 0.30 − 0.30 = 0


Decision tree learning example

Split on Bar?
Yes branch: 3 T, 3 F. No branch: 3 T, 3 F.

Entropy = (6/12)[−(3/6) log₁₀(3/6) − (3/6) log₁₀(3/6)] + (6/12)[−(3/6) log₁₀(3/6) − (3/6) log₁₀(3/6)] = 0.30

Entropy decrease = 0.30 − 0.30 = 0


Decision tree learning example

Split on Sat/Fri?
Yes branch: 2 T, 3 F. No branch: 4 T, 3 F.

Entropy = (5/12)[−(2/5) log₁₀(2/5) − (3/5) log₁₀(3/5)] + (7/12)[−(4/7) log₁₀(4/7) − (3/7) log₁₀(3/7)] = 0.29

Entropy decrease = 0.30 − 0.29 = 0.01


Decision tree learning example

Split on Hungry?
Yes branch: 5 T, 2 F. No branch: 1 T, 4 F.

Entropy = (7/12)[−(5/7) log₁₀(5/7) − (2/7) log₁₀(2/7)] + (5/12)[−(1/5) log₁₀(1/5) − (4/5) log₁₀(4/5)] = 0.24

Entropy decrease = 0.30 − 0.24 = 0.06


Decision tree learning example

Split on Raining?
Yes branch: 2 T, 2 F. No branch: 4 T, 4 F.

Entropy = (4/12)[−(2/4) log₁₀(2/4) − (2/4) log₁₀(2/4)] + (8/12)[−(4/8) log₁₀(4/8) − (4/8) log₁₀(4/8)] = 0.30

Entropy decrease = 0.30 − 0.30 = 0


Decision tree learning example

Split on Reservation?
Yes branch: 3 T, 2 F. No branch: 3 T, 4 F.

Entropy = (5/12)[−(3/5) log₁₀(3/5) − (2/5) log₁₀(2/5)] + (7/12)[−(3/7) log₁₀(3/7) − (4/7) log₁₀(4/7)] = 0.29

Entropy decrease = 0.30 − 0.29 = 0.01


Decision tree learning example

Split on Patrons?
None branch: 2 F. Some branch: 4 T. Full branch: 2 T, 4 F.

Entropy = (2/12)[−(0/2) log₁₀(0/2) − (2/2) log₁₀(2/2)] + (4/12)[−(4/4) log₁₀(4/4) − (0/4) log₁₀(0/4)] + (6/12)[−(2/6) log₁₀(2/6) − (4/6) log₁₀(4/6)] = 0.14
(using the convention 0 · log 0 = 0)

Entropy decrease = 0.30 − 0.14 = 0.16


Decision tree learning example

Split on Price?
$ branch: 3 T, 3 F. $$ branch: 2 T. $$$ branch: 1 T, 3 F.

Entropy = (6/12)[−(3/6) log₁₀(3/6) − (3/6) log₁₀(3/6)] + (2/12)[−(2/2) log₁₀(2/2) − (0/2) log₁₀(0/2)] + (4/12)[−(1/4) log₁₀(1/4) − (3/4) log₁₀(3/4)] = 0.23

Entropy decrease = 0.30 − 0.23 = 0.07


Decision tree learning example

Split on Type?
French branch: 1 T, 1 F. Italian branch: 1 T, 1 F. Thai branch: 2 T, 2 F. Burger branch: 2 T, 2 F.

Entropy = (2/12)[−(1/2) log₁₀(1/2) − (1/2) log₁₀(1/2)] + (2/12)[−(1/2) log₁₀(1/2) − (1/2) log₁₀(1/2)] + (4/12)[−(2/4) log₁₀(2/4) − (2/4) log₁₀(2/4)] + (4/12)[−(2/4) log₁₀(2/4) − (2/4) log₁₀(2/4)] = 0.30

Entropy decrease = 0.30 − 0.30 = 0
Decision tree learning example

Split on Est. waiting time?
0–10 min branch: 4 T, 2 F. 10–30 min branch: 1 T, 1 F. 30–60 min branch: 1 T, 1 F. >60 min branch: 2 F.

Entropy = (6/12)[−(4/6) log₁₀(4/6) − (2/6) log₁₀(2/6)] + (2/12)[−(1/2) log₁₀(1/2) − (1/2) log₁₀(1/2)] + (2/12)[−(1/2) log₁₀(1/2) − (1/2) log₁₀(1/2)] + (2/12)[−(0/2) log₁₀(0/2) − (2/2) log₁₀(2/2)] = 0.24

Entropy decrease = 0.30 − 0.24 = 0.06
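
All ten candidate splits can be checked in one go; a sketch (Python) that reuses entropy() and information_gain() from above, with the branch class counts [T, F] copied from the slides:

```python
# Branch class counts [True, False] per attribute value, from the slides.
splits = {
    "Alternate":   [[3, 3], [3, 3]],
    "Bar":         [[3, 3], [3, 3]],
    "Sat/Fri":     [[2, 3], [4, 3]],
    "Hungry":      [[5, 2], [1, 4]],
    "Raining":     [[2, 2], [4, 4]],
    "Reservation": [[3, 2], [3, 4]],
    "Patrons":     [[0, 2], [4, 0], [2, 4]],
    "Price":       [[3, 3], [2, 0], [1, 3]],
    "Type":        [[1, 1], [1, 1], [2, 2], [2, 2]],
    "Wait":        [[4, 2], [1, 1], [1, 1], [0, 2]],
}
root = [6, 6]
for attr, children in splits.items():
    print(f"{attr}: decrease = {information_gain(root, children):.2f}")
# Patrons gives the largest decrease, 0.16.
```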
Decision tree learning example

The largest entropy decrease (0.16) is achieved by splitting on Patrons.

Patrons?
None branch: 2 F (pure). Some branch: 4 T (pure). Full branch: 2 T, 4 F (split this node again on some attribute X?).

Continue like this, making new splits, always purifying nodes.
Decision tree learning example
Induced tree (from examples)
Decision tree learning example
True tree
Decision tree learning example
Induced tree (from examples)

Cannot make it more complex than what the data supports.
How do we know it is correct?

How do we know that h ≈ f? (Hume’s Problem of Induction)

– Try h on a new test set of examples (cross-validation)

...and assume the ”principle of uniformity”, i.e. the result we get on this test data should be indicative of results on future data. Causality is constant.

Inspired by a slide by V. Pavlovic


[Figure: learning curve for the decision tree algorithm on 100 randomly generated examples in the restaurant domain. The graph summarizes 20 trials.]
Cross-validation

Use a “validation set”:

E_gen ≈ E_val

Split your data set into two parts, one (D_train) for training your model and the other (D_val) for validating your model. The error on the validation data is called the “validation error” (E_val).
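
A sketch of such a holdout split (Python/NumPy; the 80/20 ratio and the error measures in the comments are assumptions, not from the slides):

```python
import numpy as np

def holdout_split(X, y, frac_train=0.8, seed=0):
    """Shuffle the data and split it into D_train and D_val."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n = int(frac_train * len(X))
    tr, va = idx[:n], idx[n:]
    return X[tr], y[tr], X[va], y[va]

# E_val = mean error of the trained hypothesis h on D_val, e.g.:
# E_val = np.mean(h(X_val) != y_val)        # classification
# E_val = np.mean((h(X_val) - y_val) ** 2)  # regression
```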
K-Fold Cross-validation

More accurate than using only one validation set:

E_gen ≈ ⟨E_val⟩ = (1/K) Σ_{k=1…K} E_val(k)

[Figure: the data set is split into K parts; each part serves once as D_val while the remaining parts form D_train, yielding E_val(1), E_val(2), E_val(3), …]
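
A corresponding sketch (Python/NumPy), where train_and_eval is a hypothetical stand-in for training a model on D_train and returning its error on D_val:

```python
import numpy as np

def k_fold_cv(X, y, train_and_eval, K=3, seed=0):
    """Average validation error over K folds: (1/K) * sum_k E_val(k)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), K)
    errors = []
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        errors.append(train_and_eval(X[train], y[train], X[val], y[val]))
    return np.mean(errors)
```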


PAC

• Any hypothesis that is consistent with a sufficiently large set of training (and test) examples is unlikely to be seriously wrong; it is probably approximately correct (PAC).

• What is the relationship between the generalization error and the number of samples needed to achieve this generalization error?
The error

X = the set of all possible examples (instance space).
D = the distribution of these examples.
H = the hypothesis space (h ∈ H).
N = the number of training data.

error(h) = P[h(x) ≠ f(x) | x drawn from D]

[Figure: instance space X with hypothesis h and target f; the shaded region is where f and h disagree.]

Image adapted from F. Hoffmann @ KTH
Probability for bad hypothesis

Suppose we have a bad hypothesis h with error(h) > ε. What is the probability that it is consistent with N samples?

• Probability of being inconsistent with one sample = error(h) > ε.
• Probability of being consistent with one sample = 1 − error(h) < 1 − ε.
• Probability of being consistent with N independently drawn samples < (1 − ε)^N.
Probability for bad hypothesis

What is the probability that the set H_bad of bad hypotheses with error(h) > ε contains a consistent hypothesis?

P(h consistent ∧ error(h) > ε) ≤ |H_bad| (1 − ε)^N ≤ |H| (1 − ε)^N

If we want this to be less than some constant δ, then

|H| (1 − ε)^N < δ ⇒ ln|H| + N ln(1 − ε) < ln δ

and hence

N > (ln|H| − ln δ) / (−ln(1 − ε)) ≈ (ln|H| − ln δ) / ε

Don’t expect to learn very well if H is large.
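
The bound is easy to evaluate; a sketch (Python) where the example numbers for |H|, ε and δ are made up (for n boolean attributes there are |H| = 2^(2^n) boolean functions):

```python
import math

def pac_samples(H_size, eps, delta):
    """Samples sufficient so that a consistent hypothesis has
    error > eps with probability at most delta: N > (ln|H| - ln delta)/eps."""
    return math.ceil((math.log(H_size) - math.log(delta)) / eps)

# Example: boolean functions of n = 5 attributes, |H| = 2^(2^5) = 2^32.
print(pac_samples(2 ** 32, eps=0.1, delta=0.05))  # ~252 samples
```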
How do we make learning work?
• Use simple hypotheses
– Always start with the simple ones first
• Constrain H with priors
– Do we know something about the domain?
– Do we have reasonable a priori beliefs on
parameters?
• Use many observations
– Easy to say...
• Cross-validation...
